summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2022-06-14xsk: Fix possible crash when multiple sockets are createdMaciej Fijalkowski
commit ba3beec2ec1d3b4fd8672ca6e781dac4b3267f6e upstream. Fix a crash that happens if an Rx only socket is created first, then a second socket is created that is Tx only and bound to the same umem as the first socket and also the same netdev and queue_id together with the XDP_SHARED_UMEM flag. In this specific case, the tx_descs array page pool was not created by the first socket as it was an Rx only socket. When the second socket is bound it needs this tx_descs array of this shared page pool as it has a Tx component, but unfortunately it was never allocated, leading to a crash. Note that this array is only used for zero-copy drivers using the batched Tx APIs, currently only ice and i40e. [ 5511.150360] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 5511.158419] #PF: supervisor write access in kernel mode [ 5511.164472] #PF: error_code(0x0002) - not-present page [ 5511.170416] PGD 0 P4D 0 [ 5511.173347] Oops: 0002 [#1] PREEMPT SMP PTI [ 5511.178186] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 5.18.0-rc1+ #97 [ 5511.187245] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016 [ 5511.198418] RIP: 0010:xsk_tx_peek_release_desc_batch+0x198/0x310 [ 5511.205375] Code: c0 83 c6 01 84 c2 74 6d 8d 46 ff 23 07 44 89 e1 48 83 c0 14 48 c1 e1 04 48 c1 e0 04 48 03 47 10 4c 01 c1 48 8b 50 08 48 8b 00 <48> 89 51 08 48 89 01 41 80 bd d7 00 00 00 00 75 82 48 8b 19 49 8b [ 5511.227091] RSP: 0018:ffffc90000003dd0 EFLAGS: 00010246 [ 5511.233135] RAX: 0000000000000000 RBX: ffff88810c8da600 RCX: 0000000000000000 [ 5511.241384] RDX: 000000000000003c RSI: 0000000000000001 RDI: ffff888115f555c0 [ 5511.249634] RBP: ffffc90000003e08 R08: 0000000000000000 R09: ffff889092296b48 [ 5511.257886] R10: 0000ffffffffffff R11: ffff889092296800 R12: 0000000000000000 [ 5511.266138] R13: ffff88810c8db500 R14: 0000000000000040 R15: 0000000000000100 [ 5511.274387] FS: 0000000000000000(0000) GS:ffff88903f800000(0000) knlGS:0000000000000000 [ 5511.283746] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5511.290389] CR2: 0000000000000008 CR3: 00000001046e2001 CR4: 00000000003706f0 [ 5511.298640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5511.306892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5511.315142] Call Trace: [ 5511.317972] <IRQ> [ 5511.320301] ice_xmit_zc+0x68/0x2f0 [ice] [ 5511.324977] ? ktime_get+0x38/0xa0 [ 5511.328913] ice_napi_poll+0x7a/0x6a0 [ice] [ 5511.333784] __napi_poll+0x2c/0x160 [ 5511.337821] net_rx_action+0xdd/0x200 [ 5511.342058] __do_softirq+0xe6/0x2dd [ 5511.346198] irq_exit_rcu+0xb5/0x100 [ 5511.350339] common_interrupt+0xa4/0xc0 [ 5511.354777] </IRQ> [ 5511.357201] <TASK> [ 5511.359625] asm_common_interrupt+0x1e/0x40 [ 5511.364466] RIP: 0010:cpuidle_enter_state+0xd2/0x360 [ 5511.370211] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 e9 00 7b ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 72 02 00 00 31 ff e8 02 0c 80 ff fb 45 85 f6 <0f> 88 11 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49 [ 5511.391921] RSP: 0018:ffffffff82a03e60 EFLAGS: 00000202 [ 5511.397962] RAX: ffff88903f800000 RBX: 0000000000000001 RCX: 000000000000001f [ 5511.406214] RDX: 0000000000000000 RSI: ffffffff823400b9 RDI: ffffffff8234c046 [ 5511.424646] RBP: ffff88810a384800 R08: 000005032a28c046 R09: 0000000000000008 [ 5511.443233] R10: 000000000000000b R11: 0000000000000006 R12: ffffffff82bcf700 [ 5511.461922] R13: 000005032a28c046 R14: 0000000000000001 R15: 0000000000000000 [ 5511.480300] cpuidle_enter+0x29/0x40 [ 5511.494329] do_idle+0x1c7/0x250 [ 5511.507610] cpu_startup_entry+0x19/0x20 [ 5511.521394] start_kernel+0x649/0x66e [ 5511.534626] secondary_startup_64_no_verify+0xc3/0xcb [ 5511.549230] </TASK> Detect such case during bind() and allocate this memory region via newly introduced xp_alloc_tx_descs(). Also, use kvcalloc instead of kcalloc as for other buffer pool allocations, so that it matches the kvfree() from xp_destroy(). Fixes: d1bc532e99be ("i40e: xsk: Move tmp desc array from driver to pool") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220425153745.481322-1-maciej.fijalkowski@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14i40e: xsk: Move tmp desc array from driver to poolMagnus Karlsson
[ Upstream commit d1bc532e99becf104635ed4da6fefa306f452321 ] Move desc_array from the driver to the pool. The reason behind this is that we can then reuse this array as a temporary storage for descriptors in all zero-copy drivers that use the batched interface. This will make it easier to add batching to more drivers. i40e is the only driver that has a batched Tx zero-copy implementation, so no need to touch any other driver. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20220125160446.78976-6-maciej.fijalkowski@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14netfilter: nf_tables: bail out early if hardware offload is not supportedPablo Neira Ayuso
[ Upstream commit 3a41c64d9c1185a2f3a184015e2a9b78bfc99c71 ] If user requests for NFT_CHAIN_HW_OFFLOAD, then check if either device provides the .ndo_setup_tc interface or there is an indirect flow block that has been registered. Otherwise, bail out early from the preparation phase. Moreover, validate that family == NFPROTO_NETDEV and hook is NF_NETDEV_INGRESS. Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14netfilter: nf_tables: delete flowtable hooks via transaction listPablo Neira Ayuso
[ Upstream commit b6d9014a3335194590abdd2a2471ef5147a67645 ] Remove inactive bool field in nft_hook object that was introduced in abadb2f865d7 ("netfilter: nf_tables: delete devices from flowtable"). Move stale flowtable hooks to transaction list instead. Deleting twice the same device does not result in ENOENT. Fixes: abadb2f865d7 ("netfilter: nf_tables: delete devices from flowtable") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14bluetooth: don't use bitmaps for random flag accessesLinus Torvalds
[ Upstream commit e1cff7002b716bd0b5f5f4afd4273c99aa8644be ] The bluetooth code uses our bitmap infrastructure for the two bits (!) of connection setup flags, and in the process causes odd problems when it converts between a bitmap and just the regular values of said bits. It's completely pointless to do things like bitmap_to_arr32() to convert a bitmap into a u32. It shoudln't have been a bitmap in the first place. The reason to use bitmaps is if you have arbitrary number of bits you want to manage (not two!), or if you rely on the atomicity guarantees of the bitmap setting and clearing. The code could use an "atomic_t" and use "atomic_or/andnot()" to set and clear the bit values, but considering that it then copies the bitmaps around with "bitmap_to_arr32()" and friends, there clearly cannot be a lot of atomicity requirements. So just use a regular integer. In the process, this avoids the warnings about erroneous use of bitmap_from_u64() which were triggered on 32-bit architectures when conversion from a u64 would access two words (and, surprise, surprise, only one word is needed - and indeed overkill - for a 2-bit bitmap). That was always problematic, but the compiler seems to notice it and warn about the invalid pattern only after commit 0a97953fd221 ("lib: add bitmap_{from,to}_arr64") changed the exact implementation details of 'bitmap_from_u64()', as reported by Sudip Mukherjee and Stephen Rothwell. Fixes: fe92ee6425a2 ("Bluetooth: hci_core: Rework hci_conn_params flags") Link: https://lore.kernel.org/all/YpyJ9qTNHJzz0FHY@debian/ Link: https://lore.kernel.org/all/20220606080631.0c3014f2@canb.auug.org.au/ Link: https://lore.kernel.org/all/20220605162537.1604762-1-yury.norov@gmail.com/ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Reviewed-by: Yury Norov <yury.norov@gmail.com> Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Cc: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14ax25: Fix ax25 session cleanup problemsDuoming Zhou
[ Upstream commit 7d8a3a477b3e25ada8dc71d22048c2ea417209a0 ] There are session cleanup problems in ax25_release() and ax25_disconnect(). If we setup a session and then disconnect, the disconnected session is still in "LISTENING" state that is shown below. Active AX.25 sockets Dest Source Device State Vr/Vs Send-Q Recv-Q DL9SAU-4 DL9SAU-3 ??? LISTENING 000/000 0 0 DL9SAU-3 DL9SAU-4 ??? LISTENING 000/000 0 0 The first reason is caused by del_timer_sync() in ax25_release(). The timers of ax25 are used for correct session cleanup. If we use ax25_release() to close ax25 sessions and ax25_dev is not null, the del_timer_sync() functions in ax25_release() will execute. As a result, the sessions could not be cleaned up correctly, because the timers have stopped. In order to solve this problem, this patch adds a device_up flag in ax25_dev in order to judge whether the device is up. If there are sessions to be cleaned up, the del_timer_sync() in ax25_release() will not execute. What's more, we add ax25_cb_del() in ax25_kill_by_device(), because the timers have been stopped and there are no functions that could delete ax25_cb if we do not call ax25_release(). Finally, we reorder the position of ax25_list_lock in ax25_cb_del() in order to synchronize among different functions that call ax25_cb_del(). The second reason is caused by improper check in ax25_disconnect(). The incoming ax25 sessions which ax25->sk is null will close heartbeat timer, because the check "if(!ax25->sk || ..)" is satisfied. As a result, the session could not be cleaned up properly. In order to solve this problem, this patch changes the improper check to "if(ax25->sk && ..)" in ax25_disconnect(). What`s more, the ax25_disconnect() may be called twice, which is not necessary. For example, ax25_kill_by_device() calls ax25_disconnect() and sets ax25->state to AX25_STATE_0, but ax25_release() calls ax25_disconnect() again. In order to solve this problem, this patch add a check in ax25_release(). If the flag of ax25->sk equals to SOCK_DEAD, the ax25_disconnect() in ax25_release() should not be executed. Fixes: 82e31755e55f ("ax25: Fix UAF bugs in ax25 timers") Fixes: 8a367e74c012 ("ax25: Fix segfault after sock connection timeout") Reported-and-tested-by: Thomas Osterried <thomas@osterried.de> Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Link: https://lore.kernel.org/r/20220530152158.108619-1-duoming@zju.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14net: sched: add barrier to fix packet stuck problem for lockless qdiscGuoju Fang
[ Upstream commit 2e8728c955ce0624b958eee6e030a37aca3a5d86 ] In qdisc_run_end(), the spin_unlock() only has store-release semantic, which guarantees all earlier memory access are visible before it. But the subsequent test_bit() has no barrier semantics so may be reordered ahead of the spin_unlock(). The store-load reordering may cause a packet stuck problem. The concurrent operations can be described as below, CPU 0 | CPU 1 qdisc_run_end() | qdisc_run_begin() . | . ----> /* may be reorderd here */ | . | . | . | spin_unlock() | set_bit() | . | smp_mb__after_atomic() ---- test_bit() | spin_trylock() . | . Consider the following sequence of events: CPU 0 reorder test_bit() ahead and see MISSED = 0 CPU 1 calls set_bit() CPU 1 calls spin_trylock() and return fail CPU 0 executes spin_unlock() At the end of the sequence, CPU 0 calls spin_unlock() and does nothing because it see MISSED = 0. The skb on CPU 1 has beed enqueued but no one take it, until the next cpu pushing to the qdisc (if ever ...) will notice and dequeue it. This patch fix this by adding one explicit barrier. As spin_unlock() and test_bit() ordering is a store-load ordering, a full memory barrier smp_mb() is needed here. Fixes: a90c57f2cedd ("net: sched: fix packet stuck problem for lockless qdisc") Signed-off-by: Guoju Fang <gjfang@linux.alibaba.com> Link: https://lore.kernel.org/r/20220528101628.120193-1-gjfang@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14tcp: add accessors to read/set tp->snd_cwndEric Dumazet
[ Upstream commit 40570375356c874b1578e05c1dcc3ff7c1322dbe ] We had various bugs over the years with code breaking the assumption that tp->snd_cwnd is greater than zero. Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added in commit 8b8a321ff72c ("tcp: fix zero cwnd in tcp_cwnd_reduction") can trigger, and without a repro we would have to spend considerable time finding the bug. Instead of complaining too late, we want to catch where and when tp->snd_cwnd is set to an illegal value. Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-14net: sched: fixed barrier to prevent skbuff sticking in qdisc backlogVincent Ray
[ Upstream commit a54ce3703613e41fe1d98060b62ec09a3984dc28 ] In qdisc_run_begin(), smp_mb__before_atomic() used before test_bit() does not provide any ordering guarantee as test_bit() is not an atomic operation. This, added to the fact that the spin_trylock() call at the beginning of qdisc_run_begin() does not guarantee acquire semantics if it does not grab the lock, makes it possible for the following statement : if (test_bit(__QDISC_STATE_MISSED, &qdisc->state)) to be executed before an enqueue operation called before qdisc_run_begin(). As a result the following race can happen : CPU 1 CPU 2 qdisc_run_begin() qdisc_run_begin() /* true */ set(MISSED) . /* returns false */ . . /* sees MISSED = 1 */ . /* so qdisc not empty */ . __qdisc_run() . . . pfifo_fast_dequeue() ----> /* may be done here */ . | . clear(MISSED) | . . | . smp_mb __after_atomic(); | . . | . /* recheck the queue */ | . /* nothing => exit */ | enqueue(skb1) | . | qdisc_run_begin() | . | spin_trylock() /* fail */ | . | smp_mb__before_atomic() /* not enough */ | . ---- if (test_bit(MISSED)) return false; /* exit */ In the above scenario, CPU 1 and CPU 2 both try to grab the qdisc->seqlock at the same time. Only CPU 2 succeeds and enters the bypass code path, where it emits its skb then calls __qdisc_run(). CPU1 fails, sets MISSED and goes down the traditionnal enqueue() + dequeue() code path. But when executing qdisc_run_begin() for the second time, after enqueuing its skbuff, it sees the MISSED bit still set (by itself) and consequently chooses to exit early without setting it again nor trying to grab the spinlock again. Meanwhile CPU2 has seen MISSED = 1, cleared it, checked the queue and found it empty, so it returned. At the end of the sequence, we end up with skb1 enqueued in the backlog, both CPUs out of __dev_xmit_skb(), the MISSED bit not set, and no __netif_schedule() called made. skb1 will now linger in the qdisc until somebody later performs a full __qdisc_run(). Associated to the bypass capacity of the qdisc, and the ability of the TCP layer to avoid resending packets which it knows are still in the qdisc, this can lead to serious traffic "holes" in a TCP connection. We fix this by replacing the smp_mb__before_atomic() / test_bit() / set_bit() / smp_mb__after_atomic() sequence inside qdisc_run_begin() by a single test_and_set_bit() call, which is more concise and enforces the needed memory barriers. Fixes: 89837eb4b246 ("net: sched: add barrier to ensure correct ordering for lockless qdisc") Signed-off-by: Vincent Ray <vray@kalrayinc.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220526001746.2437669-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-09Bluetooth: hci_sync: Cleanup hci_conn if it cannot be abortedLuiz Augusto von Dentz
[ Upstream commit 9b3628d79b46f06157affc56fdb218fdd4988321 ] This attempts to cleanup the hci_conn if it cannot be aborted as otherwise it would likely result in having the controller and host stack out of sync with respect to connection handle. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-09Bluetooth: HCI: Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirkLuiz Augusto von Dentz
[ Upstream commit 05abad857277dda198063017b00ba5b9fed2c0cb ] This adds HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk which can be used to mark HCI_Enhanced_Setup_Synchronous_Connection as broken even if its support command bit are set since some controller report it as supported but the command don't work properly with some configurations (e.g. BT_VOICE_TRANSPARENT/mSBC). Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-09ipv6: fix locking issues with loops over idev->addr_listNiels Dossche
[ Upstream commit 51454ea42c1ab4e0c2828bb0d4d53957976980de ] idev->addr_list needs to be protected by idev->lock. However, it is not always possible to do so while iterating and performing actions on inet6_ifaddr instances. For example, multiple functions (like addrconf_{join,leave}_anycast) eventually call down to other functions that acquire the idev->lock. The current code temporarily unlocked the idev->lock during the loops, which can cause race conditions. Moving the locks up is also not an appropriate solution as the ordering of lock acquisition will be inconsistent with for example mc_lock. This solution adds an additional field to inet6_ifaddr that is used to temporarily add the instances to a temporary list while holding idev->lock. The temporary list can then be traversed without holding idev->lock. This change was done in two places. In addrconf_ifdown, the list_for_each_entry_safe variant of the list loop is also no longer necessary as there is no deletion within that specific loop. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220403231523.45843-1-dossche.niels@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-06netfilter: conntrack: re-fetch conntrack after insertionFlorian Westphal
commit 56b14ecec97f39118bf85c9ac2438c5a949509ed upstream. In case the conntrack is clashing, insertion can free skb->_nfct and set skb->_nfct to the already-confirmed entry. This wasn't found before because the conntrack entry and the extension space used to free'd after an rcu grace period, plus the race needs events enabled to trigger. Reported-by: <syzbot+793a590957d9c1b96620@syzkaller.appspotmail.com> Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race") Fixes: 2ad9d7747c10 ("netfilter: conntrack: free extension area immediately") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-25xfrm: fix "disable_policy" flag use when arriving from different devicesEyal Birger
[ Upstream commit e6175a2ed1f18bf2f649625bf725e07adcfa6a28 ] In IPv4 setting the "disable_policy" flag on a device means no policy should be enforced for traffic originating from the device. This was implemented by seting the DST_NOPOLICY flag in the dst based on the originating device. However, dsts are cached in nexthops regardless of the originating devices, in which case, the DST_NOPOLICY flag value may be incorrect. Consider the following setup: +------------------------------+ | ROUTER | +-------------+ | +-----------------+ | | ipsec src |----|-|ipsec0 | | +-------------+ | |disable_policy=0 | +----+ | | +-----------------+ |eth1|-|----- +-------------+ | +-----------------+ +----+ | | noipsec src |----|-|eth0 | | +-------------+ | |disable_policy=1 | | | +-----------------+ | +------------------------------+ Where ROUTER has a default route towards eth1. dst entries for traffic arriving from eth0 would have DST_NOPOLICY and would be cached and therefore can be reused by traffic originating from ipsec0, skipping policy check. Fix by setting a IPSKB_NOPOLICY flag in IPCB and observing it instead of the DST in IN/FWD IPv4 policy checks. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-25xfrm: rework default policy structureNicolas Dichtel
[ Upstream commit b58b1f563ab78955d37e9e43e02790a85c66ac05 ] This is a follow up of commit f8d858e607b2 ("xfrm: make user policy API complete"). The goal is to align userland API to the internal structures. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-18secure_seq: use the 64 bits of the siphash for port offset calculationWilly Tarreau
[ Upstream commit b2d057560b8107c633b39aabe517ff9d93f285e3 ] SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit 7cd23e5300c1 ("secure_seq: use SipHash in place of MD5"), but the output remained truncated to 32-bit only. In order to exploit more bits from the hash, let's make the functions return the full 64-bit of siphash_3u32(). We also make sure the port offset calculation in __inet_hash_connect() remains done on 32-bit to avoid the need for div_u64_rem() and an extra cost on 32-bit systems. Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-18net/sched: act_pedit: really ensure the skb is writablePaolo Abeni
[ Upstream commit 8b796475fd7882663a870456466a4fb315cc1bd6 ] Currently pedit tries to ensure that the accessed skb offset is writable via skb_unclone(). The action potentially allows touching any skb bytes, so it may end-up modifying shared data. The above causes some sporadic MPTCP self-test failures, due to this code: tc -n $ns2 filter add dev ns2eth$i egress \ protocol ip prio 1000 \ handle 42 fw \ action pedit munge offset 148 u8 invert \ pipe csum tcp \ index 100 The above modifies a data byte outside the skb head and the skb is a cloned one, carrying a TCP output packet. This change addresses the issue by keeping track of a rough over-estimate highest skb offset accessed by the action and ensuring such offset is really writable. Note that this may cause performance regressions in some scenarios, but hopefully pedit is not in the critical path. Fixes: db2c24175d14 ("act_pedit: access skb->data safely") Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/1fcf78e6679d0a287dd61bb0f04730ce33b3255d.1652194627.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-15Bluetooth: Fix the creation of hdev->nameItay Iellin
commit 103a2f3255a95991252f8f13375c3a96a75011cd upstream. Set a size limit of 8 bytes of the written buffer to "hdev->name" including the terminating null byte, as the size of "hdev->name" is 8 bytes. If an id value which is greater than 9999 is allocated, then the "snprintf(hdev->name, sizeof(hdev->name), "hci%d", id)" function call would lead to a truncation of the id value in decimal notation. Set an explicit maximum id parameter in the id allocation function call. The id allocation function defines the maximum allocated id value as the maximum id parameter value minus one. Therefore, HCI_MAX_ID is defined as 10000. Signed-off-by: Itay Iellin <ieitayie@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-09Bluetooth: hci_event: Fix checking for invalid handle on error statusLuiz Augusto von Dentz
[ Upstream commit c86cc5a3ec70f5644f1fa21610b943d0441bc1f7 ] Commit d5ebaa7c5f6f6 introduces checks for handle range (e.g HCI_CONN_HANDLE_MAX) but controllers like Intel AX200 don't seem to respect the valid range int case of error status: > HCI Event: Connect Complete (0x03) plen 11 Status: Page Timeout (0x04) Handle: 65535 Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment& Sound Products Inc) Link type: ACL (0x01) Encryption: Disabled (0x00) [1644965.827560] Bluetooth: hci0: Ignoring HCI_Connection_Complete for invalid handle Because of it is impossible to cleanup the connections properly since the stack would attempt to cancel the connection which is no longer in progress causing the following trace: < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6 Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment& Sound Products Inc) = bluetoothd: src/profile.c:record_cb() Unable to get Hands-Free Voice gateway SDP record: Connection timed out > HCI Event: Command Complete (0x0e) plen 10 Create Connection Cancel (0x01|0x0008) ncmd 1 Status: Unknown Connection Identifier (0x02) Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment& Sound Products Inc) < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6 Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment& Sound Products Inc) Fixes: d5ebaa7c5f6f6 ("Bluetooth: hci_event: Ignore multiple conn complete events") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-09tcp: make sure treq->af_specific is initializedEric Dumazet
[ Upstream commit ba5a4fdd63ae0c575707030db0b634b160baddd7 ] syzbot complained about a recent change in TCP stack, hitting a NULL pointer [1] tcp request sockets have an af_specific pointer, which was used before the blamed change only for SYNACK generation in non SYNCOOKIE mode. tcp requests sockets momentarily created when third packet coming from client in SYNCOOKIE mode were not using treq->af_specific. Make sure this field is populated, in the same way normal TCP requests sockets do in tcp_conn_request(). [1] TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies. Check SNMP counters. general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 1 PID: 3695 Comm: syz-executor864 Not tainted 5.18.0-rc3-syzkaller-00224-g5fd1fe4807f9 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:tcp_create_openreq_child+0xe16/0x16b0 net/ipv4/tcp_minisocks.c:534 Code: 48 c1 ea 03 80 3c 02 00 0f 85 e5 07 00 00 4c 8b b3 28 01 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7e 08 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 c9 07 00 00 48 8b 3c 24 48 89 de 41 ff 56 08 48 RSP: 0018:ffffc90000de0588 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffff888076490330 RCX: 0000000000000100 RDX: 0000000000000001 RSI: ffffffff87d67ff0 RDI: 0000000000000008 RBP: ffff88806ee1c7f8 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff87d67f00 R11: 0000000000000000 R12: ffff88806ee1bfc0 R13: ffff88801b0e0368 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f517fe58700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffcead76960 CR3: 000000006f97b000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> tcp_v6_syn_recv_sock+0x199/0x23b0 net/ipv6/tcp_ipv6.c:1267 tcp_get_cookie_sock+0xc9/0x850 net/ipv4/syncookies.c:207 cookie_v6_check+0x15c3/0x2340 net/ipv6/syncookies.c:258 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1131 [inline] tcp_v6_do_rcv+0x1148/0x13b0 net/ipv6/tcp_ipv6.c:1486 tcp_v6_rcv+0x3305/0x3840 net/ipv6/tcp_ipv6.c:1725 ip6_protocol_deliver_rcu+0x2e9/0x1900 net/ipv6/ip6_input.c:422 ip6_input_finish+0x14c/0x2c0 net/ipv6/ip6_input.c:464 NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ip6_input+0x9c/0xd0 net/ipv6/ip6_input.c:473 dst_input include/net/dst.h:461 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ipv6_rcv+0x27f/0x3b0 net/ipv6/ip6_input.c:297 __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5405 __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5519 process_backlog+0x3a0/0x7c0 net/core/dev.c:5847 __napi_poll+0xb3/0x6e0 net/core/dev.c:6413 napi_poll net/core/dev.c:6480 [inline] net_rx_action+0x8ec/0xc60 net/core/dev.c:6567 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097 Fixes: 5b0b9e4c2c89 ("tcp: md5: incorrect tcp_header_len for incoming connections") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-09tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWATEric Dumazet
[ Upstream commit 4bfe744ff1644fbc0a991a2677dc874475dd6776 ] I had this bug sitting for too long in my pile, it is time to fix it. Thanks to Doug Porter for reminding me of it! We had various attempts in the past, including commit 0cbe6a8f089e ("tcp: remove SOCK_QUEUE_SHRUNK"), but the issue is that TCP stack currently only generates EPOLLOUT from input path, when tp->snd_una has advanced and skb(s) cleaned from rtx queue. If a flow has a big RTT, and/or receives SACKs, it is possible that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0 and no more data can be sent until tp->snd_una finally advances. What is needed is to also check if POLLOUT needs to be generated whenever tp->snd_nxt is advanced, from output path. This bug triggers more often after an idle period, as we do not receive ACK for at least one RTT. tcp_notsent_lowat could be a fraction of what CWND and pacing rate would allow to send during this RTT. In a followup patch, I will remove the bogus call to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED) from tcp_check_space(). Fact that we have decided to generate an EPOLLOUT does not mean the application has immediately refilled the transmit queue. This optimistic call might have been the reason the bug seemed not too serious. Tested: 200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2] $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat $ cat bench_rr.sh SUM=0 for i in {1..10} do V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"` echo $V SUM=$(($SUM + $V)) done echo SUM=$SUM Before patch: $ bench_rr.sh 130000000 80000000 140000000 140000000 140000000 140000000 130000000 40000000 90000000 110000000 SUM=1140000000 After patch: $ bench_rr.sh 430000000 590000000 530000000 450000000 450000000 350000000 450000000 490000000 480000000 460000000 SUM=4680000000 # This is 410 % of the value before patch. Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Doug Porter <dsp@fb.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-09ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md modePeilin Ye
[ Upstream commit 31c417c948d7f6909cb63f0ac3298f3c38f8ce20 ] As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in collect_md mode is racy for [IP6]GRE[TAP] devices. Consider the following sequence of events: 1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link add ... external". "ip" ignores "[o]seq" if "external" is specified, so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e. it uses lockless TX); 2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g. bpf_skb_set_tunnel_key() in an eBPF program attached to this device; 3. gre_fb_xmit() or __gre6_xmit() processes these skb's: gre_build_header(skb, tun_hlen, flags, protocol, tunnel_id_to_key32(tun_info->key.tun_id), (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0); ^^^^^^^^^^^^^^^^^ Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may try to do this tunnel->o_seqno++ in parallel, which is racy. Fix it by making o_seqno atomic_t. As mentioned by Eric Dumazet in commit b790e01aee74 ("ip_gre: lockless xmit"), making o_seqno atomic_t increases "chance for packets being out of order at receiver" when NETIF_F_LLTX is on. Maybe a better fix would be: 1. Do not ignore "oseq" in external mode. Users MUST specify "oseq" if they want the kernel to allow sequencing of outgoing packets; 2. Reject all outgoing TUNNEL_SEQ packets if the device was not created with "oseq". Unfortunately, that would break userspace. We could now make [IP6]GRE[TAP] devices always NETIF_F_LLTX, but let us do it in separate patches to keep this fix minimal. Suggested-by: Jakub Kicinski <kuba@kernel.org> Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.") Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-09tcp: ensure to use the most recently sent skb when filling the rate samplePengcheng Yang
[ Upstream commit b253a0680ceadc5d7b4acca7aa2d870326cad8ad ] If an ACK (s)acks multiple skbs, we favor the information from the most recently sent skb by choosing the skb with the highest prior_delivered count. But in the interval between receiving ACKs, we send multiple skbs with the same prior_delivered, because the tp->delivered only changes when we receive an ACK. We used RACK's solution, copying tcp_rack_sent_after() as tcp_skb_sent_after() helper to determine "which packet was sent last?". Later, we will use tcp_skb_sent_after() instead in RACK. Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection") Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/1650422081-22153-1-git-send-email-yangpc@wangsu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-27ipv6: make ip6_rt_gc_expire an atomic_tEric Dumazet
[ Upstream commit 9cb7c013420f98fa6fd12fc6a5dc055170c108db ] Reads and Writes to ip6_rt_gc_expire always have been racy, as syzbot reported lately [1] There is a possible risk of under-flow, leading to unexpected high value passed to fib6_run_gc(), although I have not observed this in the field. Hosts hitting ip6_dst_gc() very hard are under pretty bad state anyway. [1] BUG: KCSAN: data-race in ip6_dst_gc / ip6_dst_gc read-write to 0xffff888102110744 of 4 bytes by task 13165 on cpu 1: ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311 dst_alloc+0x9b/0x160 net/core/dst.c:86 ip6_dst_alloc net/ipv6/route.c:344 [inline] icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261 mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807 mld_send_cr net/ipv6/mcast.c:2119 [inline] mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651 process_one_work+0x3d3/0x720 kernel/workqueue.c:2289 worker_thread+0x618/0xa70 kernel/workqueue.c:2436 kthread+0x1a9/0x1e0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 read-write to 0xffff888102110744 of 4 bytes by task 11607 on cpu 0: ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311 dst_alloc+0x9b/0x160 net/core/dst.c:86 ip6_dst_alloc net/ipv6/route.c:344 [inline] icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261 mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807 mld_send_cr net/ipv6/mcast.c:2119 [inline] mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651 process_one_work+0x3d3/0x720 kernel/workqueue.c:2289 worker_thread+0x618/0xa70 kernel/workqueue.c:2436 kthread+0x1a9/0x1e0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 value changed: 0x00000bb3 -> 0x00000ba9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 11607 Comm: kworker/0:21 Not tainted 5.18.0-rc1-syzkaller-00037-g42e7a03d3bad-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: mld mld_ifc_work Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220413181333.649424-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-27esp: limit skb_page_frag_refill use to a single pageSabrina Dubroca
[ Upstream commit 5bd8baab087dff657e05387aee802e70304cc813 ] Commit ebe48d368e97 ("esp: Fix possible buffer overflow in ESP transformation") tried to fix skb_page_frag_refill usage in ESP by capping allocsize to 32k, but that doesn't completely solve the issue, as skb_page_frag_refill may return a single page. If that happens, we will write out of bounds, despite the check introduced in the previous patch. This patch forces COW in cases where we would end up calling skb_page_frag_refill with a size larger than a page (first in esp_output_head with tailen, then in esp_output_tail with skb->data_len). Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-20net/sched: flower: fix parsing of ethertype following VLAN headerVlad Buslov
[ Upstream commit 2105f700b53c24aa48b65c15652acc386044d26a ] A tc flower filter matching TCA_FLOWER_KEY_VLAN_ETH_TYPE is expected to match the L2 ethertype following the first VLAN header, as confirmed by linked discussion with the maintainer. However, such rule also matches packets that have additional second VLAN header, even though filter has both eth_type and vlan_ethtype set to "ipv4". Looking at the code this seems to be mostly an artifact of the way flower uses flow dissector. First, even though looking at the uAPI eth_type and vlan_ethtype appear like a distinct fields, in flower they are all mapped to the same key->basic.n_proto. Second, flow dissector skips following VLAN header as no keys for FLOW_DISSECTOR_KEY_CVLAN are set and eventually assigns the value of n_proto to last parsed header. With these, such filters ignore any headers present between first VLAN header and first "non magic" header (ipv4 in this case) that doesn't result FLOW_DISSECT_RET_PROTO_AGAIN. Fix the issue by extending flow dissector VLAN key structure with new 'vlan_eth_type' field that matches first ethertype following previously parsed VLAN header. Modify flower classifier to set the new flow_dissector_key_vlan->vlan_eth_type with value obtained from TCA_FLOWER_KEY_VLAN_ETH_TYPE/TCA_FLOWER_KEY_CVLAN_ETH_TYPE uAPIs. Link: https://lore.kernel.org/all/Yjhgi48BpTGh6dig@nanopsycho/ Fixes: 9399ae9a6cb2 ("net_sched: flower: Add vlan support") Fixes: d64efd0926ba ("net/sched: flower: Add supprt for matching on QinQ vlan headers") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13mctp: Use output netdev to allocate skb headroomMatt Johnston
[ Upstream commit 4a9dda1c1da65beee994f0977a56a9a21c5db2a7 ] Previously the skb was allocated with headroom MCTP_HEADER_MAXLEN, but that isn't sufficient if we are using devs that are not MCTP specific. This also adds a check that the smctp_halen provided to sendmsg for extended addressing is the correct size for the netdev. Fixes: 833ef3b91de6 ("mctp: Populate socket implementation") Reported-by: Matthew Rinaldi <mjrinal@g.clemson.edu> Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13Bluetooth: Fix not checking for valid hdev on bt_dev_{info,warn,err,dbg}Luiz Augusto von Dentz
[ Upstream commit 9b392e0e0b6d026da5a62bb79a08f32e27af858e ] This fixes attemting to print hdev->name directly which causes them to print an error: kernel: read_version:367: (efault): sock 000000006a3008f2 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13ipv4: Invalidate neighbour for broadcast address upon address additionIdo Schimmel
[ Upstream commit 0c51e12e218f20b7d976158fdc18019627326f7a ] In case user space sends a packet destined to a broadcast address when a matching broadcast route is not configured, the kernel will create a unicast neighbour entry that will never be resolved [1]. When the broadcast route is configured, the unicast neighbour entry will not be invalidated and continue to linger, resulting in packets being dropped. Solve this by invalidating unresolved neighbour entries for broadcast addresses after routes for these addresses are internally configured by the kernel. This allows the kernel to create a broadcast neighbour entry following the next route lookup. Another possible solution that is more generic but also more complex is to have the ARP code register a listener to the FIB notification chain and invalidate matching neighbour entries upon the addition of broadcast routes. It is also possible to wave off the issue as a user space problem, but it seems a bit excessive to expect user space to be that intimately familiar with the inner workings of the FIB/neighbour kernel code. [1] https://lore.kernel.org/netdev/55a04a8f-56f3-f73c-2aea-2195923f09d1@huawei.com/ Reported-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13net: initialize init_net earlierEric Dumazet
[ Upstream commit 9c1be1935fb68b2413796cdc03d019b8cf35ab51 ] While testing a patch that will follow later ("net: add netns refcount tracker to struct nsproxy") I found that devtmpfs_init() was called before init_net was initialized. This is a bug, because devtmpfs_setup() calls ksys_unshare(CLONE_NEWNS); This has the effect of increasing init_net refcount, which will be later overwritten to 1, as part of setup_net(&init_net) We had too many prior patches [1] trying to work around the root cause. Really, make sure init_net is in BSS section, and that net_ns_init() is called earlier at boot time. Note that another patch ("vfs: add netns refcount tracker to struct fs_context") also will need net_ns_init() being called before vfs_caches_init() As a bonus, this patch saves around 4KB in .data section. [1] f8c46cb39079 ("netns: do not call pernet ops for not yet set up init_net namespace") b5082df8019a ("net: Initialise init_net.count to 1") 734b65417b24 ("net: Statically initialize init_net.dev_base_head") v2: fixed a build error reported by kernel build bots (CONFIG_NET=n) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13Bluetooth: hci_event: Ignore multiple conn complete eventsSoenke Huster
[ Upstream commit d5ebaa7c5f6f688959e8d40840b2249ede63b8ed ] When one of the three connection complete events is received multiple times for the same handle, the device is registered multiple times which leads to memory corruptions. Therefore, consequent events for a single connection are ignored. The conn->state can hold different values, therefore HCI_CONN_HANDLE_UNSET is introduced to identify new connections. To make sure the events do not contain this or another invalid handle HCI_CONN_HANDLE_MAX and checks are introduced. Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=215497 Signed-off-by: Soenke Huster <soenke.huster@eknoes.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08netfilter: conntrack: Add and use nf_ct_set_auto_assign_helper_warned()Phil Sutter
[ Upstream commit 31d0bb9763efad30377505f3467f958d1ebe1e3d ] The function sets the pernet boolean to avoid the spurious warning from nf_ct_lookup_helper() when assigning conntrack helpers via nftables. Fixes: 1a64edf54f55 ("netfilter: nft_ct: add helper set support") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08netfilter: flowtable: Fix QinQ and pppoe support for inet tablePablo Neira Ayuso
[ Upstream commit 0492d857636e1c52cd71594a723c4b26a7b31978 ] nf_flow_offload_inet_hook() does not check for 802.1q and PPPoE. Fetch inner ethertype from these encapsulation protocols. Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Fixes: 4cd91f7c290f ("netfilter: flowtable: add vlan support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-28Bluetooth: hci_sync: Add a new quirk to skip HCI_FLT_CLEAR_ALLIsmael Ferreras Morezuelas
commit 0eaecfb2e4814d51ab172df3823e35d7c488b6d2 upstream. Some controllers have problems with being sent a command to clear all filtering. While the HCI code does not unconditionally send a clear-all anymore at BR/EDR setup (after the state machine refactor), there might be more ways of hitting these codepaths in the future as the kernel develops. Cc: stable@vger.kernel.org Cc: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Ismael Ferreras Morezuelas <swyterzone@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net coming late in the 5.17-rc process: 1) Revert port remap to mitigate shadowing service ports, this is causing problems in existing setups and this mitigation can be achieved with explicit ruleset, eg. ... tcp sport < 16386 tcp dport >= 32768 masquerade random This patches provided a built-in policy similar to the one described above. 2) Disable register tracking infrastructure in nf_tables. Florian reported two issues: - Existing expressions with no implemented .reduce interface that causes data-store on register should cancel the tracking. - Register clobbering might be possible storing data on registers that are larger than 32-bits. This might lead to generating incorrect ruleset bytecode. These two issues are scheduled to be addressed in the next release cycle. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: disable register tracking Revert "netfilter: conntrack: tag conntracks picked up in local out hook" Revert "netfilter: nat: force port remap to prevent shadowing well-known ports" ==================== Link: https://lore.kernel.org/r/20220312220315.64531-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-11vsock: each transport cycles only on its own socketsJiyong Park
When iterating over sockets using vsock_for_each_connected_socket, make sure that a transport filters out sockets that don't belong to the transport. There actually was an issue caused by this; in a nested VM configuration, destroying the nested VM (which often involves the closing of /dev/vhost-vsock if there was h2g connections to the nested VM) kills not only the h2g connections, but also all existing g2h connections to the (outmost) host which are totally unrelated. Tested: Executed the following steps on Cuttlefish (Android running on a VM) [1]: (1) Enter into an `adb shell` session - to have a g2h connection inside the VM, (2) open and then close /dev/vhost-vsock by `exec 3< /dev/vhost-vsock && exec 3<&-`, (3) observe that the adb session is not reset. [1] https://android.googlesource.com/device/google/cuttlefish/ Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jiyong Park <jiyong@google.com> Link: https://lore.kernel.org/r/20220311020017.1509316-1-jiyong@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-09Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-03-09 1) Fix IPv6 PMTU discovery for xfrm interfaces. From Lina Wang. 2) Revert failing for policies and states that are configured with XFRMA_IF_ID 0. It broke a user configuration. From Kai Lueke. 3) Fix a possible buffer overflow in the ESP output path. 4) Fix ESP GSO for tunnel and BEET mode on inter address family tunnels. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-08Revert "netfilter: conntrack: tag conntracks picked up in local out hook"Florian Westphal
This was a prerequisite for the ill-fated "netfilter: nat: force port remap to prevent shadowing well-known ports". As this has been reverted, this change can be backed out too. Signed-off-by: Florian Westphal <fw@strlen.de>
2022-03-07esp: Fix possible buffer overflow in ESP transformationSteffen Klassert
The maximum message size that can be send is bigger than the maximum site that skb_page_frag_refill can allocate. So it is possible to write beyond the allocated buffer. Fix this by doing a fallback to COW in that case. v2: Avoid get get_order() costs as suggested by Linus Torvalds. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Reported-by: valis <sec@valis.email> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-03-03ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()Eric Dumazet
While investigating on why a synchronize_net() has been added recently in ipv6_mc_down(), I found that igmp6_event_query() and igmp6_event_report() might drop skbs in some cases. Discussion about removing synchronize_net() from ipv6_mc_down() will happen in a different thread. Fixes: f185de28d9ae ("mld: add new workqueues for process mld events") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220303173728.937869-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Use kfree_rcu(ptr, rcu) variant, using kfree_rcu(ptr) was not intentional. From Eric Dumazet. 2) Use-after-free in netfilter hook core, from Eric Dumazet. 3) Missing rcu read lock side for netfilter egress hook, from Florian Westphal. 4) nf_queue assume state->sk is full socket while it might not be. Invoke sock_gen_put(), from Florian Westphal. 5) Add selftest to exercise the reported KASAN splat in 4) 6) Fix possible use-after-free in nf_queue in case sk_refcnt is 0. Also from Florian. 7) Use input interface index only for hardware offload, not for the software plane. This breaks tc ct action. Patch from Paul Blakey. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: net/sched: act_ct: Fix flow table lookup failure with no originating ifindex netfilter: nf_queue: handle socket prefetch netfilter: nf_queue: fix possible use-after-free selftests: netfilter: add nfqueue TCP_NEW_SYN_RECV socket race test netfilter: nf_queue: don't assume sk is full socket netfilter: egress: silence egress hook lockdep splats netfilter: fix use-after-free in __nf_register_net_hook() netfilter: nf_tables: prefer kfree_rcu(ptr, rcu) variant ==================== Link: https://lore.kernel.org/r/20220301215337.378405-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-01net/sched: act_ct: Fix flow table lookup failure with no originating ifindexPaul Blakey
After cited commit optimizted hw insertion, flow table entries are populated with ifindex information which was intended to only be used for HW offload. This tuple ifindex is hashed in the flow table key, so it must be filled for lookup to be successful. But tuple ifindex is only relevant for the netfilter flowtables (nft), so it's not filled in act_ct flow table lookup, resulting in lookup failure, and no SW offload and no offload teardown for TCP connection FIN/RST packets. To fix this, add new tc ifindex field to tuple, which will only be used for offloading, not for lookup, as it will not be part of the tuple hash. Fixes: 9795ded7f924 ("net/sched: act_ct: Fill offloading tuple iifidx") Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-03-01netfilter: nf_queue: fix possible use-after-freeFlorian Westphal
Eric Dumazet says: The sock_hold() side seems suspect, because there is no guarantee that sk_refcnt is not already 0. On failure, we cannot queue the packet and need to indicate an error. The packet will be dropped by the caller. v2: split skb prefetch hunk into separate change Fixes: 271b72c7fa82c ("udp: RCU handling for Unicast packets.") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-02-25Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) Fix PMTU for IPv6 if the reported MTU minus the ESP overhead is smaller than 1280. From Jiri Bohac. 2) Fix xfrm interface ID and inter address family tunneling when migrating xfrm states. From Yan Yan. 3) Add missing xfrm intrerface ID initialization on xfrmi_changelink. From Antony Antony. 4) Enforce validity of xfrm offload input flags so that userspace can't send undefined flags to the offload driver. From Leon Romanovsky. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-24Bluetooth: hci_sync: Fix not using conn_timeoutLuiz Augusto von Dentz
When using hci_le_create_conn_sync it shall wait for the conn_timeout since the connection complete may take longer than just 2 seconds. Also fix the masking of HCI_EV_LE_ENHANCED_CONN_COMPLETE and HCI_EV_LE_CONN_COMPLETE so they are never both set so we can predict which one the controller will use in case of HCI_OP_LE_CREATE_CONN. Fixes: 6cd29ec6ae5e3 ("Bluetooth: hci_sync: Wait for proper events when connecting LE") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-02-24Bluetooth: Fix bt_skb_sendmmsg not allocating partial chunksLuiz Augusto von Dentz
Since bt_skb_sendmmsg can be used with the likes of SOCK_STREAM it shall return the partial chunks it could allocate instead of freeing everything as otherwise it can cause problems like bellow. Fixes: 81be03e026dc ("Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg") Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/r/d7206e12-1b99-c3be-84f4-df22af427ef5@molgen.mpg.de BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215594 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> (Nokia N9 (MeeGo/Harmattan) Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-02-24openvswitch: Fix setting ipv6 fields causing hw csum failurePaul Blakey
Ipv6 ttl, label and tos fields are modified without first pulling/pushing the ipv6 header, which would have updated the hw csum (if available). This might cause csum validation when sending the packet to the stack, as can be seen in the trace below. Fix this by updating skb->csum if available. Trace resulted by ipv6 ttl dec and then sending packet to conntrack [actions: set(ipv6(hlimit=63)),ct(zone=99)]: [295241.900063] s_pf0vf2: hw csum failure [295241.923191] Call Trace: [295241.925728] <IRQ> [295241.927836] dump_stack+0x5c/0x80 [295241.931240] __skb_checksum_complete+0xac/0xc0 [295241.935778] nf_conntrack_tcp_packet+0x398/0xba0 [nf_conntrack] [295241.953030] nf_conntrack_in+0x498/0x5e0 [nf_conntrack] [295241.958344] __ovs_ct_lookup+0xac/0x860 [openvswitch] [295241.968532] ovs_ct_execute+0x4a7/0x7c0 [openvswitch] [295241.979167] do_execute_actions+0x54a/0xaa0 [openvswitch] [295242.001482] ovs_execute_actions+0x48/0x100 [openvswitch] [295242.006966] ovs_dp_process_packet+0x96/0x1d0 [openvswitch] [295242.012626] ovs_vport_receive+0x6c/0xc0 [openvswitch] [295242.028763] netdev_frame_hook+0xc0/0x180 [openvswitch] [295242.034074] __netif_receive_skb_core+0x2ca/0xcb0 [295242.047498] netif_receive_skb_internal+0x3e/0xc0 [295242.052291] napi_gro_receive+0xba/0xe0 [295242.056231] mlx5e_handle_rx_cqe_mpwrq_rep+0x12b/0x250 [mlx5_core] [295242.062513] mlx5e_poll_rx_cq+0xa0f/0xa30 [mlx5_core] [295242.067669] mlx5e_napi_poll+0xe1/0x6b0 [mlx5_core] [295242.077958] net_rx_action+0x149/0x3b0 [295242.086762] __do_softirq+0xd7/0x2d6 [295242.090427] irq_exit+0xf7/0x100 [295242.093748] do_IRQ+0x7f/0xd0 [295242.096806] common_interrupt+0xf/0xf [295242.100559] </IRQ> [295242.102750] RIP: 0033:0x7f9022e88cbd [295242.125246] RSP: 002b:00007f9022282b20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda [295242.132900] RAX: 0000000000000005 RBX: 0000000000000010 RCX: 0000000000000000 [295242.140120] RDX: 00007f9022282ba8 RSI: 00007f9022282a30 RDI: 00007f9014005c30 [295242.147337] RBP: 00007f9014014d60 R08: 0000000000000020 R09: 00007f90254a8340 [295242.154557] R10: 00007f9022282a28 R11: 0000000000000246 R12: 0000000000000000 [295242.161775] R13: 00007f902308c000 R14: 000000000000002b R15: 00007f9022b71f40 Fixes: 3fdbd1ce11e5 ("openvswitch: add ipv6 'set' action") Signed-off-by: Paul Blakey <paulb@nvidia.com> Link: https://lore.kernel.org/r/20220223163416.24096-1-paulb@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net This is fixing up the use without proper initialization in patch 5/5 -o- Hi, The following patchset contains Netfilter fixes for net: 1) Missing #ifdef CONFIG_IP6_NF_IPTABLES in recent xt_socket fix. 2) Fix incorrect flow action array size in nf_tables. 3) Unregister flowtable hooks from netns exit path. 4) Fix missing limit object release, from Florian Westphal. 5) Memleak in nf_tables object update path, also from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-20netfilter: nf_tables_offload: incorrect flow offload action array sizePablo Neira Ayuso
immediate verdict expression needs to allocate one slot in the flow offload action array, however, immediate data expression does not need to do so. fwd and dup expression need to allocate one slot, this is missing. Add a new offload_action interface to report if this expression needs to allocate one slot in the flow offload action array. Fixes: be2861dc36d7 ("netfilter: nft_{fwd,dup}_netdev: add offload support") Reported-and-tested-by: Nick Gregory <Nick.Gregory@Sophos.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-19net: Force inlining of checksum functions in net/checksum.hChristophe Leroy
All functions defined as static inline in net/checksum.h are meant to be inlined for performance reason. But since commit ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") the compiler is allowed to uninline functions when it wants. Fair enough in the general case, but for tiny performance critical checksum helpers that's counter-productive. The problem mainly arises when selecting CONFIG_CC_OPTIMISE_FOR_SIZE, Those helpers being 'static inline' in header files you suddenly find them duplicated many times in the resulting vmlinux. Here is a typical exemple when building powerpc pmac32_defconfig with CONFIG_CC_OPTIMISE_FOR_SIZE. csum_sub() appears 4 times: c04a23cc <csum_sub>: c04a23cc: 7c 84 20 f8 not r4,r4 c04a23d0: 7c 63 20 14 addc r3,r3,r4 c04a23d4: 7c 63 01 94 addze r3,r3 c04a23d8: 4e 80 00 20 blr ... c04a2ce8: 4b ff f6 e5 bl c04a23cc <csum_sub> ... c04a2d2c: 4b ff f6 a1 bl c04a23cc <csum_sub> ... c04a2d54: 4b ff f6 79 bl c04a23cc <csum_sub> ... c04a754c <csum_sub>: c04a754c: 7c 84 20 f8 not r4,r4 c04a7550: 7c 63 20 14 addc r3,r3,r4 c04a7554: 7c 63 01 94 addze r3,r3 c04a7558: 4e 80 00 20 blr ... c04ac930: 4b ff ac 1d bl c04a754c <csum_sub> ... c04ad264: 4b ff a2 e9 bl c04a754c <csum_sub> ... c04e3b08 <csum_sub>: c04e3b08: 7c 84 20 f8 not r4,r4 c04e3b0c: 7c 63 20 14 addc r3,r3,r4 c04e3b10: 7c 63 01 94 addze r3,r3 c04e3b14: 4e 80 00 20 blr ... c04e5788: 4b ff e3 81 bl c04e3b08 <csum_sub> ... c04e65c8: 4b ff d5 41 bl c04e3b08 <csum_sub> ... c0512d34 <csum_sub>: c0512d34: 7c 84 20 f8 not r4,r4 c0512d38: 7c 63 20 14 addc r3,r3,r4 c0512d3c: 7c 63 01 94 addze r3,r3 c0512d40: 4e 80 00 20 blr ... c0512dfc: 4b ff ff 39 bl c0512d34 <csum_sub> ... c05138bc: 4b ff f4 79 bl c0512d34 <csum_sub> ... Restore the expected behaviour by using __always_inline for all functions defined in net/checksum.h vmlinux size is even reduced by 256 bytes with this patch: text data bss dec hex filename 6980022 2515362 194384 9689768 93daa8 vmlinux.before 6979862 2515266 194384 9689512 93d9a8 vmlinux.now Fixes: ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net>