summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2026-03-04net: nfc: nci: Fix parameter validation for packet dataMichael Thalmeier
[ Upstream commit 571dcbeb8e635182bb825ae758399831805693c2 ] Since commit 9c328f54741b ("net: nfc: nci: Add parameter validation for packet data") communication with nci nfc chips is not working any more. The mentioned commit tries to fix access of uninitialized data, but failed to understand that in some cases the data packet is of variable length and can therefore not be compared to the maximum packet length given by the sizeof(struct). Fixes: 9c328f54741b ("net: nfc: nci: Add parameter validation for packet data") Cc: stable@vger.kernel.org Signed-off-by: Michael Thalmeier <michael.thalmeier@hale.at> Reported-by: syzbot+740e04c2a93467a0f8c8@syzkaller.appspotmail.com Link: https://patch.msgid.link/20260218083000.301354-1-michael.thalmeier@hale.at Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net/sched: act_skbedit: fix divide-by-zero in tcf_skbedit_hash()Ruitong Liu
[ Upstream commit be054cc66f739a9ba615dba9012a07fab8e7dd6f ] Commit 38a6f0865796 ("net: sched: support hash selecting tx queue") added SKBEDIT_F_TXQ_SKBHASH support. The inclusive range size is computed as: mapping_mod = queue_mapping_max - queue_mapping + 1; The range size can be 65536 when the requested range covers all possible u16 queue IDs (e.g. queue_mapping=0 and queue_mapping_max=U16_MAX). That value cannot be represented in a u16 and previously wrapped to 0, so tcf_skbedit_hash() could trigger a divide-by-zero: queue_mapping += skb_get_hash(skb) % params->mapping_mod; Compute mapping_mod in a wider type and reject ranges larger than U16_MAX to prevent params->mapping_mod from becoming 0 and avoid the crash. Fixes: 38a6f0865796 ("net: sched: support hash selecting tx queue") Cc: stable@vger.kernel.org # 6.12+ Signed-off-by: Ruitong Liu <cnitlrt@gmail.com> Link: https://patch.msgid.link/20260213175948.1505257-1-cnitlrt@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data()Qanux
[ Upstream commit 6db8b56eed62baacaf37486e83378a72635c04cc ] On the receive path, __ioam6_fill_trace_data() uses trace->nodelen to decide how much data to write for each node. It trusts this field as-is from the incoming packet, with no consistency check against trace->type (the 24-bit field that tells which data items are present). A crafted packet can set nodelen=0 while setting type bits 0-21, causing the function to write ~100 bytes past the allocated region (into skb_shared_info), which corrupts adjacent heap memory and leads to a kernel panic. Add a shared helper ioam6_trace_compute_nodelen() in ioam6.c to derive the expected nodelen from the type field, and use it: - in ioam6_iptunnel.c (send path, existing validation) to replace the open-coded computation; - in exthdrs.c (receive path, ipv6_hop_ioam) to drop packets whose nodelen is inconsistent with the type field, before any data is written. Per RFC 9197, bits 12-21 are each short (4-octet) fields, so they are included in IOAM6_MASK_SHORT_FIELDS (changed from 0xff100000 to 0xff1ffc00). Fixes: 9ee11f0fff20 ("ipv6: ioam: Data plane support for Pre-allocated Trace") Cc: stable@vger.kernel.org Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260211040412.86195-1-qjx1298677004@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04tipc: fix RCU dereference race in tipc_aead_users_dec()Daniel Hodges
[ Upstream commit 6a65c0cb0ff20b3cbc5f1c87b37dd22cdde14a1c ] tipc_aead_users_dec() calls rcu_dereference(aead) twice: once to store in 'tmp' for the NULL check, and again inside the atomic_add_unless() call. Use the already-dereferenced 'tmp' pointer consistently, matching the correct pattern used in tipc_aead_users_inc() and tipc_aead_users_set(). Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication") Cc: stable@vger.kernel.org Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Hodges <hodgesd@meta.com> Link: https://patch.msgid.link/20260203145621.17399-1-git@danielhodges.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04netfilter: nf_conntrack_h323: fix OOB read in decode_choice()Vahagn Vardanian
[ Upstream commit baed0d9ba91d4f390da12d5039128ee897253d60 ] In decode_choice(), the boundary check before get_len() uses the variable `len`, which is still 0 from its initialization at the top of the function: unsigned int type, ext, len = 0; ... if (ext || (son->attr & OPEN)) { BYTE_ALIGN(bs); if (nf_h323_error_boundary(bs, len, 0)) /* len is 0 here */ return H323_ERROR_BOUND; len = get_len(bs); /* OOB read */ When the bitstream is exactly consumed (bs->cur == bs->end), the check nf_h323_error_boundary(bs, 0, 0) evaluates to (bs->cur + 0 > bs->end), which is false. The subsequent get_len() call then dereferences *bs->cur++, reading 1 byte past the end of the buffer. If that byte has bit 7 set, get_len() reads a second byte as well. This can be triggered remotely by sending a crafted Q.931 SETUP message with a User-User Information Element containing exactly 2 bytes of PER-encoded data ({0x08, 0x00}) to port 1720 through a firewall with the nf_conntrack_h323 helper active. The decoder fully consumes the PER buffer before reaching this code path, resulting in a 1-2 byte heap-buffer-overflow read confirmed by AddressSanitizer. Fix this by checking for 2 bytes (the maximum that get_len() may read) instead of the uninitialized `len`. This matches the pattern used at every other get_len() call site in the same file, where the caller checks for 2 bytes of available data before calling get_len(). Fixes: ec8a8f3c31dd ("netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well") Signed-off-by: Vahagn Vardanian <vahagn@redrays.io> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260225130619.1248-2-fw@strlen.de Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net: consume xmit errors of GSO framesJakub Kicinski
[ Upstream commit 7aa767d0d3d04e50ae94e770db7db8197f666970 ] udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests currently in NIPA. They fail in the same exact way, TCP GRO test stalls occasionally and the test gets killed after 10min. These tests use veth to simulate GRO. They attach a trivial ("return XDP_PASS;") XDP program to the veth to force TSO off and NAPI on. Digging into the failure mode we can see that the connection is completely stuck after a burst of drops. The sender's snd_nxt is at sequence number N [1], but the receiver claims to have received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle is that senders rtx queue is not empty (let's say the block in the rtx queue is at sequence number N - 4 * MSS [3]). In this state, sender sends a retransmission from the rtx queue with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3]. Receiver sees it and responds with an ACK all the way up to N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA because it has no recollection of ever sending data that far out [1]. And we are stuck. The root cause is the mess of the xmit return codes. veth returns an error when it can't xmit a frame. We end up with a loss event like this: ------------------------------------------------- | GSO super frame 1 | GSO super frame 2 | |-----------------------------------------------| | seg | seg | seg | seg | seg | seg | seg | seg | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ------------------------------------------------- x ok ok <ok>| ok ok ok <x> \\ snd_nxt "x" means packet lost by veth, and "ok" means it went thru. Since veth has TSO disabled in this test it sees individual segments. Segment 1 is on the retransmit queue and will be resent. So why did the sender not advance snd_nxt even tho it clearly did send up to seg 8? tcp_write_xmit() interprets the return code from the core to mean that data has not been sent at all. Since TCP deals with GSO super frames, not individual segment the crux of the problem is that loss of a single segment can be interpreted as loss of all. TCP only sees the last return code for the last segment of the GSO frame (in <> brackets in the diagram above). Of course for the problem to occur we need a setup or a device without a Qdisc. Otherwise Qdisc layer disconnects the protocol layer from the device errors completely. We have multiple ways to fix this. 1) make veth not return an error when it lost a packet. While this is what I think we did in the past, the issue keeps reappearing and it's annoying to debug. The game of whack a mole is not great. 2) fix the damn return codes We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the documentation, so maybe we should make the return code from ndo_start_xmit() a boolean. I like that the most, but perhaps some ancient, not-really-networking protocol would suffer. 3) make TCP ignore the errors It is not entirely clear to me what benefit TCP gets from interpreting the result of ip_queue_xmit()? Specifically once the connection is established and we're pushing data - packet loss is just packet loss? 4) this fix Ignore the rc in the Qdisc-less+GSO case, since it's unreliable. We already always return OK in the TCQ_F_CAN_BYPASS case. In the Qdisc-less case let's be a bit more conservative and only mask the GSO errors. This path is taken by non-IP-"networks" like CAN, MCTP etc, so we could regress some ancient thing. This is the simplest, but also maybe the hackiest fix? Similar fix has been proposed by Eric in the past but never committed because original reporter was working with an OOT driver and wasn't providing feedback (see Link). Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com Fixes: 1f59533f9ca5 ("qdisc: validate frames going through the direct_xmit path") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04tipc: fix duplicate publication key in tipc_service_insert_publ()Tung Nguyen
[ Upstream commit 3aa677625c8fad39989496c51bcff3872c1f16f1 ] TIPC uses named table to store TIPC services represented by type and instance. Each time an application calls TIPC API bind() to bind a type/instance to a socket, an entry is created and inserted into the named table. It looks like this: named table: key1, entry1 (type, instance ...) key2, entry2 (type, instance ...) In the above table, each entry represents a route for sending data from one socket to the other. For all publications originated from the same node, the key is UNIQUE to identify each entry. It is calculated by this formula: key = socket portid + number of bindings + 1 (1) where: - socket portid: unique and calculated by using linux kernel function get_random_u32_below(). So, the value is randomized. - number of bindings: the number of times a type/instance pair is bound to a socket. This number is linearly increased, starting from 0. While the socket portid is unique and randomized by linux kernel, the linear increment of "number of bindings" in formula (1) makes "key" not unique anymore. For example: - Socket 1 is created with its associated port number 20062001. Type 1000, instance 1 is bound to socket 1: key1: 20062001 + 0 + 1 = 20062002 Then, bind() is called a second time on Socket 1 to by the same type 1000, instance 1: key2: 20062001 + 1 + 1 = 20062003 Named table: key1 (20062002), entry1 (1000, 1 ...) key2 (20062003), entry2 (1000, 1 ...) - Socket 2 is created with its associated port number 20062002. Type 1000, instance 1 is bound to socket 2: key3: 20062002 + 0 + 1 = 20062003 TIPC looks up the named table and finds out that key2 with the same value already exists and rejects the insertion into the named table. This leads to failure of bind() call from application on Socket 2 with error message EINVAL "Invalid argument". This commit fixes this issue by adding more port id checking to make sure that the key is unique to publications originated from the same port id and node. Fixes: 218527fe27ad ("tipc: replace name table service range array with rb tree") Signed-off-by: Tung Nguyen <tung.quang.nguyen@est.tech> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260220050541.237962-1-tung.quang.nguyen@est.tech Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04Bluetooth: L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQLuiz Augusto von Dentz
[ Upstream commit 138d7eca445ef37a0333425d269ee59900ca1104 ] This adds a check for encryption key size upon receiving L2CAP_LE_CONN_REQ which is required by L2CAP/LE/CFC/BV-15-C which expects L2CAP_CR_LE_BAD_KEY_SIZE. Link: https://lore.kernel.org/linux-bluetooth/5782243.rdbgypaU67@n9w6sw14/ Fixes: 27e2d4c8d28b ("Bluetooth: Add basic LE L2CAP connect request receiving support") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Christian Eggers <ceggers@arri.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04Bluetooth: L2CAP: Fix not checking output MTU is acceptable on ↵Luiz Augusto von Dentz
L2CAP_ECRED_CONN_REQ [ Upstream commit a8d1d73c81d1e70d2aa49fdaf59d933bb783ffe5 ] Upon receiving L2CAP_ECRED_CONN_REQ the given MTU shall be checked against the suggested MTU of the listening socket as that is required by the likes of PTS L2CAP/ECFC/BV-27-C test which expects L2CAP_CR_LE_UNACCEPT_PARAMS if the MTU is lowers than socket omtu. In order to be able to set chan->omtu the code now allows setting setsockopt(BT_SNDMTU), but it is only allowed when connection has not been stablished since there is no procedure to reconfigure the output MTU. Link: https://github.com/bluez/bluez/issues/1895 Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04Bluetooth: L2CAP: Fix response to L2CAP_ECRED_CONN_REQLuiz Augusto von Dentz
[ Upstream commit 05761c2c2b5bfec85c47f60c903c461e9b56cf87 ] Similar to 03dba9cea72f ("Bluetooth: L2CAP: Fix not responding with L2CAP_CR_LE_ENCRYPTION") the result code L2CAP_CR_LE_ENCRYPTION shall be used when BT_SECURITY_MEDIUM is set since that means security mode 2 which mean it doesn't require authentication which results in qualification test L2CAP/ECFC/BV-32-C failing. Link: https://github.com/bluez/bluez/issues/1871 Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQLuiz Augusto von Dentz
[ Upstream commit 7accb1c4321acb617faf934af59d928b0b047e2b ] This fixes responding with an invalid result caused by checking the wrong size of CID which should have been (cmd_len - sizeof(*req)) and on top of it the wrong result was use L2CAP_CR_LE_INVALID_PARAMS which is invalid/reserved for reconf when running test like L2CAP/ECFC/BI-03-C: > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 64 MPS: 64 Source CID: 64 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reserved (0x000c) Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003) Fiix L2CAP/ECFC/BI-04-C which expects L2CAP_RECONF_INVALID_MPS (0x0002) when more than one channel gets its MPS reduced: > ACL Data RX: Handle 64 flags 0x02 dlen 16 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 8 MTU: 264 MPS: 99 Source CID: 64 ! Source CID: 65 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Fix L2CAP/ECFC/BI-05-C when SCID is invalid (85 unconnected): > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 65 MPS: 64 ! Source CID: 85 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003) Fix L2CAP/ECFC/BI-06-C when MPS < L2CAP_ECRED_MIN_MPS (64): > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 672 ! MPS: 63 Source CID: 64 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Result: Reconfiguration failed - other unacceptable parameters (0x0004) Fix L2CAP/ECFC/BI-07-C when MPS reduced for more than one channel: > ACL Data RX: Handle 64 flags 0x02 dlen 16 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 3 len 8 MTU: 84 ! MPS: 71 Source CID: 64 ! Source CID: 65 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Link: https://github.com/bluez/bluez/issues/1865 Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04wifi: cfg80211: wext: fix IGTK key ID off-by-oneJohannes Berg
[ Upstream commit c8d7f21ead727485ebf965e2b4d42d4a4f0840f6 ] The IGTK key ID must be 4 or 5, but the code checks against key ID + 1, so must check against 5/6 rather than 4/5. Fix that. Reported-by: Jouni Malinen <j@w1.fi> Fixes: 08645126dd24 ("cfg80211: implement wext key handling") Link: https://patch.msgid.link/20260209181220.362205-2-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04xfrm: always flush state and policy upon NETDEV_UNREGISTER eventTetsuo Handa
[ Upstream commit 4efa91a28576054aae0e6dad9cba8fed8293aef8 ] syzbot is reporting that "struct xfrm_state" refcount is leaking. unregister_netdevice: waiting for netdevsim0 to become free. Usage count = 2 ref_tracker: netdev@ffff888052f24618 has 1/1 users at __netdev_tracker_alloc include/linux/netdevice.h:4400 [inline] netdev_tracker_alloc include/linux/netdevice.h:4412 [inline] xfrm_dev_state_add+0x3a5/0x1080 net/xfrm/xfrm_device.c:316 xfrm_state_construct net/xfrm/xfrm_user.c:986 [inline] xfrm_add_sa+0x34ff/0x5fa0 net/xfrm/xfrm_user.c:1022 xfrm_user_rcv_msg+0x58e/0xc00 net/xfrm/xfrm_user.c:3507 netlink_rcv_skb+0x158/0x420 net/netlink/af_netlink.c:2550 xfrm_netlink_rcv+0x71/0x90 net/xfrm/xfrm_user.c:3529 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x8c8/0xdd0 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0xa5d/0xc30 net/socket.c:2592 ___sys_sendmsg+0x134/0x1d0 net/socket.c:2646 __sys_sendmsg+0x16d/0x220 net/socket.c:2678 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f This is because commit d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API") implemented xfrm_dev_unregister() as no-op despite xfrm_dev_state_add() from xfrm_state_construct() acquires a reference to "struct net_device". I guess that that commit expected that NETDEV_DOWN event is fired before NETDEV_UNREGISTER event fires, and also assumed that xfrm_dev_state_add() is called only if (dev->features & NETIF_F_HW_ESP) != 0. Sabrina Dubroca identified steps to reproduce the same symptoms as below. echo 0 > /sys/bus/netdevsim/new_device dev=$(ls -1 /sys/bus/netdevsim/devices/netdevsim0/net/) ip xfrm state add src 192.168.13.1 dst 192.168.13.2 proto esp \ spi 0x1000 mode tunnel aead 'rfc4106(gcm(aes))' $key 128 \ offload crypto dev $dev dir out ethtool -K $dev esp-hw-offload off echo 0 > /sys/bus/netdevsim/del_device Like these steps indicate, the NETIF_F_HW_ESP bit can be cleared after xfrm_dev_state_add() acquired a reference to "struct net_device". Also, xfrm_dev_state_add() does not check for the NETIF_F_HW_ESP bit when acquiring a reference to "struct net_device". Commit 03891f820c21 ("xfrm: handle NETDEV_UNREGISTER for xfrm device") re-introduced the NETDEV_UNREGISTER event to xfrm_dev_event(), but that commit for unknown reason chose to share xfrm_dev_down() between the NETDEV_DOWN event and the NETDEV_UNREGISTER event. I guess that that commit missed the behavior in the previous paragraph. Therefore, we need to re-introduce xfrm_dev_unregister() in order to release the reference to "struct net_device" by unconditionally flushing state and policy. Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API") Cc: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04xfrm: skip templates check for packet offload tunnel modeLeon Romanovsky
[ Upstream commit 0a4524bc69882a4ddb235bb6b279597721bda197 ] In packet offload, hardware is responsible to check templates. The result of its operation is forwarded through secpath by relevant drivers. That secpath is actually removed in __xfrm_policy_check2(). In case packet is forwarded, this secpath is reset in RX, but pushed again to TX where policy is rechecked again against dummy secpath in xfrm_policy_ok(). Such situation causes to unexpected XfrmInTmplMismatch increase. As a solution, simply skip template mismatch check. Fixes: 600258d555f0 ("xfrm: delete intermediate secpath entry in packet offload mode") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04xfrm6: fix uninitialized saddr in xfrm6_get_saddr()Jiayuan Chen
[ Upstream commit 1799d8abeabc68ec05679292aaf6cba93b343c05 ] xfrm6_get_saddr() does not check the return value of ipv6_dev_get_saddr(). When ipv6_dev_get_saddr() fails to find a suitable source address (returns -EADDRNOTAVAIL), saddr->in6 is left uninitialized, but xfrm6_get_saddr() still returns 0 (success). This causes the caller xfrm_tmpl_resolve_one() to use the uninitialized address in xfrm_state_find(), triggering KMSAN warning: ===================================================== BUG: KMSAN: uninit-value in xfrm_state_find+0x2424/0xa940 xfrm_state_find+0x2424/0xa940 xfrm_resolve_and_create_bundle+0x906/0x5a20 xfrm_lookup_with_ifid+0xcc0/0x3770 xfrm_lookup_route+0x63/0x2b0 ip_route_output_flow+0x1ce/0x270 udp_sendmsg+0x2ce1/0x3400 inet_sendmsg+0x1ef/0x2a0 __sock_sendmsg+0x278/0x3d0 __sys_sendto+0x593/0x720 __x64_sys_sendto+0x130/0x200 x64_sys_call+0x332b/0x3e70 do_syscall_64+0xd3/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f Local variable tmp.i.i created at: xfrm_resolve_and_create_bundle+0x3e3/0x5a20 xfrm_lookup_with_ifid+0xcc0/0x3770 ===================================================== Fix by checking the return value of ipv6_dev_get_saddr() and propagating the error. Fixes: a1e59abf8249 ("[XFRM]: Fix wildcard as tunnel source") Reported-by: syzbot+e136d86d34b42399a8b1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68bf1024.a70a0220.7a912.02c2.GAE@google.com/T/ Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04libceph: define and enforce CEPH_MAX_KEY_LENIlya Dryomov
[ Upstream commit ac431d597a9bdfc2ba6b314813f29a6ef2b4a3bf ] When decoding the key, verify that the key material would fit into a fixed-size buffer in process_auth_done() and generally has a sane length. The new CEPH_MAX_KEY_LEN check replaces the existing check for a key with no key material which is a) not universal since CEPH_CRYPTO_NONE has to be excluded and b) doesn't provide much value since a smaller than needed key is just as invalid as no key -- this has to be handled elsewhere anyway. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net/rds: Clear reconnect pending bitHåkon Bugge
[ Upstream commit b89fc7c2523b2b0750d91840f4e52521270d70ed ] When canceling the reconnect worker, care must be taken to reset the reconnect-pending bit. If the reconnect worker has not yet been scheduled before it is canceled, the reconnect-pending bit will stay on forever. Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-6-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04vmw_vsock: bypass false-positive Wnonnull warning with gcc-16Arnd Bergmann
[ Upstream commit e25dbf561e03c0c5e36228e3b8b784392819ce85 ] The gcc-16.0.1 snapshot produces a false-positive warning that turns into a build failure with CONFIG_WERROR: In file included from arch/x86/include/asm/string.h:6, from net/vmw_vsock/vmci_transport.c:10: In function 'vmci_transport_packet_init', inlined from '__vmci_transport_send_control_pkt.constprop' at net/vmw_vsock/vmci_transport.c:198:2: arch/x86/include/asm/string_32.h:150:25: error: argument 2 null where non-null expected because argument 3 is nonzero [-Werror=nonnull] 150 | #define memcpy(t, f, n) __builtin_memcpy(t, f, n) | ^~~~~~~~~~~~~~~~~~~~~~~~~ net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy' 164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait)); | ^~~~~~ arch/x86/include/asm/string_32.h:150:25: note: in a call to built-in function '__builtin_memcpy' net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy' 164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait)); | ^~~~~~ This seems relatively harmless, and it so far the only instance of this warning I have found. The __vmci_transport_send_control_pkt function is called either with wait=NULL or with one of the type values that pass 'wait' into memcpy() here, but not from the same caller. Replacing the memcpy with a struct assignment is otherwise the same but avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bryan Tan <bryan-bt.tan@broadcom.com> Link: https://patch.msgid.link/20260203163406.2636463-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04Bluetooth: hci_conn: use mod_delayed_work for active mode timeoutStefan Sørensen
[ Upstream commit 49d0901e260739de2fcc90c0c29f9e31e39a2d9b ] hci_conn_enter_active_mode() uses queue_delayed_work() with the intention that the work will run after the given timeout. However, queue_delayed_work() does nothing if the work is already queued, so depending on the link policy we may end up putting the connection into idle mode every hdev->idle_timeout ms. Use mod_delayed_work() instead so the work is queued if not already queued, and the timeout is updated otherwise. Signed-off-by: Stefan Sørensen <ssorensen@roku.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04Bluetooth: hci_conn: Set link_policy on incoming ACL connectionsStefan Sørensen
[ Upstream commit 4bb091013ab0f2edfed3f58bebe658a798cbcc4d ] The connection link policy is only set when establishing an outgoing ACL connection causing connection idle modes not to be available on incoming connections. Move the setting of the link policy to the creation of the connection so all ACL connection will use the link policy set on the HCI device. Signed-off-by: Stefan Sørensen <ssorensen@roku.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ipv4: fib: Annotate access to struct fib_alias.fa_state.Kuniyuki Iwashima
[ Upstream commit 6e84fc395e90465f1418f582a9f7d53c87ab010e ] syzbot reported that struct fib_alias.fa_state can be modified locklessly by RCU readers. [0] Let's use READ_ONCE()/WRITE_ONCE() properly. [0]: BUG: KCSAN: data-race in fib_table_lookup / fib_table_lookup write to 0xffff88811b06a7fa of 1 bytes by task 4167 on cpu 0: fib_alias_accessed net/ipv4/fib_lookup.h:32 [inline] fib_table_lookup+0x361/0xd60 net/ipv4/fib_trie.c:1565 fib_lookup include/net/ip_fib.h:390 [inline] ip_route_output_key_hash_rcu+0x378/0x1380 net/ipv4/route.c:2814 ip_route_output_key_hash net/ipv4/route.c:2705 [inline] __ip_route_output_key include/net/route.h:169 [inline] ip_route_output_flow+0x65/0x110 net/ipv4/route.c:2932 udp_sendmsg+0x13c3/0x15d0 net/ipv4/udp.c:1450 inet_sendmsg+0xac/0xd0 net/ipv4/af_inet.c:859 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0x53a/0x600 net/socket.c:2592 ___sys_sendmsg+0x195/0x1e0 net/socket.c:2646 __sys_sendmmsg+0x185/0x320 net/socket.c:2735 __do_sys_sendmmsg net/socket.c:2762 [inline] __se_sys_sendmmsg net/socket.c:2759 [inline] __x64_sys_sendmmsg+0x57/0x70 net/socket.c:2759 x64_sys_call+0x1e28/0x3000 arch/x86/include/generated/asm/syscalls_64.h:308 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff88811b06a7fa of 1 bytes by task 4168 on cpu 1: fib_alias_accessed net/ipv4/fib_lookup.h:31 [inline] fib_table_lookup+0x338/0xd60 net/ipv4/fib_trie.c:1565 fib_lookup include/net/ip_fib.h:390 [inline] ip_route_output_key_hash_rcu+0x378/0x1380 net/ipv4/route.c:2814 ip_route_output_key_hash net/ipv4/route.c:2705 [inline] __ip_route_output_key include/net/route.h:169 [inline] ip_route_output_flow+0x65/0x110 net/ipv4/route.c:2932 udp_sendmsg+0x13c3/0x15d0 net/ipv4/udp.c:1450 inet_sendmsg+0xac/0xd0 net/ipv4/af_inet.c:859 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0x53a/0x600 net/socket.c:2592 ___sys_sendmsg+0x195/0x1e0 net/socket.c:2646 __sys_sendmmsg+0x185/0x320 net/socket.c:2735 __do_sys_sendmmsg net/socket.c:2762 [inline] __se_sys_sendmmsg net/socket.c:2759 [inline] __x64_sys_sendmmsg+0x57/0x70 net/socket.c:2759 x64_sys_call+0x1e28/0x3000 arch/x86/include/generated/asm/syscalls_64.h:308 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x00 -> 0x01 Reported by Kernel Concurrency Sanitizer on: CPU: 1 UID: 0 PID: 4168 Comm: syz.4.206 Not tainted syzkaller #0 PREEMPT(voluntary) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Reported-by: syzbot+d24f940f770afda885cf@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69783ead.050a0220.c9109.0013.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260127043528.514160-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ipv4: igmp: annotate data-races around idev->mr_maxdelayEric Dumazet
[ Upstream commit e4faaf65a75f650ac4366ddff5dabb826029ca5a ] idev->mr_maxdelay is read and written locklessly, add READ_ONCE()/WRITE_ONCE() annotations. While we are at it, make this field an u32. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260122172247.2429403-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04gro: change the BUG_ON() in gro_pull_from_frag0()Eric Dumazet
[ Upstream commit cbe41362be2c27e0237a94a404ae413cec9c2ad9 ] Replace the BUG_ON() which never fired with a DEBUG_NET_WARN_ON_ONCE() $ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2 add/remove: 2/2 grow/shrink: 1/1 up/down: 370/-254 (116) Function old new delta gro_try_pull_from_frag0 - 196 +196 napi_gro_frags 771 929 +158 __pfx_gro_try_pull_from_frag0 - 16 +16 __pfx_gro_pull_from_frag0 16 - -16 dev_gro_receive 1514 1464 -50 gro_pull_from_frag0 188 - -188 Total: Before=22565899, After=22566015, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260122045720.1221017-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net/rds: No shortcut out of RDS_CONN_ERRORGerd Rausch
[ Upstream commit ad22d24be635c6beab6a1fdd3f8b1f3c478d15da ] RDS connections carry a state "rds_conn_path::cp_state" and transitions from one state to another and are conditional upon an expected state: "rds_conn_path_transition." There is one exception to this conditionality, which is "RDS_CONN_ERROR" that can be enforced by "rds_conn_path_drop" regardless of what state the condition is currently in. But as soon as a connection enters state "RDS_CONN_ERROR", the connection handling code expects it to go through the shutdown-path. The RDS/TCP multipath changes added a shortcut out of "RDS_CONN_ERROR" straight back to "RDS_CONN_CONNECTING" via "rds_tcp_accept_one_path" (e.g. after "rds_tcp_state_change"). A subsequent "rds_tcp_reset_callbacks" can then transition the state to "RDS_CONN_RESETTING" with a shutdown-worker queued. That'll trip up "rds_conn_init_shutdown", which was never adjusted to handle "RDS_CONN_RESETTING" and subsequently drops the connection with the dreaded "DR_INV_CONN_STATE", which leaves "RDS_SHUTDOWN_WORK_QUEUED" on forever. So we do two things here: a) Don't shortcut "RDS_CONN_ERROR", but take the longer path through the shutdown code. b) Add "RDS_CONN_RESETTING" to the expected states in "rds_conn_init_shutdown" so that we won't error out and get stuck, if we ever hit weird state transitions like this again." Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260122055213.83608-2-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04netfilter: xt_tcpmss: check remaining length before reading optlenFlorian Westphal
[ Upstream commit 735ee8582da3d239eb0c7a53adca61b79fb228b3 ] Quoting reporter: In net/netfilter/xt_tcpmss.c (lines 53-68), the TCP option parser reads op[i+1] directly without validating the remaining option length. If the last byte of the option field is not EOL/NOP (0/1), the code attempts to index op[i+1]. In the case where i + 1 == optlen, this causes an out-of-bounds read, accessing memory past the optlen boundary (either reading beyond the stack buffer _opt or the following payload). Reported-by: sungzii <sungzii@pm.me> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04netfilter: nf_conntrack: Add allow_clash to generic protocol handlerYuto Hamaguchi
[ Upstream commit 8a49fc8d8a3e83dc51ec05bcd4007bdea3c56eec ] The upstream commit, 71d8c47fc653711c41bc3282e5b0e605b3727956 ("netfilter: conntrack: introduce clash resolution on insertion race"), sets allow_clash=true in the UDP/UDPLITE protocol handler but does not set it in the generic protocol handler. As a result, packets composed of connectionless protocols at each layer, such as UDP over IP-in-IP, still drop packets due to conflicts during conntrack insertion. To resolve this, this patch sets allow_clash in the nf_conntrack_l4proto_generic. Signed-off-by: Yuto Hamaguchi <Hamaguchi.Yuto@da.MitsubishiElectric.co.jp> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ipv6: exthdrs: annotate data-race over multiple sysctlEric Dumazet
[ Upstream commit 978b67d28358b0b4eacfa94453d1ad4e09b123ad ] Following four sysctls can change under us, add missing READ_ONCE(). - ipv6.sysctl.max_dst_opts_len - ipv6.sysctl.max_dst_opts_cnt - ipv6.sysctl.max_hbh_opts_len - ipv6.sysctl.max_hbh_opts_cnt Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04SUNRPC: fix gss_auth kref leak in gss_alloc_msg error pathDaniel Hodges
commit dd2fdc3504592d85e549c523b054898a036a6afe upstream. Commit 5940d1cf9f42 ("SUNRPC: Rebalance a kref in auth_gss.c") added a kref_get(&gss_auth->kref) call to balance the gss_put_auth() done in gss_release_msg(), but forgot to add a corresponding kref_put() on the error path when kstrdup_const() fails. If service_name is non-NULL and kstrdup_const() fails, the function jumps to err_put_pipe_version which calls put_pipe_version() and kfree(gss_msg), but never releases the gss_auth reference. This leads to a kref leak where the gss_auth structure is never freed. Add a forward declaration for gss_free_callback() and call kref_put() in the err_put_pipe_version error path to properly release the reference taken earlier. Fixes: 5940d1cf9f42 ("SUNRPC: Rebalance a kref in auth_gss.c") Cc: stable@vger.kernel.org Signed-off-by: Daniel Hodges <git@danielhodges.dev> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04SUNRPC: auth_gss: fix memory leaks in XDR decoding error pathsChuck Lever
commit 3e6397b056335cc56ef0e9da36c95946a19f5118 upstream. The gssx_dec_ctx(), gssx_dec_status(), and gssx_dec_name() functions allocate memory via gssx_dec_buffer(), which calls kmemdup(). When a subsequent decode operation fails, these functions return immediately without freeing previously allocated buffers, causing memory leaks. The leak in gssx_dec_ctx() is particularly relevant because the caller (gssp_accept_sec_context_upcall) initializes several buffer length fields to non-zero values, resulting in memory allocation: struct gssx_ctx rctxh = { .exported_context_token.len = GSSX_max_output_handle_sz, .mech.len = GSS_OID_MAX_LEN, .src_name.display_name.len = GSSX_max_princ_sz, .targ_name.display_name.len = GSSX_max_princ_sz }; If, for example, gssx_dec_name() succeeds for src_name but fails for targ_name, the memory allocated for exported_context_token, mech, and src_name.display_name remains unreferenced and cannot be reclaimed. Add error handling with goto-based cleanup to free any previously allocated buffers before returning an error. Reported-by: Xingjing Deng <micro6947@gmail.com> Closes: https://lore.kernel.org/linux-nfs/CAK+ZN9qttsFDu6h1FoqGadXjMx1QXqPMoYQ=6O9RY4SxVTvKng@mail.gmail.com/ Fixes: 1d658336b05f ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04netns-ipv4: reorganize netns_ipv4 fast path variablesCoco Li
[ Upstream commit 18fd64d2542292713b0322e6815be059bdee440c ] Reorganize fast path variables on tx-txrx-rx order. Fastpath cacheline ends after sysctl_tcp_rmem. There are only read-only variables here. (write is on the control path and not considered in this case) Below data generated with pahole on x86 architecture. Fast path variables span cache lines before change: 4 Fast path variables span cache lines after change: 2 Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 87b08913a9ae ("inet: move icmp_global_{credit,stamp} to a separate cache line") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04tcp: Set pingpong threshold via sysctlHaiyang Zhang
[ Upstream commit 562b1fdf061bff9394ccd884456ed1173c224fdc ] TCP pingpong threshold is 1 by default. But some applications, like SQL DB may prefer a higher pingpong threshold to activate delayed acks in quick ack mode for better performance. The pingpong threshold and related code were changed to 3 in the year 2019 in: commit 4a41f453bedf ("tcp: change pingpong threshold to 3") And reverted to 1 in the year 2022 in: commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"") There is no single value that fits all applications. Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for optimal performance based on the application needs. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 87b08913a9ae ("inet: move icmp_global_{credit,stamp} to a separate cache line") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04tcp: defer regular ACK while processing socket backlogEric Dumazet
[ Upstream commit 133c4c0d37175f510a10fa9bed51e223936073fc ] This idea came after a particular workload requested the quickack attribute set on routes, and a performance drop was noticed for large bulk transfers. For high throughput flows, it is best to use one cpu running the user thread issuing socket system calls, and a separate cpu to process incoming packets from BH context. (With TSO/GRO, bottleneck is usually the 'user' cpu) Problem is the user thread can spend a lot of time while holding the socket lock, forcing BH handler to queue most of incoming packets in the socket backlog. Whenever the user thread releases the socket lock, it must first process all accumulated packets in the backlog, potentially adding latency spikes. Due to flood mitigation, having too many packets in the backlog increases chance of unexpected drops. Backlog processing unfortunately shifts a fair amount of cpu cycles from the BH cpu to the 'user' cpu, thus reducing max throughput. This patch takes advantage of the backlog processing, and the fact that ACK are mostly cumulative. The idea is to detect we are in the backlog processing and defer all eligible ACK into a single one, sent from tcp_release_cb(). This saves cpu cycles on both sides, and network resources. Performance of a single TCP flow on a 200Gbit NIC: - Throughput is increased by 20% (100Gbit -> 120Gbit). - Number of generated ACK per second shrinks from 240,000 to 40,000. - Number of backlog drops per second shrinks from 230 to 0. Benchmark context: - Regular netperf TCP_STREAM (no zerocopy) - Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids) - MAX_SKB_FRAGS = 17 (~60KB per GRO packet) This feature is guarded by a new sysctl, and enabled by default: /proc/sys/net/ipv4/tcp_backlog_ack_defer Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: 87b08913a9ae ("inet: move icmp_global_{credit,stamp} to a separate cache line") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04icmp: prevent possible overflow in icmp_global_allow()Eric Dumazet
[ Upstream commit 034bbd806298e9ba4197dd1587b0348ee30996ea ] Following expression can overflow if sysctl_icmp_msgs_per_sec is big enough. sysctl_icmp_msgs_per_sec * delta / HZ; Fixes: 4cdf507d5452 ("icmp: add a global rate limitation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04icmp: icmp_msgs_per_sec and icmp_msgs_burst sysctls become per netnsEric Dumazet
[ Upstream commit f17bf505ff89595df5147755e51441632a5dc563 ] Previous patch made ICMP rate limits per netns, it makes sense to allow each netns to change the associated sysctl. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240829144641.3880376-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 034bbd806298 ("icmp: prevent possible overflow in icmp_global_allow()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04icmp: move icmp_global.credit and icmp_global.stamp to per netns storageEric Dumazet
[ Upstream commit b056b4cd9178f7a1d5d57f7b48b073c29729ddaa ] Host wide ICMP ratelimiter should be per netns, to provide better isolation. Following patch in this series makes the sysctl per netns. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240829144641.3880376-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 034bbd806298 ("icmp: prevent possible overflow in icmp_global_allow()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ping: annotate data-races in ping_lookup()Eric Dumazet
[ Upstream commit ad5dfde2a5733aaf652ea3e40c8c5e071e935901 ] isk->inet_num, isk->inet_rcv_saddr and sk->sk_bound_dev_if are read locklessly in ping_lookup(). Add READ_ONCE()/WRITE_ONCE() annotations. The race on isk->inet_rcv_saddr is probably coming from IPv6 support, but does not deserve a specific backport. Fixes: dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216100149.3319315-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04netfilter: nf_tables: fix use-after-free in nf_tables_addchain()Inseo An
[ Upstream commit 71e99ee20fc3f662555118cf1159443250647533 ] nf_tables_addchain() publishes the chain to table->chains via list_add_tail_rcu() (in nft_chain_add()) before registering hooks. If nf_tables_register_hook() then fails, the error path calls nft_chain_del() (list_del_rcu()) followed by nf_tables_chain_destroy() with no RCU grace period in between. This creates two use-after-free conditions: 1) Control-plane: nf_tables_dump_chains() traverses table->chains under rcu_read_lock(). A concurrent dump can still be walking the chain when the error path frees it. 2) Packet path: for NFPROTO_INET, nf_register_net_hook() briefly installs the IPv4 hook before IPv6 registration fails. Packets entering nft_do_chain() via the transient IPv4 hook can still be dereferencing chain->blob_gen_X when the error path frees the chain. Add synchronize_rcu() between nft_chain_del() and the chain destroy so that all RCU readers -- both dump threads and in-flight packet evaluation -- have finished before the chain is freed. Fixes: 91c7b38dc9f0 ("netfilter: nf_tables: use new transaction infrastructure to handle chain") Signed-off-by: Inseo An <y0un9sa@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net: remove WARN_ON_ONCE when accessing forward path arrayPablo Neira Ayuso
[ Upstream commit 008e7a7c293b30bc43e4368dac6ea3808b75a572 ] Although unlikely, recent support for IPIP tunnels increases chances of reaching this WARN_ON_ONCE if userspace manages to build a sufficiently long forward path. Remove it. Fixes: ddb94eafab8b ("net: resolve forwarding path from virtual netdevice and HW destination address") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04netfilter: nf_conntrack_h323: don't pass uninitialised l3num valueFlorian Westphal
[ Upstream commit a6d28eb8efe96b3e35c92efdf1bfacb0cccf541f ] Mihail Milev reports: Error: UNINIT (CWE-457): net/netfilter/nf_conntrack_h323_main.c:1189:2: var_decl: Declaring variable "tuple" without initializer. net/netfilter/nf_conntrack_h323_main.c:1197:2: uninit_use_in_call: Using uninitialized value "tuple.src.l3num" when calling "__nf_ct_expect_find". net/netfilter/nf_conntrack_expect.c:142:2: read_value: Reading value "tuple->src.l3num" when calling "nf_ct_expect_dst_hash". 1195| tuple.dst.protonum = IPPROTO_TCP; 1196| 1197|-> exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple); 1198| if (exp && exp->master == ct) 1199| return exp; Switch this to a C99 initialiser and set the l3num value. Fixes: f587de0e2feb ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net: bridge: mcast: always update mdb_n_entries for vlan contextsNikolay Aleksandrov
[ Upstream commit 8b769e311a86bb9d15c5658ad283b86fc8f080a2 ] syzbot triggered a warning[1] about the number of mdb entries in a context. It turned out that there are multiple ways to trigger that warning today (some got added during the years), the root cause of the problem is that the increase is done conditionally, and over the years these different conditions increased so there were new ways to trigger the warning, that is to do a decrease which wasn't paired with a previous increase. For example one way to trigger it is with flush: $ ip l add br0 up type bridge vlan_filtering 1 mcast_snooping 1 $ ip l add dumdum up master br0 type dummy $ bridge mdb add dev br0 port dumdum grp 239.0.0.1 permanent vid 1 $ ip link set dev br0 down $ ip link set dev br0 type bridge mcast_vlan_snooping 1 ^^^^ this will enable snooping, but will not update mdb_n_entries because in __br_multicast_enable_port_ctx() we check !netif_running $ bridge mdb flush dev br0 ^^^ this will trigger the warning because it will delete the pg which we added above, which will try to decrease mdb_n_entries Fix the problem by removing the conditional increase and always keep the count up-to-date while the vlan exists. In order to do that we have to first initialize it on port-vlan context creation, and then always increase or decrease the value regardless of mcast options. To keep the current behaviour we have to enforce the mdb limit only if the context is port's or if the port-vlan's mcast snooping is enabled. [1] ------------[ cut here ]------------ n == 0 WARNING: net/bridge/br_multicast.c:718 at br_multicast_port_ngroups_dec_one net/bridge/br_multicast.c:718 [inline], CPU#0: syz.4.4607/22043 WARNING: net/bridge/br_multicast.c:718 at br_multicast_port_ngroups_dec net/bridge/br_multicast.c:771 [inline], CPU#0: syz.4.4607/22043 WARNING: net/bridge/br_multicast.c:718 at br_multicast_del_pg+0x1bbe/0x1e20 net/bridge/br_multicast.c:825, CPU#0: syz.4.4607/22043 Modules linked in: CPU: 0 UID: 0 PID: 22043 Comm: syz.4.4607 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/24/2026 RIP: 0010:br_multicast_port_ngroups_dec_one net/bridge/br_multicast.c:718 [inline] RIP: 0010:br_multicast_port_ngroups_dec net/bridge/br_multicast.c:771 [inline] RIP: 0010:br_multicast_del_pg+0x1bbe/0x1e20 net/bridge/br_multicast.c:825 Code: 41 5f 5d e9 04 7a 48 f7 e8 3f 73 5c f7 90 0f 0b 90 e9 cf fd ff ff e8 31 73 5c f7 90 0f 0b 90 e9 16 fd ff ff e8 23 73 5c f7 90 <0f> 0b 90 e9 60 fd ff ff e8 15 73 5c f7 eb 05 e8 0e 73 5c f7 48 8b RSP: 0018:ffffc9000c207220 EFLAGS: 00010293 RAX: ffffffff8a68042d RBX: ffff88807c6f1800 RCX: ffff888066e90000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff888066e90000 R09: 000000000000000c R10: 000000000000000c R11: 0000000000000000 R12: ffff8880303ef800 R13: dffffc0000000000 R14: ffff888050eb11c4 R15: 1ffff1100a1d6238 FS: 00007fa45921b6c0(0000) GS:ffff8881256f5000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa4591f9ff8 CR3: 0000000081df2000 CR4: 00000000003526f0 Call Trace: <TASK> br_mdb_flush_pgs net/bridge/br_mdb.c:1525 [inline] br_mdb_flush net/bridge/br_mdb.c:1544 [inline] br_mdb_del_bulk+0x5e2/0xb20 net/bridge/br_mdb.c:1561 rtnl_mdb_del+0x48a/0x640 net/core/rtnetlink.c:-1 rtnetlink_rcv_msg+0x77e/0xbe0 net/core/rtnetlink.c:6967 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0xa68/0xad0 net/socket.c:2592 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2646 __sys_sendmsg net/socket.c:2678 [inline] __do_sys_sendmsg net/socket.c:2683 [inline] __se_sys_sendmsg net/socket.c:2681 [inline] __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2681 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa45839aeb9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fa45921b028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007fa458615fa0 RCX: 00007fa45839aeb9 RDX: 0000000000000000 RSI: 00002000000000c0 RDI: 0000000000000004 RBP: 00007fa458408c1f R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fa458616038 R14: 00007fa458615fa0 R15: 00007fff0b59fae8 </TASK> Fixes: b57e8d870d52 ("net: bridge: Maintain number of MDB entries in net_bridge_mcast_port") Reported-by: syzbot+d5d1b7343531d17bd3c5@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/aYrWbRp83MQR1ife@debil/T/#t Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://patch.msgid.link/20260213070031.1400003-2-nikolay@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net/rds: rds_sendmsg should not discard payload_lenAllison Henderson
[ Upstream commit da29e453dcb3aa7cabead7915f5f945d0add3a52 ] Commit 3db6e0d172c9 ("rds: use RCU to synchronize work-enqueue with connection teardown") modifies rds_sendmsg to avoid enqueueing work while a tear down is in progress. However, it also changed the return value of rds_sendmsg to that of rds_send_xmit instead of the payload_len. This means the user may incorrectly receive errno values when it should have simply received a payload of 0 while the peer attempts a reconnections. So this patch corrects the teardown handling code to only use the out error path in that case, thus restoring the original payload_len return value. Fixes: 3db6e0d172c9 ("rds: use RCU to synchronize work-enqueue with connection teardown") Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260213035409.1963391-1-achender@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ipv6: Fix out-of-bound access in fib6_add_rt2node().Kuniyuki Iwashima
[ Upstream commit 8244f959e2c125c849e569f5b23ed49804cce695 ] syzbot reported out-of-bound read in fib6_add_rt2node(). [0] When IPv6 route is created with RTA_NH_ID, struct fib6_info does not have the trailing struct fib6_nh. The cited commit started to check !iter->fib6_nh->fib_nh_gw_family to ensure that rt6_qualify_for_ecmp() will return false for iter. If iter->nh is not NULL, rt6_qualify_for_ecmp() returns false anyway. Let's check iter->nh before reading iter->fib6_nh and avoid OOB read. [0]: BUG: KASAN: slab-out-of-bounds in fib6_add_rt2node+0x349c/0x3500 net/ipv6/ip6_fib.c:1142 Read of size 1 at addr ffff8880384ba6de by task syz.0.18/5500 CPU: 0 UID: 0 PID: 5500 Comm: syz.0.18 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xba/0x230 mm/kasan/report.c:482 kasan_report+0x117/0x150 mm/kasan/report.c:595 fib6_add_rt2node+0x349c/0x3500 net/ipv6/ip6_fib.c:1142 fib6_add_rt2node_nh net/ipv6/ip6_fib.c:1363 [inline] fib6_add+0x910/0x18c0 net/ipv6/ip6_fib.c:1531 __ip6_ins_rt net/ipv6/route.c:1351 [inline] ip6_route_add+0xde/0x1b0 net/ipv6/route.c:3957 inet6_rtm_newroute+0x268/0x19e0 net/ipv6/route.c:5660 rtnetlink_rcv_msg+0x7d5/0xbe0 net/core/rtnetlink.c:6958 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0xa68/0xad0 net/socket.c:2592 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2646 __sys_sendmsg net/socket.c:2678 [inline] __do_sys_sendmsg net/socket.c:2683 [inline] __se_sys_sendmsg net/socket.c:2681 [inline] __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2681 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f9316b9aeb9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd8809b678 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f9316e15fa0 RCX: 00007f9316b9aeb9 RDX: 0000000000000000 RSI: 0000200000004380 RDI: 0000000000000003 RBP: 00007f9316c08c1f R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f9316e15fac R14: 00007f9316e15fa0 R15: 00007f9316e15fa0 </TASK> Allocated by task 5499: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 poison_kmalloc_redzone mm/kasan/common.c:398 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415 kasan_kmalloc include/linux/kasan.h:263 [inline] __do_kmalloc_node mm/slub.c:5657 [inline] __kmalloc_noprof+0x40c/0x7e0 mm/slub.c:5669 kmalloc_noprof include/linux/slab.h:961 [inline] kzalloc_noprof include/linux/slab.h:1094 [inline] fib6_info_alloc+0x30/0xf0 net/ipv6/ip6_fib.c:155 ip6_route_info_create+0x142/0x860 net/ipv6/route.c:3820 ip6_route_add+0x49/0x1b0 net/ipv6/route.c:3949 inet6_rtm_newroute+0x268/0x19e0 net/ipv6/route.c:5660 rtnetlink_rcv_msg+0x7d5/0xbe0 net/core/rtnetlink.c:6958 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline] netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0xa68/0xad0 net/socket.c:2592 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2646 __sys_sendmsg net/socket.c:2678 [inline] __do_sys_sendmsg net/socket.c:2683 [inline] __se_sys_sendmsg net/socket.c:2681 [inline] __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2681 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: bbf4a17ad9ff ("ipv6: Fix ECMP sibling count mismatch when clearing RTF_ADDRCONF") Reported-by: syzbot+707d6a5da1ab9e0c6f9d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/698cbfba.050a0220.2eeac1.009d.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Shigeru Yoshida <syoshida@redhat.com> Link: https://patch.msgid.link/20260211175133.3657034-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04RDMA/core: add rdma_rw_max_sge() helper for SQ sizingChuck Lever
[ Upstream commit afcae7d7b8a278a6c29e064f99e5bafd4ac1fb37 ] svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the number of rdma_rw contexts (ctxts). This value is used to allocate the Send CQ and to initialize the sc_sq_avail credit pool. However, when the device uses memory registration for RDMA operations, rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three per context to account for REG and INV work requests. The Send CQ and credit pool remain sized for only one work request per context, causing Send Queue exhaustion under heavy NFS WRITE workloads. Introduce rdma_rw_max_sge() to compute the actual number of Send Queue entries required for a given number of rdma_rw contexts. Upper layer protocols call this helper before creating a Queue Pair so that their Send CQs and credit accounting match the QP's true capacity. Update svc_rdma_accept() to use rdma_rw_max_sge() when computing sc_sq_depth, ensuring the credit pool reflects the work requests that rdma_rw_init_qp() will reserve. Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04svcrdma: Reduce the number of rdma_rw contexts per-QPChuck Lever
[ Upstream commit 59243315890578a040a2d50ae9e001a2ef2fcb62 ] There is an upper bound on the number of rdma_rw contexts that can be created per QP. This invisible upper bound is because rdma_create_qp() adds one or more additional SQEs for each ctxt that the ULP requests via qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on the order of the sum of qp_attr.cap.max_send_wr and a factor times qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending on whether MR operations are required before RDMA Reads. This limit is not visible to RDMA consumers via dev->attrs. When the limit is surpassed, QP creation fails with -ENOMEM. For example: svcrdma's estimate of the number of rdma_rw contexts it needs is three times the number of pages in RPCSVC_MAXPAGES. When MAXPAGES is about 260, the internally-computed SQ length should be: 64 credits + 10 backlog + 3 * (3 * 260) = 2414 Which is well below the advertised qp_max_wr of 32768. If RPCSVC_MAXPAGES is increased to 4MB, that's 1040 pages: 64 credits + 10 backlog + 3 * (3 * 1040) = 9434 However, QP creation fails. Dynamic printk for mlx5 shows: calc_sq_size:618:(pid 1514): send queue size (9326 * 256 / 64 -> 65536) exceeds limits(32768) Although 9326 is still far below qp_max_wr, QP creation still fails. Because the total SQ length calculation is opaque to RDMA consumers, there doesn't seem to be much that can be done about this except for consumers to try to keep the requested rdma_rw ctxt count low. Fixes: 2da0f610e733 ("svcrdma: Increase the per-transport rw_ctx count") Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Stable-dep-of: afcae7d7b8a2 ("RDMA/core: add rdma_rw_max_sge() helper for SQ sizing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04svcrdma: Increase the per-transport rw_ctx countChuck Lever
[ Upstream commit 2da0f610e733606e06284ac3c1f188b9dec75d68 ] rdma_rw_mr_factor() returns the smallest number of MRs needed to move a particular number of pages. svcrdma currently asks for the number of MRs needed to move RPCSVC_MAXPAGES (a little over one megabyte), as that is the number of pages in the largest r/wsize the server supports. This call assumes that the client's NIC can bundle a full one megabyte payload in a single rdma_segment. In fact, most NICs cannot handle a full megabyte with a single rkey / rdma_segment. Clients will typically split even a single Read chunk into many segments. The server needs one MR to read each rdma_segment in a Read chunk, and thus each one needs an rw_ctx. svcrdma has been vastly underestimating the number of rw_ctxs needed to handle 64 RPC requests with large Read chunks using small rdma_segments. Unfortunately there doesn't seem to be a good way to estimate this number without knowing the client NIC's capabilities. Even then, the client RPC/RDMA implementation is still free to split a chunk into smaller segments (for example, it might be using physical registration, which needs an rdma_segment per page). The best we can do for now is choose a number that will guarantee forward progress in the worst case (one page per segment). At some later point, we could add some mechanisms to make this much less of a problem: - Add a core API to add more rw_ctxs to an already-established QP - svcrdma could treat rw_ctx exhaustion as a temporary error and try again - Limit the number of Reads in flight Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Stable-dep-of: afcae7d7b8a2 ("RDMA/core: add rdma_rw_max_sge() helper for SQ sizing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04svcrdma: Clean up comment in svc_rdma_accept()Chuck Lever
[ Upstream commit fc2e69db82c1ac506cd7f539a3ab66d51d3380dc ] The comment that starts "Qualify ..." applies to only some of the following code paragraph. Re-arrange the lines so the comment makes more sense. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Stable-dep-of: afcae7d7b8a2 ("RDMA/core: add rdma_rw_max_sge() helper for SQ sizing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04svcrdma: Remove queue-shortening warningsChuck Lever
[ Upstream commit b918bfcf370c92ea3b82fa9bb3d017702b5fa4cb ] These won't have much diagnostic value for site administrators. Since they can't be disabled, they become noise. What's more, the subsequent rdma_create_qp() call adjusts the Send Queue size (possibly downward) without warning, making the size reported by these pr_warns inaccurate. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Stable-dep-of: afcae7d7b8a2 ("RDMA/core: add rdma_rw_max_sge() helper for SQ sizing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04xfrm: fix ip_rt_bug race in icmp_route_lookup reverse pathJiayuan Chen
[ Upstream commit 81b84de32bb27ae1ae2eb9acf0420e9d0d14bf00 ] icmp_route_lookup() performs multiple route lookups to find a suitable route for sending ICMP error messages, with special handling for XFRM (IPsec) policies. The lookup sequence is: 1. First, lookup output route for ICMP reply (dst = original src) 2. Pass through xfrm_lookup() for policy check 3. If blocked (-EPERM) or dst is not local, enter "reverse path" 4. In reverse path, call xfrm_decode_session_reverse() to get fl4_dec which reverses the original packet's flow (saddr<->daddr swapped) 5. If fl4_dec.saddr is local (we are the original destination), use __ip_route_output_key() for output route lookup 6. If fl4_dec.saddr is NOT local (we are a forwarding node), use ip_route_input() to simulate the reverse packet's input path 7. Finally, pass rt2 through xfrm_lookup() with XFRM_LOOKUP_ICMP flag The bug occurs in step 6: ip_route_input() is called with fl4_dec.daddr (original packet's source) as destination. If this address becomes local between the initial check and ip_route_input() call (e.g., due to concurrent "ip addr add"), ip_route_input() returns a LOCAL route with dst.output set to ip_rt_bug. This route is then used for ICMP output, causing dst_output() to call ip_rt_bug(), triggering a WARN_ON: ------------[ cut here ]------------ WARNING: net/ipv4/route.c:1275 at ip_rt_bug+0x21/0x30, CPU#1 Call Trace: <TASK> ip_push_pending_frames+0x202/0x240 icmp_push_reply+0x30d/0x430 __icmp_send+0x1149/0x24f0 ip_options_compile+0xa2/0xd0 ip_rcv_finish_core+0x829/0x1950 ip_rcv+0x2d7/0x420 __netif_receive_skb_one_core+0x185/0x1f0 netif_receive_skb+0x90/0x450 tun_get_user+0x3413/0x3fb0 tun_chr_write_iter+0xe4/0x220 ... Fix this by checking rt2->rt_type after ip_route_input(). If it's RTN_LOCAL, the route cannot be used for output, so treat it as an error. The reproducer requires kernel modification to widen the race window, making it unsuitable as a selftest. It is available at: https://gist.github.com/mrpre/eae853b72ac6a750f5d45d64ddac1e81 Reported-by: syzbot+e738404dcd14b620923c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000b1060905eada8881@google.com/T/ Closes: https://lore.kernel.org/r/20260128090523.356953-1-jiayuan.chen@linux.dev Fixes: 8b7817f3a959 ("[IPSEC]: Add ICMP host relookup support") Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260206050220.59642-1-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net: Switch to skb_dstref_steal/skb_dstref_restore for ip_route_input callersStanislav Fomichev
[ Upstream commit e97e6a1830ddb5885ba312e56b6fa3aa39b5f47e ] Going forward skb_dst_set will assert that skb dst_entry is empty during skb_dst_set. skb_dstref_steal is added to reset existing entry without doing refcnt. skb_dstref_restore should be used to restore the previous entry. Convert icmp_route_lookup and ip_options_rcv_srr to these helpers. Add extra call to skb_dstref_reset to icmp_route_lookup to clear the ip_route_input entry. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250818154032.3173645-5-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 81b84de32bb2 ("xfrm: fix ip_rt_bug race in icmp_route_lookup reverse path") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net: atm: fix crash due to unvalidated vcc pointer in sigd_send()Jiayuan Chen
[ Upstream commit ae88a5d2f29b69819dc7b04086734439d074a643 ] Reproducer available at [1]. The ATM send path (sendmsg -> vcc_sendmsg -> sigd_send) reads the vcc pointer from msg->vcc and uses it directly without any validation. This pointer comes from userspace via sendmsg() and can be arbitrarily forged: int fd = socket(AF_ATMSVC, SOCK_DGRAM, 0); ioctl(fd, ATMSIGD_CTRL); // become ATM signaling daemon struct msghdr msg = { .msg_iov = &iov, ... }; *(unsigned long *)(buf + 4) = 0xdeadbeef; // fake vcc pointer sendmsg(fd, &msg, 0); // kernel dereferences 0xdeadbeef In normal operation, the kernel sends the vcc pointer to the signaling daemon via sigd_enq() when processing operations like connect(), bind(), or listen(). The daemon is expected to return the same pointer when responding. However, a malicious daemon can send arbitrary pointer values. Fix this by introducing find_get_vcc() which validates the pointer by searching through vcc_hash (similar to how sigd_close() iterates over all VCCs), and acquires a reference via sock_hold() if found. Since struct atm_vcc embeds struct sock as its first member, they share the same lifetime. Therefore using sock_hold/sock_put is sufficient to keep the vcc alive while it is being used. Note that there may be a race with sigd_close() which could mark the vcc with various flags (e.g., ATM_VF_RELEASED) after find_get_vcc() returns. However, sock_hold() guarantees the memory remains valid, so this race only affects the logical state, not memory safety. [1]: https://gist.github.com/mrpre/1ba5949c45529c511152e2f4c755b0f3 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+1f22cb1769f249df9fa0@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69039850.a70a0220.5b2ed.005d.GAE@google.com/T/ Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260205095501.131890-1-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>