path: root/net
Age | Commit message | Author
2026-01-25 | net: core: neighbour: Make another netlink notification atomically | Petr Machata
Similarly to the issue from the previous patch, neigh_timer_handler() also updates the neighbor separately from formatting and sending the netlink notification message. We have not seen reports of this causing trouble, but in theory the same sort of issue could come up: neigh_timer_handler() makes its changes as necessary, but before it formats and sends a notification, it is interrupted by another thread, which makes a parallel change and sends its own message. The message prompted by the earlier change then contains information that does not reflect that change.

To solve this, the netlink notification needs to be in the same critical section that updates the neighbor. The critical section is ended by the neigh_probe() call, which drops the lock before calling solicit. Stretching the critical section over the solicit call is problematic, because that call can then involve all sorts of forwarding callbacks. Therefore, like in the previous patch, split the netlink notification away from the internal one and move it ahead of the probe call.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/e440118511cbdbe1d88eb0d71c9047116feb96e0.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | net: core: neighbour: Make one netlink notification atomically | Petr Machata
As noted in a previous patch, one race remains in the current code. A kernel thread might interrupt a userspace thread after the change is done, but before formatting and sending the message. Then what we would see is two messages with the same contents:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
      write_lock_bh(n->lock)
      n->nud_state = STALE
      write_unlock_bh(n->lock)
          -------------------------->
                                       neigh:update
                                         write_lock_bh(n->lock)
                                         n->nud_state = REACHABLE
                                         write_unlock_bh(n->lock)
                                       neigh_notify
                                         read_lock_bh(n->lock)
                                         __neigh_fill_info
                                           ndm->nud_state = REACHABLE
                                         rtnl_notify
                                         read_unlock_bh(n->lock)
                                       RTNL REACHABLE sent
          <--------
    neigh_notify
      read_lock_bh(n->lock)
      __neigh_fill_info
        ndm->nud_state = REACHABLE
      rtnl_notify
      read_unlock_bh(n->lock)
    RTNL REACHABLE sent again

The solution is to send the netlink message inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state.

To that end, in __neigh_update(), move the current neigh_notify() call up to said critical section, and convert it to __neigh_notify(), because the lock is held. This motion crosses calls to neigh_update_managed_list(), neigh_update_gc_list() and neigh_update_process_arp_queue(), all of which potentially unlock and give an opportunity for the above race.

This also crosses a call to neigh_update_process_arp_queue(), which calls neigh->output(), which might be neigh_resolve_output(), which calls neigh_event_send(), which calls neigh_event_send_probe(), which calls __neigh_event_send(), which calls neigh_probe(), which touches neigh->probes, an update which will now not be visible in the notification.

However, there are indications that these changes were never promised to be accurately reflected in notifications: fib6_table_lookup() indirectly calls route.c's find_match(), which calls rt6_probe(), which looks up a neighbor and calls __neigh_set_probe_once(), which sets neigh->probes to 0, but neither this function nor its caller seems to send a notification.

Additionally, the neighbor object that the neigh_probe() mentioned above is called on might be the alternative neighbor looked up for the ARP queue packet destination. If that is the case, the changed value of n1->probes is not notified anywhere.

So at least in some circumstances, the reported number of probes needs to be assumed to change without notification.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ceb44995498eb52375cb2d46c3245bdb9e74b355.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
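The idea of notifying inside the same critical section that changes the state can be sketched in userspace C. This is only an illustration under stated assumptions: a pthread mutex stands in for the neighbor lock, an in-memory array stands in for the netlink socket, and every name (toy_update(), toy_notify_locked(), struct notify_log) is hypothetical rather than kernel API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum nud_state { NUD_STALE, NUD_REACHABLE };	/* toy subset of states */

#define LOG_MAX 16

struct neigh {
	pthread_mutex_t lock;		/* stands in for n->lock */
	enum nud_state nud_state;
};

struct notify_log {			/* stands in for the netlink socket */
	enum nud_state sent[LOG_MAX];
	size_t count;
};

/* Caller must hold n->lock, mirroring the __neigh_notify() convention. */
static void toy_notify_locked(const struct neigh *n, struct notify_log *log)
{
	assert(log->count < LOG_MAX);
	log->sent[log->count++] = n->nud_state;
}

/*
 * Change the state and emit the notification in one critical section,
 * so no other updater can slip in between the two steps. In this sketch
 * the log is protected by the same lock.
 */
static void toy_update(struct neigh *n, enum nud_state new_state,
		       struct notify_log *log)
{
	pthread_mutex_lock(&n->lock);
	n->nud_state = new_state;
	toy_notify_locked(n, log);
	pthread_mutex_unlock(&n->lock);
}
```

Because the recorded value is read while the lock is still held, each log entry matches the state that the corresponding toy_update() call just wrote, and entries appear in the order the updates took the lock.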
2026-01-25 | net: core: neighbour: Reorder netlink & internal notification | Petr Machata
The netlink message needs to be sent inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state. On the other hand, there is no such need in the case of the notifier chain: the listeners do not assume the lock is held, and often in fact just schedule a delayed work to act on the neighbor later. At least one in fact also takes the neighbor lock.

This requires that the netlink notification be done before the internal notifier chain message is sent. That is safe to do, because the current listeners, as well as __neigh_notify(), only read the updated neighbor fields, and never modify them. (Apart from locking.)

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/f3ef74d5460f14c4d102b8a5857d4a6624da9a5a.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | net: core: neighbour: Inline neigh_update_notify() calls | Petr Machata
The obvious idea behind the helper is to keep together the two bits that should be done either both or neither: the internal notifier chain message, and the netlink notification.

To make sure that the notification sent reflects the change being made, the netlink message needs to be sent inside the critical section where the neighbor is changed. But for the notifier chain, there is no such need: the listeners do not assume the lock is held, and often in fact just schedule a delayed work to act on the neighbor later. At least one in fact also takes the neighbor lock. Therefore these two items each have different locking needs.

Now we could unlock inside the helper, but I find that error prone, and the fact that the notification is conditional in the first place does not help to make the call site obvious.

So in this patch, the helper is instead removed and the body, which is just these two calls, inlined. That way we can use each notifier independently.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/e65dce5882bc6f4aa2530b8a4877d0e003071a1a.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | net: core: neighbour: Process ARP queue later | Petr Machata
ARP queue processing unlocks the neighbor lock, which can allow another thread to asynchronously perform a neighbor update and send an out of order notification. Therefore this needs to be done after the notification is sent. Move it just before the end of the critical section. Since neigh_update_process_arp_queue() unlocks, it does not form a part of the critical section anymore but it can benefit from the lock being taken. The intention is to eventually do the RTNL notification before this call. This motion crosses a call to neigh_update_is_router(), which should not influence processing of the ARP queue. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/9ea7159e71430ebdc837ebcc880a76b7e82e52a4.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | net: core: neighbour: Extract ARP queue processing to a helper function | Petr Machata
In order to make manipulation with this bit of code clearer, extract it to a helper function, neigh_update_process_arp_queue(). Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/8b0fa0abe2cf0e24484903f5436fe0ac64163057.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | net: core: neighbour: Call __neigh_notify() under a lock | Petr Machata
Andy Roulin has described an issue with the current neighbor notification scheme as follows. This was also presented publicly at the link below.

neigh_update sends a rtnl notification if an update, e.g., nud_state change, was done but there is no guarantee of ordering of the rtnl notifications. Consider the following scenario:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
      write_lock_bh(n->lock)
      n->nud_state = STALE
      write_unlock_bh(n->lock)
      neigh_notify
        neigh_fill_info
          read_lock_bh(n->lock)
          ndm->nud_state = STALE
          read_unlock_bh(n->lock)
          -------------------------->
                                       neigh:update
                                         write_lock_bh(n->lock)
                                         n->nud_state = REACHABLE
                                         write_unlock_bh(n->lock)
                                         neigh_notify
                                           neigh_fill_info
                                             read_lock_bh(n->lock)
                                             ndm->nud_state = REACHABLE
                                             read_unlock_bh(n->lock)
                                           rtnl_notify
                                       RTNL REACHABLE sent
          <--------
        rtnl_notify
    RTNL STALE sent

In this scenario, the kernel neigh is updated first to STALE and then REACHABLE but the netlink notifications are sent out of order, first REACHABLE and then STALE.

The solution is to send the netlink message inside the same critical section that formats the message. That way both the contents and ordering of the message reflect the same state, and we cannot see the abovementioned out-of-order delivery.

Even with this patch, a remaining issue is that the contents of the message may not reflect the changes made to the neighbor. A kernel thread might still interrupt a userspace thread after the change is done, but before formatting and sending the message. Then what we would see is two messages with the same contents. The following patches will attempt to address that issue.

To support those future patches, convert __neigh_notify() to a helper that assumes that the neighbor lock is already taken by having it call __neigh_fill_info() instead of neigh_fill_info(). Add a new helper, neigh_notify(), which takes the lock before calling __neigh_notify(). Migrate all callers to use the latter.

Link: https://lore.kernel.org/netdev/ed6768c1-80b8-aee2-e545-b51661d49336@nvidia.com/
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/4b4368dcc5f5a7e407009cb6c36b69cfb5282864.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | net: core: neighbour: Add a neigh_fill_info() helper for when lock not held | Petr Machata
The netlink message needs to be formatted and sent inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state. Because it will happen inside an already existing critical section, it has to assume that the neighbor lock is held. Add a helper __neigh_fill_info(), which is like neigh_fill_info(), but makes this assumption. Convert neigh_fill_info() to a wrapper around this new helper. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/7ec20113d5d809200e3534d3ed8f0004514914b8.1769012464.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
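The split between a helper that assumes the lock is held and a wrapper that takes it can be sketched as follows. This is a userspace illustration only: a pthread mutex replaces the neighbor lock, a text buffer replaces the netlink message, and struct entry, entry_fill_info_locked() and entry_fill_info() are hypothetical names, not the kernel functions.

```c
#include <pthread.h>
#include <stdio.h>

struct entry {
	pthread_mutex_t lock;
	int state;
	int probes;
};

/* Caller must already hold e->lock (the role __neigh_fill_info() plays). */
static int entry_fill_info_locked(const struct entry *e, char *buf, size_t len)
{
	return snprintf(buf, len, "state=%d probes=%d", e->state, e->probes);
}

/* For callers that do not hold the lock (the role neigh_fill_info() plays). */
static int entry_fill_info(struct entry *e, char *buf, size_t len)
{
	int ret;

	pthread_mutex_lock(&e->lock);
	ret = entry_fill_info_locked(e, buf, len);
	pthread_mutex_unlock(&e->lock);
	return ret;
}
```

Code that is already inside a critical section calls the _locked variant directly, while everything else goes through the wrapper, so the formatting logic lives in one place.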
2026-01-25 | ipv4: igmp: annotate data-races around idev->mr_maxdelay | Eric Dumazet
idev->mr_maxdelay is read and written locklessly, so add READ_ONCE()/WRITE_ONCE() annotations. While we are at it, make this field a u32.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260122172247.2429403-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
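What such annotations look like can be sketched in userspace C. The macros below are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (they rely on the GCC/Clang __typeof__ extension and omit the kernel's type checks), and struct toy_in_device with its two helpers is hypothetical.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros: one volatile access each. */
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct toy_in_device {
	uint32_t mr_maxdelay;	/* read and written without a lock */
};

static void toy_set_maxdelay(struct toy_in_device *idev, uint32_t delay)
{
	WRITE_ONCE(idev->mr_maxdelay, delay);
}

static uint32_t toy_clamp_delay(struct toy_in_device *idev, uint32_t want)
{
	/* Load once into a local, then reuse it, so both uses agree. */
	uint32_t max = READ_ONCE(idev->mr_maxdelay);

	return want > max ? max : want;
}
```

Reading the field once into a local keeps the comparison and the returned value consistent even if another thread updates the field in between.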
2026-01-25 | net: expand NETDEV_RSS_KEY_LEN to 256 bytes | Eric Dumazet
NETDEV_RSS_KEY_LEN has been 52 bytes since 2014. Jakub suggested we bump the size to 128 bytes or more. Some drivers (like idpf) were already working around the core limit. Since this change might cause some issues in admin scripts, bump it directly to 256 in one go.

    tjbp26:~# cat /proc/sys/net/core/netdev_rss_key | wc -c
    768

    tjbp26:~# ethtool -x eth1
    RX flow hash indirection table for eth1 with 32 RX ring(s):
    ...
    RSS hash key:
    fe:16:5b:2f:93:85:c2:c9:c1:ef:bd:60:c6:e0:2b:99:4d:bf:b7:14:c8:1e:8d:cb:31:17:51:da:55:eb:91:d9:9e:f9:89:9b:44:a1:dc:08:72:3a:b3:d6:31:86:9a:fe:02:3a:0d:eb:a1:7c:f5:a3:51:3b:08:56:c9:3f:71:69:01:ba:70:38
    RSS hash function:
        toeplitz: on
        xor: off
        crc32: off

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20260122075206.504ec591@kernel.org/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260122190349.2771064-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
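The 768 reported by wc -c is consistent with a 256-byte key printed as two hex digits per byte, joined by colons and ended with a newline, which gives 3 characters per byte. That output format is an assumption made for this sketch; format_key() below is a hypothetical userspace helper, not the kernel's sysctl handler.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format a key as colon-separated hex followed by a newline.
 * Returns the number of characters written (excluding the NUL),
 * or 0 if the key is empty or the buffer is too small.
 */
static size_t format_key(const unsigned char *key, size_t len,
			 char *out, size_t out_size)
{
	size_t pos = 0;

	if (len == 0 || out_size < 3 * len + 1)
		return 0;
	for (size_t i = 0; i < len; i++)
		pos += (size_t)sprintf(out + pos, "%02x%c", (unsigned int)key[i],
				       i + 1 < len ? ':' : '\n');
	return pos;
}
```

Under the same assumption, the previous 52-byte key would have produced 156 characters, which is the kind of difference an admin script parsing this file might notice.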
2026-01-25 | net: inline get_netmem() and put_netmem() | Eric Dumazet
These helpers are used in network fast paths. Only call out-of-line helpers for netmem case. We might consider inlining __get_netmem() and __put_netmem() in the future.

$ scripts/bloat-o-meter -t vmlinux.3 vmlinux.4
add/remove: 6/6 grow/shrink: 22/1 up/down: 2614/-646 (1968)
Function                           old     new   delta
pskb_carve                        1669    1894    +225
gro_pull_from_frag0                  -     206    +206
get_page                           190     380    +190
skb_segment                       3561    3747    +186
put_page                           595     765    +170
skb_copy_ubufs                    1683    1822    +139
__pskb_trim_head                   276     401    +125
__pskb_copy_fclone                 734     858    +124
skb_zerocopy                      1092    1215    +123
pskb_expand_head                   892    1008    +116
skb_split                          828     940    +112
skb_release_data                   297     409    +112
___pskb_trim                       829     941    +112
__skb_zcopy_downgrade_managed      120     226    +106
tcp_clone_payload                  530     634    +104
esp_ssg_unref                      191     294    +103
dev_gro_receive                   1464    1514     +50
__put_netmem                         -      41     +41
__get_netmem                         -      41     +41
skb_shift                         1139    1175     +36
skb_try_coalesce                   681     714     +33
__pfx_put_page                     112     144     +32
__pfx_get_page                      32      64     +32
__pskb_pull_tail                  1137    1168     +31
veth_xdp_get                       250     267     +17
__pfx_gro_pull_from_frag0            -      16     +16
__pfx___put_netmem                   -      16     +16
__pfx___get_netmem                   -      16     +16
__pfx_put_netmem                    16       -     -16
__pfx_gro_try_pull_from_frag0       16       -     -16
__pfx_get_netmem                    16       -     -16
put_netmem                         114       -    -114
get_netmem                         130       -    -130
napi_gro_frags                     929     771    -158
gro_try_pull_from_frag0            196       -    -196
Total: Before=22565857, After=22567825, chg +0.01%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | net: inline net_is_devmem_iov() | Eric Dumazet
1) Inline this small helper to reduce code size and decrease cpu costs.
2) Constify its argument.
3) Move it to include/net/netmem.h, as a prereq for the following patch.

$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/4 up/down: 0/-158 (-158)
Function                           old     new   delta
validate_xmit_skb                  866     857      -9
__pfx_net_is_devmem_iov             16       -     -16
net_is_devmem_iov                   22       -     -22
get_netmem                         152     130     -22
put_netmem                         140     114     -26
tcp_recvmsg_locked                3860    3797     -63
Total: Before=22566015, After=22565857, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | gro: change the BUG_ON() in gro_pull_from_frag0() | Eric Dumazet
Replace the BUG_ON() which never fired with a DEBUG_NET_WARN_ON_ONCE().

$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 2/2 grow/shrink: 1/1 up/down: 370/-254 (116)
Function                           old     new   delta
gro_try_pull_from_frag0              -     196    +196
napi_gro_frags                     771     929    +158
__pfx_gro_try_pull_from_frag0        -      16     +16
__pfx_gro_pull_from_frag0           16       -     -16
dev_gro_receive                   1514    1464     -50
gro_pull_from_frag0                188       -    -188
Total: Before=22565899, After=22566015, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | ipv6: use the right ifindex when replying to icmpv6 from localhost | Fernando Fernandez Mancera
When replying to an ICMPv6 echo request that comes from a localhost address, the right output ifindex is 1 (lo) and not the rt6i_idev dev index. Use the skb device ifindex instead.

This fixes pinging a local address from a localhost source address.

    $ ping6 -I ::1 2001:1:1::2 -c 3
    PING 2001:1:1::2 (2001:1:1::2) from ::1 : 56 data bytes
    64 bytes from 2001:1:1::2: icmp_seq=1 ttl=64 time=0.037 ms
    64 bytes from 2001:1:1::2: icmp_seq=2 ttl=64 time=0.069 ms
    64 bytes from 2001:1:1::2: icmp_seq=3 ttl=64 time=0.122 ms

    2001:1:1::2 ping statistics
    3 packets transmitted, 3 received, 0% packet loss, time 2032ms
    rtt min/avg/max/mdev = 0.037/0.076/0.122/0.035 ms

Fixes: 1b70d792cf67 ("ipv6: Use rt6i_idev index for echo replies to a local address")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260121194409.6749-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 | net: bridge: mcast: fix memcpy with u64_stats | David Yang
On 64bit arches, struct u64_stats_sync is empty and provides no help against load/store tearing. memcpy() should not be considered atomic against u64 values. Use u64_stats_copy() instead. Signed-off-by: David Yang <mmyangfl@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260120092137.2161162-3-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-24 | bpf: add fsession support | Menglong Dong
The fsession is similar to a kprobe session. It allows attaching a single BPF program to both the entry and the exit of the target functions.

Introduce struct bpf_fsession_link, which allows adding the link to both the fentry and fexit progs_hlist of the trampoline.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Co-developed-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260124062008.8657-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-23 | net/rds: rds_tcp_accept_one ought to not discard messages | Gerd Rausch
RDS/TCP differs from RDS/RDMA in that message acknowledgment is done based on TCP sequence numbers: As soon as the last byte of a message has been acknowledged by the TCP stack of a peer, "rds_tcp_write_space()" goes on to discard prior messages from the send queue.

That is fine for as long as the receiver never throws any messages away.

Unfortunately, that is *not* the case since the introduction of MPRDS:

    commit 1a0e100fb2c96 "RDS: TCP: Enable multipath RDS for TCP"

A new function "rds_tcp_accept_one_path" was introduced, which is entitled to return "NULL" if no connection path is currently available.

Unfortunately, this happens after the "->accept()" call, and the new socket often already contains messages, since the peer already transitioned to "RDS_CONN_UP" on behalf of "TCP_ESTABLISHED".

That's also the case after this [1]:

    commit 1a0e100fb2c96 "RDS: TCP: Force every connection to be initiated by numerically smaller IP address"

which tried to address the situation of pending data by only transitioning connections from a smaller IP address to "RDS_CONN_UP".

But even in those cases, and in particular if the "RDS_EXTHDR_NPATHS" handshake has not occurred yet, and therefore we're working with "c_npaths <= 1", "c_conn[0]" may be in a state distinct from "RDS_CONN_DOWN", and therefore all messages on the just accepted socket will be tossed away.

This fix changes "rds_tcp_accept_one":

* If connected from a peer with a larger IP address, the new socket will continue to get closed right away. With commit [1] above, there should not be any messages in the socket receive buffer, since the peer never transitioned to "RDS_CONN_UP". Therefore it should be okay to not make any efforts to dispatch the socket receive buffer.

* If connected from a peer with a smaller IP address, we call "rds_tcp_accept_one_path" to find a free slot/"path". If found, business goes on as usual. If none was found, we save/stash the newly accepted socket into "rds_tcp_accepted_sock", in order to not lose any messages that may have arrived already. We then return from "rds_tcp_accept_one" with "-ENOBUFS". Later on, when a slot/"path" does become available again (e.g. state transitioned to "RDS_CONN_DOWN", or HS extension header was received with "c_npaths > 1") we call "rds_tcp_conn_slots_available", which simply re-issues a "rds_tcp_accept_one_path" worker-callback and picks up the new socket from "rds_tcp_accepted_sock", thereby continuing where it left off with "-ENOBUFS" last time. Since a new slot has become available, those messages won't be lost, since processing proceeds as if that slot had been available the first time around.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260122055213.83608-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 | net/rds: No shortcut out of RDS_CONN_ERROR | Gerd Rausch
RDS connections carry a state "rds_conn_path::cp_state", and transitions from one state to another are conditional upon an expected state: "rds_conn_path_transition".

There is one exception to this conditionality, which is "RDS_CONN_ERROR", which can be enforced by "rds_conn_path_drop" regardless of what state the connection is currently in.

But as soon as a connection enters state "RDS_CONN_ERROR", the connection handling code expects it to go through the shutdown-path.

The RDS/TCP multipath changes added a shortcut out of "RDS_CONN_ERROR" straight back to "RDS_CONN_CONNECTING" via "rds_tcp_accept_one_path" (e.g. after "rds_tcp_state_change").

A subsequent "rds_tcp_reset_callbacks" can then transition the state to "RDS_CONN_RESETTING" with a shutdown-worker queued.

That'll trip up "rds_conn_init_shutdown", which was never adjusted to handle "RDS_CONN_RESETTING" and subsequently drops the connection with the dreaded "DR_INV_CONN_STATE", which leaves "RDS_SHUTDOWN_WORK_QUEUED" on forever.

So we do two things here:

a) Don't shortcut "RDS_CONN_ERROR", but take the longer path through the shutdown code.

b) Add "RDS_CONN_RESETTING" to the expected states in "rds_conn_init_shutdown" so that we won't error out and get stuck, if we ever hit weird state transitions like this again.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260122055213.83608-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
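A conditional state transition with one unconditional exception can be sketched with C11 atomics. This is a toy model under stated assumptions: the enum values, struct conn_path, conn_transition() and conn_drop() are hypothetical and only mimic the shape of the scheme described above, not the RDS implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum conn_state {
	CONN_DOWN,
	CONN_CONNECTING,
	CONN_UP,
	CONN_DISCONNECTING,
	CONN_ERROR,
};

struct conn_path {
	atomic_int state;
};

/* Move old -> new_state only if the path is still in 'old'. */
static bool conn_transition(struct conn_path *cp, int old, int new_state)
{
	return atomic_compare_exchange_strong(&cp->state, &old, new_state);
}

/* The one unconditional transition: force the error state. */
static void conn_drop(struct conn_path *cp)
{
	atomic_store(&cp->state, CONN_ERROR);
}
```

In this model an accept path that only ever attempts CONN_DOWN to CONN_CONNECTING is refused while the path sits in CONN_ERROR. The path has to walk through the shutdown states back to CONN_DOWN first, which is the "longer path" the message asks for.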
2026-01-23 | net: add queue config validation callback | Jakub Kicinski
I imagine (tm) that as the number of per-queue configuration options grows, some of them may conflict for certain drivers. While the drivers can obviously do all the validation locally, doing so is fairly inconvenient, as the config is fed to drivers piecemeal via different ops (for different params and NIC-wide vs per-queue).

Add a centralized callback for validating the queue config in queue ops. The callback gets invoked before the memory provider is installed, and in the future should also be called when ring params are modified.

The validation is done after each layer of configuration. Since we can't fail MP un-binding, we must make sure that the config is valid both before and after MP overrides are applied. This is moot for now since the sets of MP and device configs are disjoint. It will matter significantly in the future, so add it now so that we don't forget.

Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 | net: use netdev_queue_config() for mp restart | Jakub Kicinski
We should follow the prepare/commit approach for queue configuration. The qcfg struct should be added to dev->cfg rather than directly to queue objects so that we can clone and discard the pending config easily. Remove the qcfg in struct netdev_rx_queue, and switch remaining callers to netdev_queue_config(). netdev_queue_config() will construct the qcfg on the fly based on device defaults and state of the queue. ndo_default_qcfg becomes optional because having the callback itself does not have any meaningful semantics to us. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Link: https://patch.msgid.link/20260122005113.2476634-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 | net: move mp->rx_page_size validation to __net_mp_open_rxq() | Jakub Kicinski
Move mp->rx_page_size validation where the rest of MP input validation lives. No other caller is modifying mp params so validation logic in queue restarts is out of place. Link: https://patch.msgid.link/20260122005113.2476634-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 | net: introduce a trivial netdev_queue_config() | Jakub Kicinski
We may choose to extend or reimplement the logic which renders the per-queue config. The drivers should not poke directly into the queue state. Add a helper for drivers to use when they want to query the config for a specific queue. Link: https://patch.msgid.link/20260122005113.2476634-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 | net: introduce mangleid_features | Paolo Abeni
Some/most devices implementing gso_partial need to disable the GSO partial features when the IP ID can't be mangled; to that end, each of them implements something like the following[1]:

    if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID))
        features &= ~NETIF_F_TSO;

in the ndo_features_check() op, which leads to a bit of duplicate code.

A later patch in the series will implement GSO partial support for virtual devices, and the current status quo would require more duplicate code and a new indirect call in the TX path for them.

Introduce the mangleid_features mask, allowing the core to disable NIC features based on/requiring MANGLEID, without any further intervention from the driver.

The same functionality could alternatively be implemented by adding a single boolean flag to struct net_device, but that would require additional checks in ndo_features_check().

Also note that [1] is incorrect if the NIC additionally implements NETIF_F_GSO_UDP_L4; mangleid_features transparently handles even such a case.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/5a7cdaeea40b0a29b88e525b6c942d73ed3b8ce7.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
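How a per-device mask could replace the per-driver check can be sketched as below. This is an illustration under assumptions: the bit values, struct toy_netdev and toy_features_check() are hypothetical, and how the core actually applies mangleid_features is not shown in the message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define F_TSO          (UINT64_C(1) << 0)
#define F_TSO_MANGLEID (UINT64_C(1) << 1)
#define F_GSO_UDP_L4   (UINT64_C(1) << 2)
#define F_SG           (UINT64_C(1) << 3)

struct toy_netdev {
	uint64_t mangleid_features;	/* features that rely on mangling the IP ID */
};

/*
 * Generic check: if the packet is encapsulated and the IP ID may not be
 * mangled, clear every feature the device declared as MANGLEID-dependent.
 */
static uint64_t toy_features_check(const struct toy_netdev *dev,
				   bool encapsulation, uint64_t features)
{
	if (encapsulation && !(features & F_TSO_MANGLEID))
		features &= ~dev->mangleid_features;
	return features;
}
```

A device that sets both F_TSO and F_GSO_UDP_L4 in its mask gets both cleared in one place, which is the case the hard-coded F_TSO-only check in snippet [1] would miss.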
2026-01-23 | tcp: move tcp_stream_memory_free() to tcp.c | Eric Dumazet
Moving tcp_stream_memory_free() to tcp.c allows the compiler to (auto)inline it from tcp_poll() and tcp_sendmsg_locked() for better performance.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 2/0 up/down: 118/0 (118)
Function                           old     new   delta
tcp_poll                           840     923     +83
tcp_sendmsg_locked                4217    4252     +35
Total: Before=22573095, After=22573213, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260122090228.1678207-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.19-rc7). Conflicts: drivers/net/ethernet/huawei/hinic3/hinic3_irq.c b35a6fd37a00 ("hinic3: Add adaptive IRQ coalescing with DIM") fb2bb2a1ebf7 ("hinic3: Fix netif_queue_set_napi queue_index input parameter error") https://lore.kernel.org/fc0a7fdf08789a52653e8ad05281a0a849e79206.1768915707.git.zhuyikai1@h-partners.com drivers/net/wireless/ath/ath12k/mac.c drivers/net/wireless/ath/ath12k/wifi7/hw.c 31707572108d ("wifi: ath12k: Fix wrong P2P device link id issue") c26f294fef2a ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module") https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au Adjacent changes: drivers/net/wireless/ath/ath12k/mac.c 8b8d6ee53dfd ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel") 914c890d3b90 ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22 | tcp: move tcp_rate_check_app_limited() to tcp.c | Eric Dumazet
tcp_rate_check_app_limited() is used from tcp_sendmsg_locked() fast path and from other callers. Move it to tcp.c so that it can be inlined in tcp_sendmsg_locked(). Small increase of code, for better TCP performance.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/0 up/down: 87/0 (87)
Function                           old     new   delta
tcp_sendmsg_locked                4217    4304     +87
Total: Before=22566462, After=22566549, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260121095923.3134639-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22 | tcp: move tcp_rate_gen to tcp_input.c | Eric Dumazet
This function is called from one caller only, in TCP fast path. Move it to tcp_input.c so that compiler can inline it.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 226/-300 (-74)
Function                           old     new   delta
tcp_ack                           5405    5631    +226
__pfx_tcp_rate_gen                  16       -     -16
tcp_rate_gen                       284       -    -284
Total: Before=22566536, After=22566462, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260121095923.3134639-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22 | Bluetooth: MGMT: Fix memory leak in set_ssp_complete | Jianpeng Chang
Fix memory leak in set_ssp_complete() where mgmt_pending_cmd structures are not freed after being removed from the pending list. Commit 302a1f674c00 ("Bluetooth: MGMT: Fix possible UAFs") replaced mgmt_pending_foreach() calls with individual command handling but missed adding mgmt_pending_free() calls in both error and success paths of set_ssp_complete(). Other completion functions like set_le_complete() were fixed correctly in the same commit. This causes a memory leak of the mgmt_pending_cmd structure and its associated parameter data for each SSP command that completes. Add the missing mgmt_pending_free(cmd) calls in both code paths to fix the memory leak. Also fix the same issue in set_advertising_complete(). Fixes: 302a1f674c00 ("Bluetooth: MGMT: Fix possible UAFs") Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-01-22 | netfilter: nft_set_rbtree: remove seqcount_rwlock_t | Pablo Neira Ayuso
After the conversion to binary search array, this is not required anymore. Remove it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-22 | netfilter: nft_set_rbtree: use binary search array in get command | Pablo Neira Ayuso
Rework the .get interface to use the binary search array. This needs a specific lookup function to match on end intervals (<=). Packet path lookup is slightly different because the match is on a lesser value, not an equal one (i.e. <).

After this patch, seqcount can be removed in a follow up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-22 | netfilter: nft_set_rbtree: translate rbtree to array for binary search | Pablo Neira Ayuso
The rbtree can temporarily store overlapping inactive elements during the transaction processing, leading to false negative lookups.

To address this issue, this patch adds a .commit function that walks the rbtree to build an array of intervals of ordered elements. This conversion compacts the two singleton elements that represent the start and the end of the interval into a single interval object for space efficiency.

Binary search is O(log n), similar to rbtree lookup time; therefore, performance numbers should be similar, and there is an implementation available under lib/bsearch.c and include/linux/bsearch.h that is used for this purpose.

This slightly increases memory consumption for this new array that stores pointers to the start and the end of the interval.

With this patch:

    # time nft -f 100k-intervals-set.nft
    real    0m4.218s
    user    0m3.544s
    sys     0m0.400s

Without this patch:

    # time nft -f 100k-intervals-set.nft
    real    0m3.920s
    user    0m3.547s
    sys     0m0.276s

With this patch, with IPv4 intervals:

    baseline rbtree (match on first field only): 15254954pps

Without this patch:

    baseline rbtree (match on first field only): 10256119pps

This provides a ~50% improvement in matching intervals from packet path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
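Looking up a key in a sorted array of intervals with binary search can be sketched with the C library's bsearch(). This is a userspace illustration, not the nft_set_rbtree code: struct interval, cmp_key_interval() and interval_lookup() are hypothetical, keys are plain 32-bit integers, and the end of each interval is treated as inclusive for simplicity.

```c
#include <stdint.h>
#include <stdlib.h>

struct interval {
	uint32_t start;		/* inclusive */
	uint32_t end;		/* inclusive in this sketch */
};

/* bsearch() comparator: the key is a uint32_t, the element an interval. */
static int cmp_key_interval(const void *key, const void *elem)
{
	uint32_t k = *(const uint32_t *)key;
	const struct interval *iv = elem;

	if (k < iv->start)
		return -1;
	if (k > iv->end)
		return 1;
	return 0;
}

/* The array must be sorted by start and must not contain overlaps. */
static const struct interval *interval_lookup(const struct interval *arr,
					      size_t n, uint32_t key)
{
	return bsearch(&key, arr, n, sizeof(*arr), cmp_key_interval);
}
```

Because each array element already holds both ends of an interval, a single comparison decides whether the key is below, inside or above it, which is what lets one O(log n) search replace a tree walk over separate start and end elements.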
2026-01-22netfilter: nf_tables: add .abort_skip_removal flag for set typesPablo Neira Ayuso
The pipapo set backend is the only user of the .abort interface so far. To speed up the pipapo abort path, removals are skipped. The follow-up patch updates the rbtree to build an array of ordered elements, then use binary search. This needs a new .abort interface but, unlike pipapo, it also needs to undo/remove elements. Add a flag and use it from the pipapo set backend. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-22net/sched: act_ife: avoid possible NULL derefEric Dumazet
tcf_ife_encode() must make sure ife_encode() does not return NULL. syzbot reported: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:ife_tlv_meta_encode+0x41/0xa0 net/ife/ife.c:166 CPU: 3 UID: 0 PID: 8990 Comm: syz.0.696 Not tainted syzkaller #0 PREEMPT(full) Call Trace: <TASK> ife_encode_meta_u32+0x153/0x180 net/sched/act_ife.c:101 tcf_ife_encode net/sched/act_ife.c:841 [inline] tcf_ife_act+0x1022/0x1de0 net/sched/act_ife.c:877 tc_act include/net/tc_wrapper.h:130 [inline] tcf_action_exec+0x1c0/0xa20 net/sched/act_api.c:1152 tcf_exts_exec include/net/pkt_cls.h:349 [inline] mall_classify+0x1a0/0x2a0 net/sched/cls_matchall.c:42 tc_classify include/net/tc_wrapper.h:197 [inline] __tcf_classify net/sched/cls_api.c:1764 [inline] tcf_classify+0x7f2/0x1380 net/sched/cls_api.c:1860 multiq_classify net/sched/sch_multiq.c:39 [inline] multiq_enqueue+0xe0/0x510 net/sched/sch_multiq.c:66 dev_qdisc_enqueue+0x45/0x250 net/core/dev.c:4147 __dev_xmit_skb net/core/dev.c:4262 [inline] __dev_queue_xmit+0x2998/0x46c0 net/core/dev.c:4798 Fixes: 295a6e06d21e ("net/sched: act_ife: Change to use ife module") Reported-by: syzbot+5cf914f193dffde3bd3c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6970d61d.050a0220.706b.0010.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yotam Gigi <yotam.gi@gmail.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260121133724.3400020-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22Merge tag 'wireless-2026-11-22' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Another set of updates: - various small fixes for ath10k/ath12k/mwifiex/rsi - cfg80211 fix for HE bitrate overflow - mac80211 fixes - S1G beacon handling in scan - skb tailroom handling for HW encryption - CSA fix for multi-link - handling of disabled links during association * tag 'wireless-2026-11-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: cfg80211: ignore link disabled flag from userspace wifi: mac80211: apply advertised TTLM from association response wifi: mac80211: parse all TTLM entries wifi: mac80211: don't increment crypto_tx_tailroom_needed_cnt twice wifi: mac80211: don't perform DA check on S1G beacon wifi: ath12k: Fix wrong P2P device link id issue wifi: ath12k: fix dead lock while flushing management frames wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel wifi: ath12k: cancel scan only on active scan vdev wifi: mwifiex: Fix a loop in mwifiex_update_ampdu_rxwinsize() wifi: mac80211: correctly check if CSA is active wifi: cfg80211: Fix bitrate calculation overflow for HE rates wifi: rsi: Fix memory corruption due to not set vif driver data size wifi: ath12k: don't force radio frequency check in freq_to_idx() wifi: ath12k: fix dma_free_coherent() pointer wifi: ath10k: fix dma_free_coherent() pointer ==================== Link: https://patch.msgid.link/20260122110248.15450-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22vsock/virtio: cap TX credit to local buffer sizeMelbin K Mathew
The virtio transport derives its TX credit directly from peer_buf_alloc, which is set from the remote endpoint's SO_VM_SOCKETS_BUFFER_SIZE value.

On the host side this means that the amount of data we are willing to queue for a connection is scaled by a guest-chosen buffer size, rather than the host's own vsock configuration. A malicious guest can advertise a large buffer and read slowly, causing the host to allocate a correspondingly large amount of sk_buff memory. The same thing would happen in the guest with a malicious host, since virtio transports share the same code base.

Introduce a small helper, virtio_transport_tx_buf_size(), that returns min(peer_buf_alloc, buf_alloc), and use it wherever we consume peer_buf_alloc.

This ensures the effective TX window is bounded by both the peer's advertised buffer and our own buf_alloc (already clamped to buffer_max_size via SO_VM_SOCKETS_BUFFER_MAX_SIZE), so a remote peer cannot force the other to queue more data than allowed by its own vsock settings.

On an unpatched Ubuntu 22.04 host (~64 GiB RAM), running a PoC with 32 guest vsock connections advertising 2 GiB each and reading slowly drove Slab/SUnreclaim from ~0.5 GiB to ~57 GiB; the system only recovered after killing the QEMU process. That said, if QEMU memory is limited with cgroups, the maximum memory used will be limited.

With this patch applied:

  Before:
    MemFree:    ~61.6 GiB
    Slab:       ~142 MiB
    SUnreclaim: ~117 MiB

  After 32 high-credit connections:
    MemFree:    ~61.5 GiB
    Slab:       ~178 MiB
    SUnreclaim: ~152 MiB

Only ~35 MiB increase in Slab/SUnreclaim, no host OOM, and the guest remains responsive.

Compatibility with non-virtio transports:

- VMCI uses the AF_VSOCK buffer knobs to size its queue pairs per socket based on the local vsk->buffer_* values; the remote side cannot enlarge those queues beyond what the local endpoint configured.
- Hyper-V's vsock transport uses fixed-size VMBus ring buffers and an MTU bound; there is no peer-controlled credit field comparable to peer_buf_alloc, and the remote endpoint cannot drive in-flight kernel memory above those ring sizes.
- The loopback path reuses virtio_transport_common.c, so it naturally follows the same semantics as the virtio transport.

This change is limited to virtio_transport_common.c and thus affects virtio-vsock, vhost-vsock, and loopback, bringing them in line with the "remote window intersected with local policy" behaviour that VMCI and Hyper-V already effectively have.

Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko")
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Melbin K Mathew <mlbnkm1@gmail.com>
[Stefano: small adjustments after changing the previous patch]
[Stefano: tweak the commit message]
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20260121093628.9941-4-sgarzare@redhat.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
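As a rough userspace illustration of the min(peer_buf_alloc, buf_alloc) rule described in this commit: the struct and function below are hypothetical stand-ins, not the code in virtio_transport_common.c. Only the two field names are taken from the commit message.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-socket credit state. */
struct vsock_credit {
	uint32_t peer_buf_alloc;	/* buffer size advertised by the peer */
	uint32_t buf_alloc;		/* local buffer size, already clamped locally */
};

/* Effective TX window: the peer's advertised buffer, capped by
 * the local policy, so the remote side alone cannot enlarge it. */
static uint32_t tx_buf_size(const struct vsock_credit *c)
{
	return c->peer_buf_alloc < c->buf_alloc ? c->peer_buf_alloc
						: c->buf_alloc;
}
```

With a peer advertising 2 GiB and a local buf_alloc of 256 KiB, this sketch yields a 256 KiB window; a peer advertising less than the local size still gets its own smaller value.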
2026-01-22vsock/virtio: fix potential underflow in virtio_transport_get_credit()Melbin K Mathew
The credit calculation in virtio_transport_get_credit() uses unsigned arithmetic:

  ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);

If the peer shrinks its advertised buffer (peer_buf_alloc) while bytes are in flight, the subtraction can underflow and produce a large positive value, potentially allowing more data to be queued than the peer can handle.

Reuse virtio_transport_has_space() which already handles this case and add a comment to make it clear why we are doing that.

Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko")
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Melbin K Mathew <mlbnkm1@gmail.com>
[Stefano: use virtio_transport_has_space() instead of duplicating the code]
[Stefano: tweak the commit message]
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20260121093628.9941-2-sgarzare@redhat.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
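The underflow can be reproduced in plain C. The sketch below assumes 32-bit counters and uses hypothetical function names. The clamped variant shows one way to avoid the wrap and is not claimed to match what virtio_transport_has_space() does internally.

```c
#include <stdint.h>

/* The arithmetic quoted in the commit message, on 32-bit unsigned
 * values: wraps around when in-flight bytes exceed peer_buf_alloc. */
static uint32_t credit_naive(uint32_t peer_buf_alloc, uint32_t tx_cnt,
			     uint32_t peer_fwd_cnt)
{
	return peer_buf_alloc - (tx_cnt - peer_fwd_cnt);
}

/* Illustrative fix: report zero credit instead of wrapping. */
static uint32_t credit_clamped(uint32_t peer_buf_alloc, uint32_t tx_cnt,
			       uint32_t peer_fwd_cnt)
{
	uint32_t in_flight = tx_cnt - peer_fwd_cnt;

	if (in_flight >= peer_buf_alloc)
		return 0;
	return peer_buf_alloc - in_flight;
}
```

For example, with 300 bytes in flight and a peer that has shrunk its buffer to 100, the naive form returns 4294967096 (100 - 300 modulo 2^32), while the clamped form returns 0.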
2026-01-22net: openvswitch: fix data race in ovs_vport_get_upcall_statsDavid Yang
In ovs_vport_get_upcall_stats(), some statistics protected by u64_stats_sync, are read and accumulated in ignorance of possible u64_stats_fetch_retry() events. These statistics are already accumulated by u64_stats_inc(). Fix this by reading them into temporary variables first. Fixes: 1933ea365aa7 ("net: openvswitch: Add support to count upcall packets") Signed-off-by: David Yang <mmyangfl@gmail.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20260121072932.2360971-1-mmyangfl@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-22cipso: harden use of skb_cow() in cipso_v4_skbuff_setattr()Will Rosenberg
If skb_cow() is passed a headroom <= -NET_SKB_PAD, it will trigger a BUG. As a result, use cases should avoid calling with a headroom that is negative to prevent triggering this issue. This is the same code pattern fixed in Commit 58fc7342b529 ("ipv6: BUG() in pskb_expand_head() as part of calipso_skbuff_setattr()"). In cipso_v4_skbuff_setattr(), len_delta can become negative, leading to a negative headroom passed to skb_cow(). However, the BUG is not triggerable because the condition headroom <= -NET_SKB_PAD cannot be satisfied due to limits on the IPv4 options header size. Avoid potential problems in the future by only using skb_cow() to grow the skb headroom. Signed-off-by: Will Rosenberg <whrosenb@asu.edu> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20260120155738.982771-1-whrosenb@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-21Merge tag 'nf-next-26-01-20' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
Subject: netfilter: updates for net-next

1) Speed up nftables transactions after earlier transaction failed. Due to a (harmless) bug we remained in slow paranoia mode until a successful transaction completes.
2) Allow generic tracker to resolve clashes, this avoids very rare packet drops. From Yuto Hamaguchi.
3) Increase the cleanup budget to 64 entries in nf_conncount to reap more entries in one go, from Fernando Fernandez Mancera.
4) Allow icmp trackers to resolve clashes, this avoids very rare initial packet drop with test cases that have high-frequency pings. After this all trackers except tcp and sctp allow clash resolution.
5) Disentangle netfilter headers, don't include nftables/xtables headers in subsystems that are unrelated.
6) Don't rely on implicit includes coming from nf_conntrack_proto_gre.h.
7) Allow nfnetlink_queue nfq instance struct to get accounted via memcg, from Scott Mitchell.
8) Reject bogus xt target/match data upfront via netlink policy in nft_compat interface rather than relying on x_tables API to do it.
9) Fix nf_conncount breakage when trying to limit loopback flows via prerouting rule, from Fernando Fernandez Mancera. This is a recent breakage but not seen as urgent enough to rush this via net tree at this late stage in development cycle.
10) Fix a possible off-by-one when parsing tcp option in xtables tcpmss match. Also handled via -next due to late stage in development cycle.
* tag 'nf-next-26-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: xt_tcpmss: check remaining length before reading optlen netfilter: nf_conncount: fix tracking of connections from localhost netfilter: nft_compat: add more restrictions on netlink attributes netfilter: nfnetlink_queue: nfqnl_instance GFP_ATOMIC -> GFP_KERNEL_ACCOUNT allocation netfilter: nf_conntrack: don't rely on implicit includes netfilter: don't include xt and nftables.h in unrelated subsystems netfilter: nf_conntrack: enable icmp clash support netfilter: nf_conncount: increase the connection clean up limit to 64 netfilter: nf_conntrack: Add allow_clash to generic protocol handler netfilter: nf_tables: reset table validation state on abort ==================== Link: https://patch.msgid.link/20260120191803.22208-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-21rxrpc: Fix data-race warning and potential load/store tearingDavid Howells
Fix the following:

  BUG: KCSAN: data-race in rxrpc_peer_keepalive_worker / rxrpc_send_data_packet

which is reporting an issue with the reads and writes to ->last_tx_at in:

  conn->peer->last_tx_at = ktime_get_seconds();

and:

  keepalive_at = peer->last_tx_at + RXRPC_KEEPALIVE_TIME;

The lockless accesses to these two values aren't actually a problem as the read only needs an approximate time of last transmission for the purposes of deciding whether or not the transmission of a keepalive packet is warranted yet.

Also, as ->last_tx_at is a 64-bit value, tearing can occur on a 32-bit arch.

Fix both of these by switching to an unsigned int for ->last_tx_at and only storing the LSW of the time64_t. It can then be reconstructed at need provided no more than 68 years have elapsed since the last transmission.

Fixes: ace45bec6d77 ("rxrpc: Fix firewall route keepalive")
Reported-by: syzbot+6182afad5045e6703b3d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/695e7cfb.050a0220.1c677c.036b.GAE@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/1107124.1768903985@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
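One way to picture the "store only the LSW, reconstruct later" idea is the userspace sketch below. The helper names are hypothetical and this is not the rxrpc implementation. It assumes the stored time lies within about 2^31 seconds (roughly 68 years) of the current time, and it relies on the usual two's-complement behaviour when converting uint32_t to int32_t.

```c
#include <stdint.h>

/* Keep only the low 32 bits of a 64-bit seconds value; a 32-bit
 * store cannot tear on a 32-bit architecture. */
static uint32_t time_to_lsw(int64_t t)
{
	return (uint32_t)t;
}

/* Rebuild the full 64-bit time from the stored low word and the
 * current time. The 32-bit difference is interpreted as a signed
 * offset from 'now' (assumes two's-complement conversion). */
static int64_t time_from_lsw(uint32_t lsw, int64_t now)
{
	int32_t delta = (int32_t)(lsw - (uint32_t)now);

	return now + delta;
}
```

The signed offset makes the reconstruction work even when the low word has wrapped between the store and the read, as long as the two times are less than 2^31 seconds apart.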
2026-01-21net: dsa: fix off-by-one in maximum bridge ID determinationVladimir Oltean
Prior to the blamed commit, the bridge_num range was from 0 to ds->max_num_bridges - 1. After the commit, it is from 1 to ds->max_num_bridges.

So this check:

  if (bridge_num >= max)
          return 0;

must be updated to:

  if (bridge_num > max)
          return 0;

in order to allow the last bridge_num value (==max) to be used.

This is most easily visible when a driver sets ds->max_num_bridges=1. The observed behaviour is that even the first created bridge triggers the netlink extack "Range of offloadable bridges exceeded" warning, and is handled in software rather than being offloaded.

Fixes: 3f9bb0301d50 ("net: dsa: make dp->bridge_num one-based")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260120211039.3228999-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
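The effect of the two comparisons can be checked in isolation. The functions below are a hypothetical reduction of the range check for one-based bridge numbers, where 0 means "no offloadable bridge number available"; they are not the DSA code itself.

```c
/* Check left over from the zero-based scheme: rejects bridge_num == max. */
static unsigned int bridge_num_check_old(unsigned int bridge_num,
					 unsigned int max)
{
	if (bridge_num >= max)
		return 0;
	return bridge_num;
}

/* Check for the one-based scheme: valid values are 1..max inclusive. */
static unsigned int bridge_num_check_new(unsigned int bridge_num,
					 unsigned int max)
{
	if (bridge_num > max)
		return 0;
	return bridge_num;
}
```

With max = 1, the old check returns 0 for the very first bridge (bridge_num = 1), which matches the symptom described in the commit message; the new check accepts 1 and still rejects 2.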
2026-01-21gro: inline tcp6_gro_complete()Eric Dumazet
Remove one function call from GRO stack for native IPv6 + TCP packets.

  $ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
  add/remove: 0/0 grow/shrink: 1/1 up/down: 298/-5 (293)
  Function                  old     new   delta
  ipv6_gro_complete         435     733    +298
  tcp6_gro_complete         311     306      -5
  Total: Before=22593532, After=22593825, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260120164903.1912995-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-21gro: inline tcp6_gro_receive()Eric Dumazet
FDO/LTO are unable to inline tcp6_gro_receive() from ipv6_gro_receive().

Make sure tcp6_check_fraglist_gro() is only called when needed, so that the compiler can leave it out-of-line.

  $ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
  add/remove: 2/0 grow/shrink: 3/1 up/down: 1123/-253 (870)
  Function                          old     new   delta
  ipv6_gro_receive                 1069    1846    +777
  tcp6_check_fraglist_gro             -     272    +272
  ipv6_offload_init                 218     274     +56
  __pfx_tcp6_check_fraglist_gro       -      16     +16
  ipv6_gro_complete                 433     435      +2
  tcp6_gro_receive                  959     706    -253
  Total: Before=22592662, After=22593532, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260120164903.1912995-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-21tcp: preserve const qualifier in tcp_rsk() and inet_rsk()Eric Dumazet
We can change tcp_rsk() and inet_rsk() to propagate their argument const qualifier thanks to container_of_const(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260120125353.1470456-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20kernel.h: drop hex.h and update all hex.h usersRandy Dunlap
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of hex.h interfaces to directly #include <linux/hex.h> as part of the process of putting kernel.h on a diet. Removing hex.h from kernel.h means that 36K C source files don't have to pay the price of parsing hex.h for the roughly 120 C source files that need it. This change has been build-tested with allmodconfig on most ARCHes. Also, all users/callers of <linux/hex.h> in the entire source tree have been updated if needed (if not already #included). Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20netrom: fix double-free in nr_route_frame()Jeongjun Park
In nr_route_frame(), old_skb is immediately freed without checking if nr_neigh->ax25 pointer is NULL. Therefore, if nr_neigh->ax25 is NULL, the caller function will free old_skb again, causing a double-free bug. Therefore, to prevent this, we need to modify it to check whether nr_neigh->ax25 is NULL before freeing old_skb. Cc: <stable@vger.kernel.org> Reported-by: syzbot+999115c3bf275797dc27@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69694d6f.050a0220.58bed.0029.GAE@google.com/ Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Link: https://patch.msgid.link/20260119063359.10604-1-aha310510@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20net: remove HIPPI support and RoadRunner HIPPI driverEthan Nelson-Moore
HIPPI has not been relevant for over two decades. It was rapidly eclipsed by Fibre Channel, and even when it was new, it was confined to very high-end hardware. The HIPPI code has only received tree-wide changes and fixes by inspection in the entire Git history. Remove HIPPI support and the rrunner HIPPI driver, and move the former maintainer to the CREDITS file. Keep the include/uapi/linux/if_hippi.h header because it is used by the TUN code, and to avoid breaking userspace, however unlikely that may be. Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Link: https://patch.msgid.link/20260119022451.22344-1-enelsonmoore@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20tcp: move tcp_rate_skb_delivered() to tcp_input.cEric Dumazet
tcp_rate_skb_delivered() is only called from tcp_input.c. Move it there and make it static.

Both gcc and clang are (auto)inlining it, TCP performance is increased at a small space cost.

  $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
  add/remove: 0/2 grow/shrink: 3/0 up/down: 509/-187 (322)
  Function                          old     new   delta
  tcp_sacktag_walk                 1682    1867    +185
  tcp_ack                          5230    5405    +175
  tcp_shifted_skb                   437     586    +149
  __pfx_tcp_rate_skb_delivered       16       -     -16
  tcp_rate_skb_delivered            171       -    -171
  Total: Before=22566192, After=22566514, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260118123204.2315993-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20ipv6: annotate data-race in ndisc_router_discovery()Eric Dumazet
syzbot found that ndisc_router_discovery() could read and write in6_dev->ra_mtu without holding a lock [1] This looks fine, IFLA_INET6_RA_MTU is best effort. Add READ_ONCE()/WRITE_ONCE() to document the race. Note that we might also reject illegal MTU values (mtu < IPV6_MIN_MTU || mtu > skb->dev->mtu) in a future patch. [1] BUG: KCSAN: data-race in ndisc_router_discovery / ndisc_router_discovery read to 0xffff888119809c20 of 4 bytes by task 25817 on cpu 1: ndisc_router_discovery+0x151d/0x1c90 net/ipv6/ndisc.c:1558 ndisc_rcv+0x2ad/0x3d0 net/ipv6/ndisc.c:1841 icmpv6_rcv+0xe5a/0x12f0 net/ipv6/icmp.c:989 ip6_protocol_deliver_rcu+0xb2a/0x10d0 net/ipv6/ip6_input.c:438 ip6_input_finish+0xf0/0x1d0 net/ipv6/ip6_input.c:489 NF_HOOK include/linux/netfilter.h:318 [inline] ip6_input+0x5e/0x140 net/ipv6/ip6_input.c:500 ip6_mc_input+0x27c/0x470 net/ipv6/ip6_input.c:590 dst_input include/net/dst.h:474 [inline] ip6_rcv_finish+0x336/0x340 net/ipv6/ip6_input.c:79 ... write to 0xffff888119809c20 of 4 bytes by task 25816 on cpu 0: ndisc_router_discovery+0x155a/0x1c90 net/ipv6/ndisc.c:1559 ndisc_rcv+0x2ad/0x3d0 net/ipv6/ndisc.c:1841 icmpv6_rcv+0xe5a/0x12f0 net/ipv6/icmp.c:989 ip6_protocol_deliver_rcu+0xb2a/0x10d0 net/ipv6/ip6_input.c:438 ip6_input_finish+0xf0/0x1d0 net/ipv6/ip6_input.c:489 NF_HOOK include/linux/netfilter.h:318 [inline] ip6_input+0x5e/0x140 net/ipv6/ip6_input.c:500 ip6_mc_input+0x27c/0x470 net/ipv6/ip6_input.c:590 dst_input include/net/dst.h:474 [inline] ip6_rcv_finish+0x336/0x340 net/ipv6/ip6_input.c:79 ... value changed: 0x00000000 -> 0xe5400659 Fixes: 49b99da2c9ce ("ipv6: add IFLA_INET6_RA_MTU to expose mtu value") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Rocco Yue <rocco.yue@mediatek.com> Link: https://patch.msgid.link/20260118152941.2563857-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20net: add kdoc for napi_consume_skb()Jakub Kicinski
Looks like AI reviewers miss that napi_consume_skb() must have a real budget passed to it. Let's see if adding a real kdoc will help them figure this out. Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20260119224140.1362729-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>