summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2025-09-14net: use NUMA drop counters for softnet_data.droppedEric Dumazet
Hosts under DOS attack can suffer from false sharing in enqueue_to_backlog() : atomic_inc(&sd->dropped). This is because sd->dropped can be touched from many cpus, possibly residing on different NUMA nodes. Generalize the sk_drop_counters infrastucture added in commit c51613fa276f ("net: add sk->sk_drop_counters") and use it to replace softnet_data.dropped with NUMA friendly softnet_data.drop_counters. This adds 64 bytes per cpu, maybe more in the future if we increase the number of counters (currently 2) per 'struct numa_drop_counters'. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250909121942.1202585-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11tcp: Free TCP-AO/TCP-MD5 info/keys without RCUDmitry Safonov
Now that the destruction of info/keys is delayed until the socket destructor, it's safe to use kfree() without an RCU callback. The socket is in TCP_CLOSE state either because it never left it, or it's already closed and the refcounter is zero. In any way, no one can discover it anymore, it's safe to release memory straight away. Similar thing was possible for twsk already. Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-2-9ffaaaf8b236@arista.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11tcp: Destroy TCP-AO, TCP-MD5 keys in .sk_destruct()Dmitry Safonov
Currently there are a couple of minor issues with destroying the keys tcp_v4_destroy_sock(): 1. The socket is yet in TCP bind buckets, making it reachable for incoming segments [on another CPU core], potentially available to send late FIN/ACK/RST replies. 2. There is at least one code path, where tcp_done() is called before sending RST [kudos to Bob for investigation]. This is a case of a server, that finished sending its data and just called close(). The socket is in TCP_FIN_WAIT2 and has RCV_SHUTDOWN (set by __tcp_close()) tcp_v4_do_rcv()/tcp_v6_do_rcv() tcp_rcv_state_process() /* LINUX_MIB_TCPABORTONDATA */ tcp_reset() tcp_done_with_error() tcp_done() inet_csk_destroy_sock() /* Destroys AO/MD5 keys */ /* tcp_rcv_state_process() returns SKB_DROP_REASON_TCP_ABORT_ON_DATA */ tcp_v4_send_reset() /* Sends an unsigned RST segment */ tcpdump: > 22:53:15.399377 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 33929, offset 0, flags [DF], proto TCP (6), length 60) > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [F.], seq 2185658590, ack 3969644355, win 502, options [nop,nop,md5 valid], length 0 > 22:53:15.399396 00:00:01:01:00:00 > 00:00:b2:1f:00:00, ethertype IPv4 (0x0800), length 86: (tos 0x0, ttl 64, id 51951, offset 0, flags [DF], proto TCP (6), length 72) > 1.0.0.2.49848 > 1.0.0.1.34567: Flags [.], seq 3969644375, ack 2185658591, win 128, options [nop,nop,md5 valid,nop,nop,sack 1 {2185658590:2185658591}], length 0 > 22:53:16.429588 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40) > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658590, win 0, length 0 > 22:53:16.664725 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60) > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0 > 22:53:17.289832 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60) > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0 Note the signed RSTs later in the dump - those are sent by the server when the fin-wait socket gets removed from hash buckets, by the listener socket. Instead of destroying AO/MD5 info and their keys in inet_csk_destroy_sock(), slightly delay it until the actual socket .sk_destruct(). As shutdown'ed socket can yet send non-data replies, they should be signed in order for the peer to process them. Now it also matches how AO/MD5 gets destructed for TIME-WAIT sockets (in tcp_twsk_destructor()). This seems optimal for TCP-MD5, while for TCP-AO it seems to have an open problem: once RST get sent and socket gets actually destructed, there is no information on the initial sequence numbers. So, in case this last RST gets lost in the network, the server's listener socket won't be able to properly sign another RST. Nothing in RFC 1122 prescribes keeping any local state after non-graceful reset. Luckily, BGP are known to use keep alive(s). While the issue is quite minor/cosmetic, these days monitoring network counters is a common practice and getting invalid signed segments from a trusted BGP peer can get customers worried. Investigated-by: Bob Gilligan <gilligan@arista.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-1-9ffaaaf8b236@arista.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11Merge tag 'wireless-next-2025-09-11' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== Plenty of things going on, notably: - iwlwifi: major cleanups/rework - brcmfmac: gets AP isolation support - mac80211: gets more S1G support * tag 'wireless-next-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (94 commits) wifi: mwifiex: fix endianness handling in mwifiex_send_rgpower_table wifi: cfg80211: Remove the redundant wiphy_dev wifi: mac80211: fix incorrect comment wifi: cfg80211: update the time stamps in hidden ssid wifi: mac80211: Fix HE capabilities element check wifi: mac80211: add tx_handlers_drop statistics to ethtool wifi: mac80211: fix reporting of all valid links in sta_set_sinfo() wifi: iwlwifi: mld: CHANNEL_SURVEY_NOTIF is always supported wifi: iwlwifi: mld: remove support of iwl_esr_mode_notif version 1 wifi: iwlwifi: mld: remove support from of sta cmd version 1 wifi: iwlwifi: mld: remove support of roc cmd version 5 wifi: iwlwifi: mld: remove support of mac cmd ver 2 wifi: iwlwifi: mld: don't consider phy cmd version 5 wifi: iwlwifi: implement wowlan status notification API update wifi: iwlwifi: fw: Add ASUS to PPAG and TAS list wifi: iwlwifi: add kunit tests for nvm parse wifi: iwlwifi: api: add a flag to iwl_link_ctx_modify_flags wifi: iwlwifi: pcie: move ltr_enabled to the specific transport wifi: iwlwifi: pcie: move pm_support to the specific transport wifi: iwlwifi: rename iwl_finish_nic_init ... ==================== Link: https://patch.msgid.link/20250911100854.20445-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc6). Conflicts: net/netfilter/nft_set_pipapo.c net/netfilter/nft_set_pipapo_avx2.c c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") 84c1da7b38d9 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too") Only trivial adjacent changes (in a doc and a Makefile). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11ipv4: icmp: Pass IPv4 control block structure as an argument to __icmp_send()Ido Schimmel
__icmp_send() is used to generate ICMP error messages in response to various situations such as MTU errors (i.e., "Fragmentation Required") and too many hops (i.e., "Time Exceeded"). The skb that generated the error does not necessarily come from the IPv4 layer and does not always have a valid IPv4 control block in skb->cb. Therefore, commit 9ef6b42ad6fd ("net: Add __icmp_send helper.") changed the function to take the IP options structure as argument instead of deriving it from the skb's control block. Some callers of this function such as icmp_send() pass the IP options structure from the skb's control block as in these call paths the control block is known to be valid, but other callers simply pass a zeroed structure. A subsequent patch will need __icmp_send() to access more information from the IPv4 control block (specifically, the ifindex of the input interface). As a preparation for this change, change the function to take the IPv4 control block structure as an argument instead of the IP options structure. This makes the function similar to its IPv6 counterpart that already takes the IPv6 control block structure as an argument. No functional changes intended. Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250908073238.119240-3-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-11net: xdp: handle frags with unreadable memoryJakub Kicinski
We don't expect frags with unreadable memory to be presented to XDP programs today, but the XDP helpers are designed to be usable whether XDP is enabled or not. Support handling frags with unreadable memory. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250905221539.2930285-3-kuba@kernel.org Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-11net: xdp: pass full flags to xdp_update_skb_shared_info()Jakub Kicinski
xdp_update_skb_shared_info() needs to update skb state which was maintained in xdp_buff / frame. Pass full flags into it, instead of breaking it out bit by bit. We will need to add a bit for unreadable frags (even tho XDP doesn't support those the driver paths may be common), at which point almost all call sites would become: xdp_update_skb_shared_info(skb, num_frags, sinfo->xdp_frags_size, MY_PAGE_SIZE * num_frags, xdp_buff_is_frag_pfmemalloc(xdp), xdp_buff_is_frag_unreadable(xdp)); Keep a helper for accessing the flags, in case we need to transform them somehow in the future (e.g. to cover up xdp_buff vs xdp_frame differences). While we are touching call callers - rename the helper to xdp_update_skb_frags_info(), previous name may have implied that it's shinfo that's updated. We are updating flags in struct sk_buff based on frags that got attched. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/20250905221539.2930285-2-kuba@kernel.org Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-10netfilter: nf_tables: make nft_set_do_lookup available unconditionallyFlorian Westphal
This function was added for retpoline mitigation and is replaced by a static inline helper if mitigations are not enabled. Enable this helper function unconditionally so next patch can add a lookup restart mechanism to fix possible false negatives while transactions are in progress. Adding lookup restarts in nft_lookup_eval doesn't work as nft_objref would then need the same copypaste loop. This patch is separate to ease review of the actual bug fix. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-10netfilter: nf_tables: place base_seq in struct netFlorian Westphal
This will soon be read from packet path around same time as the gencursor. Both gencursor and base_seq get incremented almost at the same time, so it makes sense to place them in the same structure. This doesn't increase struct net size on 64bit due to padding. Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-09devlink: Add 'total_vfs' generic device paramVlad Dumitrescu
NICs are typically configured with total_vfs=0, forcing users to rely on external tools to enable SR-IOV (a widely used and essential feature). Add total_vfs parameter to devlink for SR-IOV max VF configurability. Enables standard kernel tools to manage SR-IOV, addressing the need for flexible VF configuration. Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Tested-by: Kamal Heib <kheib@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907012953.301746-2-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09Merge branch '200GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== idpf: add XDP support Alexander Lobakin says: Add XDP support (w/o XSk for now) to the idpf driver using the libeth_xdp sublib. All possible verdicts, .ndo_xdp_xmit(), multi-buffer etc. are here. In general, nothing outstanding comparing to ice, except performance -- let's say, up to 2x for .ndo_xdp_xmit() on certain platforms and scenarios. idpf doesn't support VLAN Rx offload, so only the hash hint is available for now. Patches 1-7 are prereqs, without which XDP would either not work at all or work slower/worse/... * '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: idpf: add XDP RSS hash hint idpf: add support for .ndo_xdp_xmit() idpf: add support for XDP on Rx idpf: use generic functions to build xdp_buff and skb idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq idpf: prepare structures to support XDP idpf: add support for nointerrupt queues idpf: remove SW marker handling from NAPI idpf: add 4-byte completion descriptor definition idpf: link NAPIs to queues idpf: use a saner limit for default number of queues to allocate idpf: fix Rx descriptor ready check barrier in splitq xdp, libeth: make the xdp_init_buff() micro-optimization generic ==================== Link: https://patch.msgid.link/20250908195748.1707057-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09bonding: support aggregator selection based on port priorityHangbin Liu
Add a new ad_select policy 'port_priority' that uses the per-port actor priority values (set via ad_actor_port_prio) to determine aggregator selection. This allows administrators to influence which ports are preferred for aggregation by assigning different priority values, providing more flexible load balancing control in LACP configurations. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250902064501.360822-3-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-09bonding: add support for per-port LACP actor priorityHangbin Liu
Introduce a new netlink attribute 'actor_port_prio' to allow setting the LACP actor port priority on a per-slave basis. This extends the existing bonding infrastructure to support more granular control over LACP negotiations. The priority value is embedded in LACPDU packets and will be used by subsequent patches to influence aggregator selection policies. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250902064501.360822-2-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-08net: snmp: remove SNMP_MIB_SENTINELEric Dumazet
No more user of SNMP_MIB_SENTINEL, we can remove it. Also remove snmp_get_cpu_field[64]_batch() helpers. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-08ipv6: snmp: do not use SNMP_MIB_SENTINEL anymoreEric Dumazet
Use ARRAY_SIZE(), so that we know the limit at compile time. Following patch needs this preliminary change. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-08xdp, libeth: make the xdp_init_buff() micro-optimization genericAlexander Lobakin
Often times the compilers are not able to expand two consecutive 32-bit writes into one 64-bit on the corresponding architectures. This applies to xdp_init_buff() called for every received frame (or at least once per each 64 frames when the frag size is fixed). Move the not-so-pretty hack from libeth_xdp straight to xdp_init_buff(), but using a proper union around ::frame_sz and ::flags. The optimization is limited to LE architectures due to the structure layout. One simple example from idpf with the XDP series applied (Clang 22-git, CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE => -O2): add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27) Function old new delta idpf_vport_splitq_napi_poll 5076 5049 -27 The perf difference with XDP_DROP is around +0.8-1% which I see as more than satisfying. Suggested-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Ramu R <ramu.r@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-09-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h c51613fa276f ("net: add sk->sk_drop_counters") 5d6b58c932ec ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04wifi: nl80211: allow drivers to support subset of NL80211_CMD_SET_BSSArend van Spriel
The so-called fullmac devices rely on firmware functionality and/or API to change BSS parameters. Today there are limited drivers supporting the nl80211 primitive, but they only handle a subset of the bss parameters passed if any. The mac80211 driver does handle all parameters and stores their configured values. Some of the BSS parameters were already conditional by wiphy->features. For these the wiphy->bss_param_support and wiphy->features fields are silently aligned in wiphy_register(). Maybe better to issue a warning instead when they are misaligned. Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com> Link: https://patch.msgid.link/20250817190435.1495094-2-arend.vanspriel@broadcom.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-04wifi: nl80211: Add EHT fixed Tx rate supportMuna Sinada
Add new attributes to support EHT MCS/NSS Tx rates and EHT GI/LTF. Parse EHT fixed MCS/NSS Tx rates and EHT GI/LTF values passed by the userspace, validate and add as part of cfg80211_bitrate_mask. MCS mask is constructed by new function, eht_build_mcs_mask(). Max NSS supported for MCS rates of 7, 9, 11 and 13 is utilized to set MCS bitmask for each NSS. MCS rates 14, and 15 if supported, are set only for NSS = 0. Co-developed-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com> Signed-off-by: Aloka Dixit <aloka.dixit@oss.qualcomm.com> Signed-off-by: Muna Sinada <muna.sinada@oss.qualcomm.com> Link: https://patch.msgid.link/20250815213011.2704803-1-muna.sinada@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-04wifi: cfg80211: fix return value in cfg80211_get_radio_idx_by_chan()Aditya Kumar Singh
If a valid radio index is not found, the function returns -ENOENT. If the channel argument itself is invalid, it returns -EINVAL. However, since the caller only checks for < 0, the distinction between these error codes is not utilized much. Also, handling these two distinct error codes throughout the codebase adds complexity, as both cases must be addressed separately. A subsequent change aims to simplify this by using a single error code for all invalid cases, making error handling more consistent and streamlined. To support this change, update the return value to -EINVAL when a valid radio index is not found. This is still appropriate because, even if the channel argument is structurally valid, the absence of a corresponding radio index implies that the argument is effectively invalid—otherwise, a valid index would have been found. Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com> Link: https://patch.msgid.link/20250812-fix_scan_ap_flag_requirement_during_mlo-v4-1-383ffb6da213@oss.qualcomm.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-09-04net: add helper to pre-check if PP for an Rx queue will be unreadableJakub Kicinski
mlx5 pokes into the rxq state to check if the queue has a memory provider, and therefore whether it may produce unreadable mem. Add a helper for doing this in the page pool API. fbnic will want a similar thing (tho, for a slightly different reason). Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250901211214.1027927-11-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-03net: nfc: nci: Increase NCI_DATA_TIMEOUT to 3000 msJuraj Šarinay
An exchange with a NFC target must complete within NCI_DATA_TIMEOUT. A delay of 700 ms is not sufficient for cryptographic operations on smart cards. CardOS 6.0 may need up to 1.3 seconds to perform 256-bit ECDH or 3072-bit RSA. To prevent brute-force attacks, passports and similar documents introduce even longer delays into access control protocols (BAC/PACE). The timeout should be higher, but not too much. The expiration allows us to detect that a NFC target has disappeared. Signed-off-by: Juraj Šarinay <juraj@sarinay.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://patch.msgid.link/20250902113630.62393-1-juraj@sarinay.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03net: lockless sock_i_ino()Eric Dumazet
Followup of commit c51da3f7a161 ("net: remove sock_i_uid()") A recent syzbot report was the trigger for this change. Over the years, we had many problems caused by the read_lock[_bh](&sk->sk_callback_lock) in sock_i_uid(). We could fix smc_diag_dump_proto() or make a more radical move: Instead of waiting for new syzbot reports, cache the socket inode number in sk->sk_ino, so that we no longer need to acquire sk->sk_callback_lock in sock_i_ino(). This makes socket dumps faster (one less cache line miss, and two atomic ops avoided). Prior art: commit 25a9c8a4431c ("netlink: Add __sock_i_ino() for __netlink_diag_dump().") commit 4f9bf2a2f5aa ("tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.") commit efc3dbc37412 ("rds: Make rds_sock_lock BH rather than IRQ safe.") Fixes: d2d6422f8bd1 ("x86: Allow to enable PREEMPT_RT.") Reported-by: syzbot+50603c05bbdf4dfdaffa@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68b73804.050a0220.3db4df.01d8.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250902183603.740428-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03Merge tag 'nf-next-25-09-02' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1) prefer vmalloc_array in ebtables, from Qianfeng Rong. 2) Use csum_replace4 instead of open-coding it, from Christophe Leroy. 3+4) Get rid of GFP_ATOMIC in transaction object allocations, those cause silly failures with large sets under memory pressure, from myself. 5) Remove test for AVX cpu feature in nftables pipapo set type, testing for AVX2 feature is sufficient. 6) Unexport a few function in nf_reject infra: no external callers. 7) Extend payload offset to u16, this was restricted to values <=255 so far, from Fernando Fernandez Mancera. * tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nft_payload: extend offset to 65535 bytes netfilter: nf_reject: remove unneeded exports netfilter: nft_set_pipapo: remove redundant test for avx feature bit netfilter: nf_tables: all transaction allocations can now sleep netfilter: nf_tables: allow iter callbacks to sleep netfilter: nft_payload: Use csum_replace4() instead of opencoding netfilter: ebtables: Use vmalloc_array() to improve code ==================== Link: https://patch.msgid.link/20250902133549.15945-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03genetlink: fix typo in commentAsbjørn Sloth Tønnesen
In this context "not that ..." should properly be "note that ...". Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250902154640.759815-4-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-03net: Add rfs_needed() helperChristoph Paasch
Add a helper to check if RFS is needed or not. Allows to make the code a bit cleaner and the next patch to have MPTCP use this helper to decide whether or not to iterate over the subflows. tun_flow_update() was calling sock_rps_record_flow_hash() regardless of the state of rfs_needed. This was not really a bug as sock_flow_table simply ends up being NULL and thus everything will be fine. This commit here thus also implicitly makes tun_flow_update() respect the state of rfs_needed. Suggested-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Christoph Paasch <cpaasch@openai.com> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250902-net-next-mptcp-misc-feat-6-18-v2-3-fa02bb3188b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02net_sched: act: remove tcfa_qstatsEric Dumazet
tcfa_qstats is currently only used to hold drops and overlimits counters. tcf_action_inc_drop_qstats() and tcf_action_inc_overlimit_qstats() currently acquire a->tcfa_lock to increment these counters. Switch to two atomic_t to get lock-free accounting. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250901093141.2093176-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02bonding: Remove support for use_carrierJay Vosburgh
Remove the implementation of use_carrier, the link monitoring method that utilizes ethtool or ioctl to determine the link state of an interface in a bond. Bonding will always behaves as if use_carrier=1, which relies on netif_carrier_ok() to determine the link state of interfaces. To avoid acquiring RTNL many times per second, bonding inspects link state under RCU, but not under RTNL. However, ethtool implementations in drivers may sleep, and therefore this strategy is unsuitable for use with calls into driver ethtool functions. The use_carrier option was introduced in 2003, to provide backwards compatibility for network device drivers that did not support the then-new netif_carrier_ok/on/off system. Device drivers are now expected to support netif_carrier_*, and the use_carrier backwards compatibility logic is no longer necessary. The option itself remains, but when queried always returns 1, and may only be set to 1. Link: https://lore.kernel.org/000000000000eb54bf061cfd666a@google.com Link: https://lore.kernel.org/20240718122017.d2e33aaac43a.I10ab9c9ded97163aef4e4de10985cd8f7de60d28@changeid Signed-off-by: Jay Vosburgh <jv@jvosburgh.net> Reported-by: syzbot+b8c48ea38ca27d150063@syzkaller.appspotmail.com Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/2029487.1756512517@famine Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02netfilter: nft_payload: extend offset to 65535 bytesFernando Fernandez Mancera
In some situations 255 bytes offset is not enough to match or manipulate the desired packet field. Increase the offset limit to 65535 or U16_MAX. In addition, the nla policy maximum value is not set anymore as it is limited to s16. Instead, the maximum value is checked during the payload expression initialization function. Tested with the nft command line tool. table ip filter { chain output { @nh,2040,8 set 0xff @nh,524280,8 set 0xff @nh,524280,8 0xff @nh,2040,8 0xff } } Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-02netfilter: nf_reject: remove unneeded exportsFlorian Westphal
These functions have no external callers and can be static. Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-02netfilter: nf_tables: allow iter callbacks to sleepFlorian Westphal
Quoting Sven Auhagen: we do see on occasions that we get the following error message, more so on x86 systems than on arm64: Error: Could not process rule: Cannot allocate memory delete table inet filter It is not a consistent error and does not happen all the time. We are on Kernel 6.6.80, seems to me like we have something along the lines of the nf_tables: allow clone callbacks to sleep problem using GFP_ATOMIC. As hinted at by Sven, this is because of GFP_ATOMIC allocations during set flush. When set is flushed, all elements are deactivated. This triggers a set walk and each element gets added to the transaction list. The rbtree and rhashtable sets don't allow the iter callback to sleep: rbtree walk acquires read side of an rwlock with bh disabled, rhashtable walk happens with rcu read lock held. Rbtree is simple enough to resolve: When the walk context is ITER_READ, no change is needed (the iter callback must not deactivate elements; we're not in a transaction). When the iter type is ITER_UPDATE, the rwlock isn't needed because the caller holds the transaction mutex, this prevents any and all changes to the ruleset, including add/remove of set elements. Rhashtable is slightly more complex. When the iter type is ITER_READ, no change is needed, like rbtree. For ITER_UPDATE, we hold transaction mutex which prevents elements from getting free'd, even outside of rcu read lock section. So build a temporary list of all elements while doing the rcu iteration and then call the iterator in a second pass. The disadvantage is the need to iterate twice, but this cost comes with the benefit to allow the iter callback to use GFP_KERNEL allocations in a followup patch. The new list based logic makes it necessary to catch recursive calls to the same set earlier. Such walk -> iter -> walk recursion for the same set can happen during ruleset validation in case userspace gave us a bogus (cyclic) ruleset where verdict map m jumps to chain that sooner or later also calls "vmap @m". Before the new ->in_update_walk test, the ruleset is rejected because the infinite recursion causes ctx->level to exceed the allowed maximum. But with the new logic added here, elements would get skipped: nft_rhash_walk_update would see elements that are on the walk_list of an older stack frame. As all recursive calls into same map results in -EMLINK, we can avoid this problem by using the new in_update_walk flag and reject immediately. Next patch converts the problematic GFP_ATOMIC allocations. Reported-by: Sven Auhagen <Sven.Auhagen@belden.com> Closes: https://lore.kernel.org/netfilter-devel/BY1PR18MB5874110CAFF1ED098D0BC4E7E07BA@BY1PR18MB5874.namprd18.prod.outlook.com/ Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-01inet: ping: make ping_port_rover per netnsEric Dumazet
Provide isolation between netns for ping idents. Randomize initial ping_port_rover value at netns creation. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250829153054.474201-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01inet: ping: remove ping_hash()Eric Dumazet
There is no point in keeping ping_hash(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250829153054.474201-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01tcp: Remove sk->sk_prot->orphan_count.Kuniyuki Iwashima
TCP tracks the number of orphaned (SOCK_DEAD but not yet destructed) sockets in tcp_orphan_count. In some code that was shared with DCCP, tcp_orphan_count is referenced via sk->sk_prot->orphan_count. Let's reference tcp_orphan_count directly. inet_csk_prepare_for_destroy_sock() is moved to inet_connection_sock.c due to header dependency. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250829215641.711664-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01copy_process: pass clone_flags as u64 across calltreeSimon Schuster
With the introduction of clone3 in commit 7f192e3cd316 ("fork: add clone3") the effective bit width of clone_flags on all architectures was increased from 32-bit to 64-bit, with a new type of u64 for the flags. However, for most consumers of clone_flags the interface was not changed from the previous type of unsigned long. While this works fine as long as none of the new 64-bit flag bits (CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still undesirable in terms of the principle of least surprise. Thus, this commit fixes all relevant interfaces of callees to sys_clone3/copy_process (excluding the architecture-specific copy_thread) to consistently pass clone_flags as u64, so that no truncation to 32-bit integers occurs on 32-bit architectures. Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-29net: use dst_dev_rcu() in sk_setup_caps()Eric Dumazet
Use RCU to protect accesses to dst->dev from sk_setup_caps() and sk_dst_gso_max_size(). Also use dst_dev_rcu() in ip6_dst_mtu_maybe_forward(), and ip_dst_mtu_maybe_forward(). ip4_dst_hoplimit() can use dst_dev_net_rcu(). Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29net: dst: introduce dst->dev_rcuEric Dumazet
Followup of commit 88fe14253e18 ("net: dst: add four helpers to annotate data-races around dst->dev"). We want to gradually add explicit RCU protection to dst->dev, including lockdep support. Add an union to alias dst->dev_rcu and dst->dev. Add dst_dev_net_rcu() helper. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc4). No conflicts. Adjacent changes: drivers/net/ethernet/intel/idpf/idpf_txrx.c 02614eee26fb ("idpf: do not linearize big TSO packets") 6c4e68480238 ("idpf: remove obsolete stashing code") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28net_sched: act_skbmod: use RCU in tcf_skbmod_dump()Eric Dumazet
Also storing tcf_action into struct tcf_skbmod_params makes sure there is no discrepancy in tcf_skbmod_act(). No longer block BH in tcf_skbmod_init() when acquiring tcf_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28net_sched: act_tunnel_key: use RCU in tunnel_key_dump()Eric Dumazet
Also storing tcf_action into struct tcf_tunnel_key_params makes sure there is no discrepancy in tunnel_key_act(). No longer block BH in tunnel_key_init() when acquiring tcf_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28net_sched: act_vlan: use RCU in tcf_vlan_dump()Eric Dumazet
Also storing tcf_action into struct tcf_vlan_params makes sure there is no discrepancy in tcf_vlan_act(). No longer block BH in tcf_vlan_init() when acquiring tcf_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250827125349.3505302-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28queue_api: add support for fetching per queue DMA devDragos Tatulea
For zerocopy (io_uring, devmem), there is an assumption that the parent device can do DMA. However that is not always the case: - Scalable Function netdevs [1] have the DMA device in the grandparent. - For Multi-PF netdevs [2] queues can be associated to different DMA devices. This patch introduces the a queue based interface for allowing drivers to expose a different DMA device for zerocopy. [1] Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst [2] Documentation/networking/multi-pf-netdev.rst Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250827144017.1529208-3-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28inet: raw: add drop_counters to raw socketsEric Dumazet
When a packet flood hits one or more RAW sockets, many cpus have to update sk->sk_drops. This slows down other cpus, because currently sk_drops is in sock_write_rx group. Add a socket_drop_counters structure to raw sockets. Using dedicated cache lines to hold drop counters makes sure that consumers no longer suffer from false sharing if/when producers only change sk->sk_drops. This adds 128 bytes per RAW socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-6-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28udp: add drop_counters to udp socketEric Dumazet
When a packet flood hits one or more UDP sockets, many cpus have to update sk->sk_drops. This slows down other cpus, because currently sk_drops is in sock_write_rx group. Add a socket_drop_counters structure to udp sockets. Using dedicated cache lines to hold drop counters makes sure that consumers no longer suffer from false sharing if/when producers only change sk->sk_drops. This adds 128 bytes per UDP socket. Tested with the following stress test, sending about 11 Mpps to a dual socket AMD EPYC 7B13 64-Core. super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120 Note: due to socket lookup, only one UDP socket is receiving packets on DUT. Then measure receiver (DUT) behavior. We can see both consumer and BH handlers can process more packets per second. Before: nstat -n ; sleep 1 ; nstat | grep Udp Udp6InDatagrams 615091 0.0 Udp6InErrors 3904277 0.0 Udp6RcvbufErrors 3904277 0.0 After: nstat -n ; sleep 1 ; nstat | grep Udp Udp6InDatagrams 816281 0.0 Udp6InErrors 7497093 0.0 Udp6RcvbufErrors 7497093 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-5-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28net: add sk->sk_drop_countersEric Dumazet
Some sockets suffer from heavy false sharing on sk->sk_drops, and fields in the same cache line. Add sk->sk_drop_counters to: - move the drop counter(s) to dedicated cache lines. - Add basic NUMA awareness to these drop counter(s). Following patches will use this infrastructure for UDP and RAW sockets. sk_clone_lock() is not yet ready, it would need to properly set newsk->sk_drop_counters if we plan to use this for TCP sockets. v2: used Paolo suggestion from https://lore.kernel.org/netdev/8f09830a-d83d-43c9-b36b-88ba0a23e9b2@redhat.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28net: add sk_drops_skbadd() helperEric Dumazet
Existing sk_drops_add() helper is renamed to sk_drops_skbadd(). Add sk_drops_add() and convert sk_drops_inc() to use it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpersEric Dumazet
We want to split sk->sk_drops in the future to reduce potential contention on this field. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-27net: Prevent RPS table overwrite of active flowsKrishna Kumar
This patch fixes an issue where two different flows on the same RXq produce the same hash resulting in continuous flow overwrites. Flow #1: A packet for Flow #1 comes in, kernel calls the steering function. The driver gives back a filter id. The kernel saves this filter id in the selected slot. Later, the driver's service task checks if any filters have expired and then installs the rule for Flow #1. Flow #2: A packet for Flow #2 comes in. It goes through the same steps. But this time, the chosen slot is being used by Flow #1. The driver gives a new filter id and the kernel saves it in the same slot. When the driver's service task runs, it runs through all the flows, checks if Flow #1 should be expired, the kernel returns True as the slot has a different filter id, and then the driver installs the rule for Flow #2. Flow #1: Another packet for Flow #1 comes in. The same thing repeats. The slot is overwritten with a new filter id for Flow #1. This causes a repeated cycle of flow programming for missed packets, wasting CPU cycles while not improving performance. This problem happens at higher rates when the RPS table is small, but tests show it still happens even with 12,000 connections and an RPS size of 16K per queue (global table size = 144x16K = 64K). This patch prevents overwriting an rps_dev_flow entry if it is active. The intention is that it is better to do aRFS for the first flow instead of hurting all flows on the same hash. Without this, two (or more) flows on one RX queue with the same hash can keep overwriting each other. This causes the driver to reprogram the flow repeatedly. Changes: 1. Add a new 'hash' field to struct rps_dev_flow. 2. Add rps_flow_is_active(): a helper function to check if a flow is active or not, extracted from rps_may_expire_flow(). It is further simplified as per reviewer feedback. 3. In set_rps_cpu(): - Avoid overwriting by programming a new filter if: - The slot is not in use, or - The slot is in use but the flow is not active, or - The slot has an active flow with the same hash, but target CPU differs. - Save the hash in the rps_dev_flow entry. 4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active(). Testing & results: - Driver: ice (E810 NIC), Kernel: net-next - #CPUs = #RXq = 144 (1:1) - Number of flows: 12K - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal, it is still sufficient to cover 12K flows (256*144 rx-queues = 64K global table slots) - Global Table Size = 144 * RPS (effectively equal to 256 * RPS) - Each RPS test duration = 8 mins (org code) + 8 mins (new code). - Metrics captured on client Legend for following tables: Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu() Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries Add: #times driver actually programmed aRFS (ice_arfs_build_entry()) Del: #times driver deleted the flow (ice_arfs_del_flow_rules()) Units: K = 1,000 times, M = 1 million times |-------|---------|------| Org Code |---------|---------| | RPS | Latency | CPU | Add | Del | Steer-C | Steer-L | |-------|---------|------|--------|--------|---------|---------| | 256 | 227.0 | 93.2 | 1.6M | 1.6M | 121.7M | 267.6M | | 512 | 225.9 | 94.1 | 11.5M | 11.2M | 65.7M | 199.6M | | 1024 | 223.5 | 95.6 | 16.5M | 16.5M | 27.1M | 187.3M | | 2048 | 222.2 | 96.3 | 10.5M | 10.5M | 12.5M | 115.2M | | 4096 | 223.9 | 94.1 | 5.5M | 5.5M | 7.2M | 65.9M | | 8192 | 224.7 | 92.5 | 2.7M | 2.7M | 3.0M | 29.9M | | 16384 | 223.5 | 92.5 | 1.3M | 1.3M | 1.4M | 13.9M | | 32768 | 219.6 | 93.2 | 838.1K | 838.1K | 965.1K | 8.9M | |-------|---------|------| New Code |---------|---------| | 256 | 201.5 | 99.1 | 13.4K | 5.0K | 13.7K | 75.2K | | 512 | 202.5 | 98.2 | 11.2K | 5.9K | 11.2K | 55.5K | | 1024 | 207.3 | 93.9 | 11.5K | 9.7K | 11.5K | 59.6K | | 2048 | 207.5 | 96.7 | 11.8K | 11.1K | 15.5K | 79.3K | | 4096 | 206.9 | 96.6 | 11.8K | 11.7K | 11.8K | 63.2K | | 8192 | 205.8 | 96.7 | 11.9K | 11.8K | 11.9K | 63.9K | | 16384 | 200.9 | 98.2 | 11.9K | 11.9K | 11.9K | 64.2K | | 32768 | 202.5 | 98.0 | 11.9K | 11.9K | 11.9K | 64.2K | |-------|---------|------|--------|--------|---------|---------| Some observations: 1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67% 2. Overall CPU increased: (777.32-751.49)/751.45*100 = 3.44% 3. Flow Management (add/delete) remained almost constant at ~11K compared to values in millions. Signed-off-by: Krishna Kumar <krikku@gmail.com> Link: https://patch.msgid.link/20250825031005.3674864-2-krikku@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27net: rose: convert 'use' field to refcount_tTakamitsu Iwai
The 'use' field in struct rose_neigh is used as a reference counter but lacks atomicity. This can lead to race conditions where a rose_neigh structure is freed while still being referenced by other code paths. For example, when rose_neigh->use becomes zero during an ioctl operation via rose_rt_ioctl(), the structure may be removed while its timer is still active, potentially causing use-after-free issues. This patch changes the type of 'use' from unsigned short to refcount_t and updates all code paths to use rose_neigh_hold() and rose_neigh_put() which operate reference counts atomically. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250823085857.47674-3-takamitz@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>