summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2025-07-01net: mana: Handle Reset Request from MANA NICHaiyang Zhang
Upon receiving the Reset Request, pause the connection and clean up queues, wait for the specified period, then resume the NIC. In the cleanup phase, the HWC is no longer responding, so set hwc_timeout to zero to skip waiting on the response. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1751055983-29760-1-git-send-email-haiyangz@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30neighbor: Add NTF_EXT_VALIDATED flag for externally validated entriesIdo Schimmel
tl;dr ===== Add a new neighbor flag ("extern_valid") that can be used to indicate to the kernel that a neighbor entry was learned and determined to be valid externally. The kernel will not try to remove or invalidate such an entry, leaving these decisions to the user space control plane. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed. Background ========== In a typical EVPN multi-homing setup each host is multi-homed using a set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf switches (VTEPs). VTEPs that are connected to the same ES are called ES peers. When a neighbor entry is learned on a VTEP, it is distributed to both ES peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers use the neighbor entry when routing traffic towards the multi-homed host and remote VTEPs use it for ARP/NS suppression. Motivation ========== If the ES link between a host and the VTEP on which the neighbor entry was locally learned goes down, the EVPN MAC/IP advertisement route will be withdrawn and the neighbor entries will be removed from both ES peers and remote VTEPs. Routing towards the multi-homed host and ARP/NS suppression can fail until another ES peer locally learns the neighbor entry and distributes it via an EVPN MAC/IP advertisement route. "draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these intermittent failures by having the ES peers install the neighbor entries as before, but also injecting EVPN MAC/IP advertisement routes with a proxy indication. When the previously mentioned ES link goes down and the original EVPN MAC/IP advertisement route is withdrawn, the ES peers will not withdraw their neighbor entries, but instead start aging timers for the proxy indication. If an ES peer locally learns the neighbor entry (i.e., it becomes "reachable"), it will restart its aging timer for the entry and emit an EVPN MAC/IP advertisement route without a proxy indication. An ES peer will stop its aging timer for the proxy indication if it observes the removal of the proxy indication from at least one of the ES peers advertising the entry. In the event that the aging timer for the proxy indication expired, an ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer expired on all ES peers and they all withdrew their proxy advertisements, the neighbor entry will be completely removed from the EVPN fabric. Implementation ============== In the above scheme, when the control plane (e.g., FRR) advertises a neighbor entry with a proxy indication, it expects the corresponding entry in the data plane (i.e., the kernel) to remain valid and not be removed due to garbage collection or loss of carrier. The control plane also expects the kernel to notify it if the entry was learned locally (i.e., became "reachable") so that it will remove the proxy indication from the EVPN MAC/IP advertisement route. That is why these entries cannot be programmed with dummy states such as "permanent" or "noarp". Instead, add a new neighbor flag ("extern_valid") which indicates that the entry was learned and determined to be valid externally and should not be removed or invalidated by the kernel. The kernel can probe the entry and notify user space when it becomes "reachable" (it is initially installed as "stale"). However, if the kernel does not receive a confirmation, have it return the entry to the "stale" state instead of the "failed" state. In other words, an entry marked with the "extern_valid" flag behaves like any other dynamically learned entry other than the fact that the kernel cannot remove or invalidate it. One can argue that the "extern_valid" flag should not prevent garbage collection and that instead a neighbor entry should be programmed with both the "extern_valid" and "extern_learn" flags. There are two reasons for not doing that: 1. Unclear why a control plane would like to program an entry that the kernel cannot invalidate but can completely remove. 2. The "extern_learn" flag is used by FRR for neighbor entries learned on remote VTEPs (for ARP/NS suppression) whereas here we are concerned with local entries. This distinction is currently irrelevant for the kernel, but might be relevant in the future. Given that the flag only makes sense when the neighbor has a valid state, reject attempts to add a neighbor with an invalid state and with this flag set. For example: # ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid Error: Cannot create externally validated neighbor with an invalid state. # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid # ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid Error: Cannot mark neighbor as externally validated with an invalid state. The above means that a neighbor cannot be created with the "extern_valid" flag and flags such as "use" or "managed" as they result in a neighbor being created with an invalid state ("none") and immediately getting probed: # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use Error: Cannot create externally validated neighbor with an invalid state. However, these flags can be used together with "extern_valid" after the neighbor was created with a valid state: # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use One consequence of preventing the kernel from invalidating a neighbor entry is that by default it will only try to determine reachability using unicast probes. This can be changed using the "mcast_resolicit" sysctl: # sysctl net.ipv4.neigh.br0/10.mcast_resolicit 0 # tcpdump -nn -e -i br0.10 -Q out arp & # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 # sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 iproute2 patches can be found here [2]. [1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03 [2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-27tcp: remove inet_rtx_syn_ack()Eric Dumazet
inet_rtx_syn_ack() is a simple wrapper around tcp_rtx_synack(), if we move req->num_retrans update. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250626153017.2156274-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-27tcp: remove rtx_syn_ack fieldEric Dumazet
Now inet_rtx_syn_ack() is only used by TCP, it can directly call tcp_rtx_synack() instead of using an indirect call to req->rsk_ops->rtx_syn_ack(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250626153017.2156274-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc4). Conflicts: Documentation/netlink/specs/mptcp_pm.yaml 9e6dd4c256d0 ("netlink: specs: mptcp: replace underscores with dashes in names") ec362192aa9e ("netlink: specs: fix up indentation errors") https://lore.kernel.org/20250626122205.389c2cd4@canb.auug.org.au Adjacent changes: Documentation/netlink/specs/fou.yaml 791a9ed0a40d ("netlink: specs: fou: replace underscores with dashes in names") 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25net/sched: Remove unused functionsYue Haibing
Since commit c54e1d920f04 ("flow_offload: add ops to tc_action_ops for flow action setup") these are unused. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://patch.msgid.link/20250624014327.3686873-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25Merge tag 'wireless-next-2025-06-25' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== The usual features/cleanups/etc., notably: - rtw88: IBSS mode for SDIO devices - rtw89: - BT coex for MLO/WiFi7 - work on station + P2P concurrency - ath: fix W=2 export.h warnings - ath12k: fix scan on multi-radio devices - cfg80211/mac80211: MLO statistics - mac80211: S1G aggregation - cfg80211/mac80211: per-radio RTS threshold * tag 'wireless-next-2025-06-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (171 commits) wifi: iwlwifi: dvm: fix potential overflow in rs_fill_link_cmd() iwlwifi: Add missing check for alloc_ordered_workqueue wifi: iwlwifi: Fix memory leak in iwl_mvm_init() iwlwifi: api: delete repeated words iwlwifi: remove unused no_sleep_autoadjust declaration iwlwifi: Fix comment typo iwlwifi: use DECLARE_BITMAP macro iwlwifi: fw: simplify the iwl_fw_dbg_collect_trig() wifi: iwlwifi: mld: ftm: fix switch end indentation MAINTAINERS: update iwlwifi git link wifi: iwlwifi: pcie: fix non-MSIX handshake register wifi: iwlwifi: mld: don't exit EMLSR when we shouldn't wifi: iwlwifi: move _iwl_trans_set_bits_mask utilities wifi: iwlwifi: mld: make iwl_mld_add_all_rekeys void wifi: iwlwifi: move iwl_trans_pcie_write_mem to iwl-trans.c wifi: iwlwifi: pcie: move iwl_trans_pcie_dump_regs() to utils.c wifi: iwlwifi: mld: advertise support for TTLM changes wifi: iwlwifi: mld: Block EMLSR when scanning on P2P Device wifi: iwlwifi: mld: use the correct struct size for tracing wifi: iwlwifi: support RZL platform device ID ... ==================== Link: https://patch.msgid.link/20250625120135.41933-55-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-24udp_tunnel: fix deadlock in udp_tunnel_nic_set_port_priv()Paolo Abeni
While configuring a vxlan tunnel in a system with a i40e NIC driver, I observe the following deadlock: WARNING: possible recursive locking detected 6.16.0-rc2.net-next-6.16_92d87230d899+ #13 Tainted: G E -------------------------------------------- kworker/u256:4/1125 is trying to acquire lock: ffff88921ab9c8c8 (&utn->lock){+.+.}-{4:4}, at: i40e_udp_tunnel_set_port (/home/pabeni/net-next/include/net/udp_tunnel.h:343 /home/pabeni/net-next/drivers/net/ethernet/intel/i40e/i40e_main.c:13013) i40e but task is already holding lock: ffff88921ab9c8c8 (&utn->lock){+.+.}-{4:4}, at: udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:739) udp_tunnel other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&utn->lock); lock(&utn->lock); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by kworker/u256:4/1125: #0: ffff8892910ca158 ((wq_completion)udp_tunnel_nic){+.+.}-{0:0}, at: process_one_work (/home/pabeni/net-next/kernel/workqueue.c:3213) #1: ffffc900244efd30 ((work_completion)(&utn->work)){+.+.}-{0:0}, at: process_one_work (/home/pabeni/net-next/kernel/workqueue.c:3214) #2: ffffffff9a14e290 (rtnl_mutex){+.+.}-{4:4}, at: udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:737) udp_tunnel #3: ffff88921ab9c8c8 (&utn->lock){+.+.}-{4:4}, at: udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:739) udp_tunnel stack backtrace: Hardware name: Dell Inc. PowerEdge R7525/0YHMCJ, BIOS 2.2.5 04/08/2021 i Call Trace: <TASK> dump_stack_lvl (/home/pabeni/net-next/lib/dump_stack.c:123) print_deadlock_bug (/home/pabeni/net-next/kernel/locking/lockdep.c:3047) validate_chain (/home/pabeni/net-next/kernel/locking/lockdep.c:3901) __lock_acquire (/home/pabeni/net-next/kernel/locking/lockdep.c:5240) lock_acquire.part.0 (/home/pabeni/net-next/kernel/locking/lockdep.c:473 /home/pabeni/net-next/kernel/locking/lockdep.c:5873) __mutex_lock (/home/pabeni/net-next/kernel/locking/mutex.c:604 /home/pabeni/net-next/kernel/locking/mutex.c:747) i40e_udp_tunnel_set_port (/home/pabeni/net-next/include/net/udp_tunnel.h:343 /home/pabeni/net-next/drivers/net/ethernet/intel/i40e/i40e_main.c:13013) i40e udp_tunnel_nic_device_sync_by_port (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:230 /home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:249) udp_tunnel __udp_tunnel_nic_device_sync.part.0 (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:292) udp_tunnel udp_tunnel_nic_device_sync_work (/home/pabeni/net-next/net/ipv4/udp_tunnel_nic.c:742) udp_tunnel process_one_work (/home/pabeni/net-next/kernel/workqueue.c:3243) worker_thread (/home/pabeni/net-next/kernel/workqueue.c:3315 /home/pabeni/net-next/kernel/workqueue.c:3402) kthread (/home/pabeni/net-next/kernel/kthread.c:464) AFAICS all the existing callsites of udp_tunnel_nic_set_port_priv() are already under the utn lock scope, avoid (re-)acquiring it in such a function. Fixes: 1ead7501094c ("udp_tunnel: remove rtnl_lock dependency") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/95a827621ec78c12d1564ec3209e549774f9657d.1750675978.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-24wifi: mac80211: handle station association response with S1GLachlan Hodges
Add support for updating the stations S1G capabilities when an S1G association occurs. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20250617080610.756048-3-lachlan.hodges@morsemicro.com [remove unused S1G_CAP3_MAX_MPDU_LEN_3895/_7791] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24wifi: cfg80211: support configuration of S1G station capabilitiesLachlan Hodges
Currently there is no support for initialising a peers S1G capabilities, this patch adds support for configuring an S1G station. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20250617080610.756048-2-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24wifi: cfg80211: Add Support to Set RTS Threshold for each RadioRoopni Devanathan
Currently, setting RTS threshold is based on per-phy basis, i.e., all the radios present in a wiphy will take RTS threshold value to be the one sent from userspace. But each radio in a multi-radio wiphy can have different RTS threshold requirements. To extend support to set RTS threshold for each radio, get the radio for which RTS threshold needs to be changed from the user. Use the attribute in NL - NL80211_ATTR_WIPHY_RADIO_INDEX, to identify the radio of interest. Create a new structure - wiphy_radio_cfg and add rts_threshold in it as a u32 value to store RTS threshold of each radio in a wiphy and allocate memory for it during wiphy register based on the wiphy.n_radio updated by drivers. Pass radio id received from the user to mac80211 drivers along with its corresponding RTS threshold. Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com> Link: https://patch.msgid.link/20250615082312.619639-3-quic_rdevanat@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24wifi: cfg80211/mac80211: Add support to get radio indexRoopni Devanathan
Currently, per-radio attributes are set on per-phy basis, i.e., all the radios present in a wiphy will take attributes values sent from user. But each radio in a wiphy can get different values from userspace based on its requirement. To extend support to set per-radio attributes, add support to get radio index from userspace. Add an NL attribute - NL80211_ATTR_WIPHY_RADIO_INDEX, to get user specified radio index for which attributes should be changed. Pass this to individual drivers, so that the drivers can use this radio index to change per-radio attributes when necessary. Currently, per-radio attributes identified are: NL80211_ATTR_WIPHY_TX_POWER_LEVEL NL80211_ATTR_WIPHY_ANTENNA_TX NL80211_ATTR_WIPHY_ANTENNA_RX NL80211_ATTR_WIPHY_RETRY_SHORT NL80211_ATTR_WIPHY_RETRY_LONG NL80211_ATTR_WIPHY_FRAG_THRESHOLD NL80211_ATTR_WIPHY_RTS_THRESHOLD NL80211_ATTR_WIPHY_COVERAGE_CLASS NL80211_ATTR_TXQ_LIMIT NL80211_ATTR_TXQ_MEMORY_LIMIT NL80211_ATTR_TXQ_QUANTUM By default, the radio index is set to -1. This means the attribute should be treated as a global configuration. If the user has not specified any index, then the radio index passed to individual drivers would be -1. This would indicate that the attribute applies to all radios in that wiphy. Signed-off-by: Roopni Devanathan <quic_rdevanat@quicinc.com> Link: https://patch.msgid.link/20250615082312.619639-2-quic_rdevanat@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24wifi: mac80211: add link_sta_statistics ops to fill link station statisticsSarika Sharma
Currently, link station statistics for MLO are filled by mac80211. But there are some statistics that kept by mac80211 might not be accurate, so let the driver pre-fill the link statistics. The driver can fill the values (indicating which field is filled, by setting the filled bitmapin in link_station structure). Statistics that driver don't fill are filled by mac80211. Hence, add link_sta_statistics callback to fill link station statistics for MLO in sta_set_link_sinfo() by drivers. Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com> Link: https://patch.msgid.link/20250528054420.3050133-11-quic_sarishar@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24wifi: mac80211: extend support to fill link level sinfo structureSarika Sharma
Currently, sinfo structure is supported to fill information at deflink( or one of the links) level for station. This has problems when applied to fetch multi-link(ML) station information. Hence, if valid_links are present, support filling link_station structure for each link. This will be helpful to check the link related statistics during MLO. Additionally, TXQ stats for pertid are applicable at station level not at link level. Therefore check link_id is less then 0, before filling TXQ stats in pertid stats. Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com> Link: https://patch.msgid.link/20250528054420.3050133-9-quic_sarishar@quicinc.com [fix some indentation] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24wifi: cfg80211: allocate memory for link_station info structureSarika Sharma
Currently, station_info structure is passed to fill station statistics from mac80211/drivers. After NL message send to user space for requested station statistics, memory for station statistics is freed in cfg80211. Therefore, memory allocation/free for link station statistics should also happen in cfg80211 only. Hence, allocate the memory for link_station structure for all possible links and free in cfg80211_sinfo_release_content(). Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com> Link: https://patch.msgid.link/20250528054420.3050133-6-quic_sarishar@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24wifi: cfg80211: add link_station_info structure to support MLO statisticsSarika Sharma
Current implementation of NL80211_GET_STATION does not work for multi-link operation(MLO) since in case of MLO only deflink (or one of the links) is considered and not all links. Therefore to support for MLO, add link_station_info structure to account link level statistics for station. Additionally, add valid_links in station_info structure to indicate bitmap of valid links for MLO. This will be helpful to check the link related statistics during MLO. Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com> Link: https://patch.msgid.link/20250528054420.3050133-3-quic_sarishar@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24wifi: mac80211: add support towards MLO handling of station statisticsSarika Sharma
Currently, in supporting API's to fill sinfo structure from sta structure, is mapped to fill the fields from sta->deflink. However, for multi-link (ML) station, sinfo structure should be filled from corresponding link_id. Therefore, add link_id as an additional argument in supporting API's for filling sinfo structure correctly. Link_id is set to -1 for non-ML station and corresponding link_id for ML stations. In supporting API's for filling sinfo structure, check for link_id, if link_id < 0, fill the sinfo structure from sta->deflink, otherwise fill from sta->link[link_id]. Current, changes are done at the deflink level i.e, pass -1 as link_id. Actual link_id will be added in subsequent patches to support station statistics for MLO. Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com> Link: https://patch.msgid.link/20250528054420.3050133-2-quic_sarishar@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-24module: remove meaningless 'name' parameter from __MODULE_INFO()Masahiro Yamada
The symbol names in the .modinfo section are never used and already randomized by the __UNIQUE_ID() macro. Therefore, the second parameter of __MODULE_INFO() is meaningless and can be removed to simplify the code. With this change, the symbol names in the .modinfo section will be prefixed with __UNIQUE_ID_modinfo, making it clearer that they originate from MODULE_INFO(). [Before] $ objcopy -j .modinfo vmlinux.o modinfo.o $ nm -n modinfo.o | head -n10 0000000000000000 r __UNIQUE_ID_license560 0000000000000011 r __UNIQUE_ID_file559 0000000000000030 r __UNIQUE_ID_description558 0000000000000074 r __UNIQUE_ID_license580 000000000000008e r __UNIQUE_ID_file579 00000000000000bd r __UNIQUE_ID_description578 00000000000000e6 r __UNIQUE_ID_license581 00000000000000ff r __UNIQUE_ID_file580 0000000000000134 r __UNIQUE_ID_description579 0000000000000179 r __UNIQUE_ID_uncore_no_discover578 [After] $ objcopy -j .modinfo vmlinux.o modinfo.o $ nm -n modinfo.o | head -n10 0000000000000000 r __UNIQUE_ID_modinfo560 0000000000000011 r __UNIQUE_ID_modinfo559 0000000000000030 r __UNIQUE_ID_modinfo558 0000000000000074 r __UNIQUE_ID_modinfo580 000000000000008e r __UNIQUE_ID_modinfo579 00000000000000bd r __UNIQUE_ID_modinfo578 00000000000000e6 r __UNIQUE_ID_modinfo581 00000000000000ff r __UNIQUE_ID_modinfo580 0000000000000134 r __UNIQUE_ID_modinfo579 0000000000000179 r __UNIQUE_ID_modinfo578 Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
2025-06-23net: make sk->sk_rcvtimeo locklessEric Dumazet
Followup of commit 285975dd6742 ("net: annotate data-races around sk->sk_{rcv|snd}timeo"). Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout() and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23net: make sk->sk_sndtimeo locklessEric Dumazet
Followup of commit 285975dd6742 ("net: annotate data-races around sk->sk_{rcv|snd}timeo"). Remove lock_sock()/release_sock() from sock_set_sndtimeo(), and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23net: remove sock_i_uid()Eric Dumazet
Difference between sock_i_uid() and sk_uid() is that after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID while sk_uid() returns the last cached sk->sk_uid value. None of sock_i_uid() callers care about this. Use sk_uid() which is much faster and inlined. Note that diag/dump users are calling sock_i_ino() and can not see the full benefit yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23net: annotate races around sk->sk_uidEric Dumazet
sk->sk_uid can be read while another thread changes its value in sockfs_setattr(). Add sk_uid(const struct sock *sk) helper to factorize the needed READ_ONCE() annotations, and add corresponding WRITE_ONCE() where needed. Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://patch.msgid.link/20250620133001.4090592-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23Merge branch 'timestamp-for-jens' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next into for-6.17/io_uring Pull networking side timestamp prep patch from Jakub. * 'timestamp-for-jens' of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: net: timestamp: add helper returning skb's tx tstamp
2025-06-23Bluetooth: hci_core: Fix use-after-free in vhci_flush()Kuniyuki Iwashima
syzbot reported use-after-free in vhci_flush() without repro. [0] From the splat, a thread close()d a vhci file descriptor while its device was being used by iotcl() on another thread. Once the last fd refcnt is released, vhci_release() calls hci_unregister_dev(), hci_free_dev(), and kfree() for struct vhci_data, which is set to hci_dev->dev->driver_data. The problem is that there is no synchronisation after unlinking hdev from hci_dev_list in hci_unregister_dev(). There might be another thread still accessing the hdev which was fetched before the unlink operation. We can use SRCU for such synchronisation. Let's run hci_dev_reset() under SRCU and wait for its completion in hci_unregister_dev(). Another option would be to restore hci_dev->destruct(), which was removed in commit 587ae086f6e4 ("Bluetooth: Remove unused hci-destruct cb"). However, this would not be a good solution, as we should not run hci_unregister_dev() while there are in-flight ioctl() requests, which could lead to another data-race KCSAN splat. Note that other drivers seem to have the same problem, for exmaple, virtbt_remove(). [0]: BUG: KASAN: slab-use-after-free in skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline] BUG: KASAN: slab-use-after-free in skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937 Read of size 8 at addr ffff88807cb8d858 by task syz.1.219/6718 CPU: 1 UID: 0 PID: 6718 Comm: syz.1.219 Not tainted 6.16.0-rc1-syzkaller-00196-g08207f42d3ff #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xd2/0x2b0 mm/kasan/report.c:521 kasan_report+0x118/0x150 mm/kasan/report.c:634 skb_queue_empty_lockless include/linux/skbuff.h:1891 [inline] skb_queue_purge_reason+0x99/0x360 net/core/skbuff.c:3937 skb_queue_purge include/linux/skbuff.h:3368 [inline] vhci_flush+0x44/0x50 drivers/bluetooth/hci_vhci.c:69 hci_dev_do_reset net/bluetooth/hci_core.c:552 [inline] hci_dev_reset+0x420/0x5c0 net/bluetooth/hci_core.c:592 sock_do_ioctl+0xd9/0x300 net/socket.c:1190 sock_ioctl+0x576/0x790 net/socket.c:1311 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fcf5b98e929 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fcf5c7b9038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fcf5bbb6160 RCX: 00007fcf5b98e929 RDX: 0000000000000000 RSI: 00000000400448cb RDI: 0000000000000009 RBP: 00007fcf5ba10b39 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007fcf5bbb6160 R15: 00007ffd6353d528 </TASK> Allocated by task 6535: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1039 [inline] vhci_open+0x57/0x360 drivers/bluetooth/hci_vhci.c:635 misc_open+0x2bc/0x330 drivers/char/misc.c:161 chrdev_open+0x4c9/0x5e0 fs/char_dev.c:414 do_dentry_open+0xdf0/0x1970 fs/open.c:964 vfs_open+0x3b/0x340 fs/open.c:1094 do_open fs/namei.c:3887 [inline] path_openat+0x2ee5/0x3830 fs/namei.c:4046 do_filp_open+0x1fa/0x410 fs/namei.c:4073 do_sys_openat2+0x121/0x1c0 fs/open.c:1437 do_sys_open fs/open.c:1452 [inline] __do_sys_openat fs/open.c:1468 [inline] __se_sys_openat fs/open.c:1463 [inline] __x64_sys_openat+0x138/0x170 fs/open.c:1463 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 6535: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2381 [inline] slab_free mm/slub.c:4643 [inline] kfree+0x18e/0x440 mm/slub.c:4842 vhci_release+0xbc/0xd0 drivers/bluetooth/hci_vhci.c:671 __fput+0x44c/0xa70 fs/file_table.c:465 task_work_run+0x1d1/0x260 kernel/task_work.c:227 exit_task_work include/linux/task_work.h:40 [inline] do_exit+0x6ad/0x22e0 kernel/exit.c:955 do_group_exit+0x21c/0x2d0 kernel/exit.c:1104 __do_sys_exit_group kernel/exit.c:1115 [inline] __se_sys_exit_group kernel/exit.c:1113 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1113 x64_sys_call+0x21ba/0x21c0 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff88807cb8d800 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 88 bytes inside of freed 1024-byte region [ffff88807cb8d800, ffff88807cb8dc00) Fixes: bf18c7118cf8 ("Bluetooth: vhci: Free driver_data on file release") Reported-by: syzbot+2faa4825e556199361f9@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f62d64848fc4c7c30cd6 Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-06-20wifi: cfg80211: Add support for link reconfiguration negotiation offload to ↵Kavita Kavita
driver In the case of SME-in-driver, the driver can internally choose to update the links based on the AP MLD recommendation and do link reconfiguration negotiation with AP MLD. (e.g., After the driver processing the BSS Transition Management request frame received from the AP MLD with Neighbor Report containing Multi-Link element with recommended links information chooses to do link reconfiguration negotiation with AP MLD). To support this, extend cfg80211_mlo_reconf_add_done() and NL80211_CMD_ASSOC_MLO_RECONF to indicate added links information for driver-initiated link reconfiguration requests. For removed links, the driver indicates links information using the NL80211_CMD_LINKS_REMOVED event for driver-initiated cases, the same as supplicant initiated cases. For the driver-initiated case, cfg80211 will receive link reconfiguration result asynchronously from driver so holding BSSes of the accepted add links is needed in the event path. Also, no need of unhold call for the rejected add link BSSes since there was no hold call happened previously. Once the supplicant receives the NL80211_CMD_ASSOC_MLO_RECONF event, it needs to process the information about newly added links and install per-link group keys (e.g., GTK/IGTK/BIGTK etc.). In case of the SME-in-driver, using a vendor interface etc. to notify the supplicant to initiate a link reconfiguration request and then supplicant sending command to the cfg80211 can lead to race conditions. The correct design to avoid this is that the driver indicates the cfg80211 directly with the results of the link reconfiguration negotiation. Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com> Link: https://patch.msgid.link/20250604105757.2542-3-quic_kkavita@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20wifi: cfg80211: Add utility API to get radio index from channelVasanthakumar Thiagarajan
Add utility API cfg80211_get_radio_idx_by_chan() to retrieve the radio index corresponding to a given channel in a multi-radio wiphy. This utility function can be used when we want to check the radio-specific data for a channel in a multi-radio wiphy. For example, it can help determine the radio index required to handle a scan request. This index can then be used to decide whether the scan can proceed without interfering with ongoing DFS operations on another radio. Signed-off-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com> Co-developed-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com> Signed-off-by: Raj Kumar Bhagat <quic_rajkbhag@quicinc.com> Link: https://patch.msgid.link/20250527-mlo-dfs-acs-v2-1-92c2f37c81d9@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-19neighbour: add support for NUD_PERMANENT proxy entriesNicolas Escande
As discussesd before in [0] proxy entries (which are more configuration than runtime data) should stay when the link (carrier) goes does down. This is what happens for regular neighbour entries. So lets fix this by: - storing in proxy entries the fact that it was added as NUD_PERMANENT - not removing NUD_PERMANENT proxy entries when the carrier goes down (same as how it's done in neigh_flush_dev() for regular neigh entries) [0]: https://lore.kernel.org/netdev/c584ef7e-6897-01f3-5b80-12b53f7b4bf4@kernel.org/ Signed-off-by: Nicolas Escande <nico.escande@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250617141334.3724863-1-nico.escande@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19net: mana: Handle unsupported HWC commandsErni Sri Satya Vennela
If any of the HWC commands are not recognized by the underlying hardware, the hardware returns the response header status of -1. Log the information using netdev_info_once to avoid multiple error logs in dmesg. Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Link: https://patch.msgid.link/1750144656-2021-5-git-send-email-ernis@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-19net: mana: Add speed support in mana_get_link_ksettingsErni Sri Satya Vennela
Allow mana ethtool get_link_ksettings operation to report the maximum speed supported by the SKU in mbps. The driver retrieves this information by issuing a HWC command to the hardware via mana_query_link_cfg(), which retrieves the SKU's maximum supported speed. These APIs when invoked on hardware that are older/do not support these APIs, the speed would be reported as UNKNOWN. Before: $ethtool enP30832s1 > Settings for enP30832s1: Supported ports: [ ] Supported link modes: Not reported Supported pause frame use: No Supports auto-negotiation: No Supported FEC modes: Not reported Advertised link modes: Not reported Advertised pause frame use: No Advertised auto-negotiation: No Advertised FEC modes: Not reported Speed: Unknown! Duplex: Full Auto-negotiation: off Port: Other PHYAD: 0 Transceiver: internal Link detected: yes After: $ethtool enP30832s1 > Settings for enP30832s1: Supported ports: [ ] Supported link modes: Not reported Supported pause frame use: No Supports auto-negotiation: No Supported FEC modes: Not reported Advertised link modes: Not reported Advertised pause frame use: No Advertised auto-negotiation: No Advertised FEC modes: Not reported Speed: 16000Mb/s Duplex: Full Auto-negotiation: off Port: Other PHYAD: 0 Transceiver: internal Link detected: yes Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Long Li <longli@microsoft.com> Link: https://patch.msgid.link/1750144656-2021-4-git-send-email-ernis@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-19net: mana: Add support for net_shaper_opsErni Sri Satya Vennela
Introduce support for net_shaper_ops in the MANA driver, enabling configuration of rate limiting on the MANA NIC. To apply rate limiting, the driver issues a HWC command via mana_set_bw_clamp() and updates the corresponding shaper object in the net_shaper cache. If an error occurs during this process, the driver restores the previous speed by querying the current link configuration using mana_query_link_cfg(). The minimum supported bandwidth is 100 Mbps, and only values that are exact multiples of 100 Mbps are allowed. Any other values are rejected. To remove a shaper, the driver resets the bandwidth to the maximum supported by the SKU using mana_set_bw_clamp() and clears the associated cache entry. If an error occurs during this process, the shaper details are retained. On the hardware that does not support these APIs, the net-shaper calls to set speed would fail. Set the speed: ./tools/net/ynl/pyynl/cli.py \ --spec Documentation/netlink/specs/net_shaper.yaml \ --do set --json '{"ifindex":'$IFINDEX', "handle":{"scope": "netdev", "id":'$ID' }, "bw-max": 200000000 }' Get the shaper details: ./tools/net/ynl/pyynl/cli.py \ --spec Documentation/netlink/specs/net_shaper.yaml \ --do get --json '{"ifindex":'$IFINDEX', "handle":{"scope": "netdev", "id":'$ID' }}' > {'bw-max': 200000000, > 'handle': {'scope': 'netdev'}, > 'ifindex': $IFINDEX, > 'metric': 'bps'} Delete the shaper object: ./tools/net/ynl/pyynl/cli.py \ --spec Documentation/netlink/specs/net_shaper.yaml \ --do delete --json '{"ifindex":'$IFINDEX', "handle":{"scope": "netdev","id":'$ID' }}' Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Long Li <longli@microsoft.com> Link: https://patch.msgid.link/1750144656-2021-3-git-send-email-ernis@linux.microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-18devlink: Add new "enable_phc" generic device paramDavid Arinzon
Add a new device generic parameter to enable/disable the PHC (PTP Hardware Clock) functionality in the device associated with the devlink instance. Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250617110545.5659-6-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18udp_tunnel: remove rtnl_lock dependencyStanislav Fomichev
Drivers that are using ops lock and don't depend on RTNL lock still need to manage it because udp_tunnel's RTNL dependency. Introduce new udp_tunnel_nic_lock and use it instead of rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to grab udp_tunnel_nic_lock mutex and might sleep). Cover more places in v4: - netlink - udp_tunnel_notify_add_rx_port (ndo_open) - triggers udp_tunnel_nic_device_sync_work - udp_tunnel_notify_del_rx_port (ndo_stop) - triggers udp_tunnel_nic_device_sync_work - udp_tunnel_get_rx_info (__netdev_update_features) - triggers NETDEV_UDP_TUNNEL_PUSH_INFO - udp_tunnel_drop_rx_info (__netdev_update_features) - triggers NETDEV_UDP_TUNNEL_DROP_INFO - udp_tunnel_nic_reset_ntf (ndo_open) - notifiers - udp_tunnel_nic_netdevice_event, depending on the event: - triggers NETDEV_UDP_TUNNEL_PUSH_INFO - triggers NETDEV_UDP_TUNNEL_DROP_INFO - ethnl_tunnel_info_reply_size - udp_tunnel_nic_set_port_priv (two intel drivers) Cc: Michael Chan <michael.chan@broadcom.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18tcp: Remove inet_hashinfo2_free_mod()Yue Haibing
DCCP was removed, inet_hashinfo2_free_mod() is unused now. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250617130613.498659-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17Merge branch '200GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== libeth: add libeth_xdp helper lib Alexander Lobakin says: Time to add XDP helpers infra to libeth to greatly simplify adding XDP to idpf and iavf, as well as improve and extend XDP in ice and i40e. Any vendor is free to reuse helpers. If this happens, I'm fine with moving the folder of out intel/. The helpers greatly simplify building xdp_buff, running a prog, handling the verdict, implement XDP_TX, .ndo_xdp_xmit, XDP buffer completion. Same applies to XSk (with XSk xmit instead of .ndo_xdp_xmit, plus stuff like XSk wakeup). They are entirely generic with no HW definitions or assumptions. HW-specific stuff like parsing Rx desc / filling Tx desc is passed from the driver as inline callbacks. For now, key assumptions that optimize performance / avoid code bloat, but might not fit every driver in driver/net/: * netmem holding the buffers are always order-0; * driver has separate XDP Tx queues, doesn't use stack queues for that. For best efficiency, you may want to have nr_cpu_ids XDP queues, but less (queue sharing) is also supported; * XDP Tx queues are interrupt-less and use "lazy" cleaning only when there are less than 1/4 free Tx descriptors of the queue size; * main target platforms are 64-bit, although 32-bit is also fully supported, but the code might be not as optimized for them. Library code already supports multi-buffer for all kinds of Tx and both header split and no split for Rx and Tx. Frags can come from devmem/io_uring etc., direct `struct page *` is used only for header buffers for which it's always true. Drivers are free to pass their own Rx hints and XSK xmit hints ops. XDP_TX and ndo_xdp_xmit use onstack bulk for the frames to be sent and send them by batches of 16 buffers. This eats ~280 bytes on the stack, but gives good boosts and allow to greatly optimize the main sending function leaving it without any error/exception paths. XSk xmit fills Tx descriptors in the loop unrolled by 8. This was proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit doesn't use unrolling as I wasn't able to get any improvements in those scenenarios from this, while +1 Kb for their sending functions for nothing doesn't sound reasonable. XSk wakeup, instead of traditionally used "SW interrupts" provided by NICs, uses IPI to schedule NAPI on the CPU corresponding to the given queue pair. It gives better control over CPU distribution and in general performs way better than "SW interrupts", plus allows us to not pass any HW-specific callbacks there. The code is built the way that all callbacks passed from drivers get inlined; in general, most of hotpath gets inlined. Everything slow/exception lands to .c files in the libeth folder, doesn't create copies in the drivers themselves and doesn't overloat hotpath. Sure, inlining means that hotpath will be compiled into every driver that uses the lib, but the core code is written in one place, so no copying of bugs happens. Fixed once -- works everywhere. The last commit might look like sorta hack, but it gives really good boosts and decreases object code size, plus there are checks that all those wider accesses are fully safe, so I don't feel anything bad about it. An example of using libeth_xdp can be found either on my GitHub or on the mailing lists here ("XDP for idpf"). Macros for building driver XDP functions lead to that some implementations (XDP_TX, ndo_xdp_xmit etc.) consist of really only a few lines. * '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: libeth: xdp, xsk: access adjacent u32s as u64 where applicable libeth: xsk: add XSkFQ refill and XSk wakeup helpers libeth: xsk: add XSk Rx processing support libeth: xsk: add XSk xmit functions libeth: xsk: add XSk XDP_TX sending helpers libeth: xdp: add RSS hash hint and XDP features setup helpers libeth: xdp: add templates for building driver-side callbacks libeth: xdp: add XDP prog run and verdict result handling libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff libeth: xdp: add XDPSQ cleanup timers libeth: xdp: add XDPSQ locking helpers libeth: xdp: add XDPSQE completion helpers libeth: xdp: add .ndo_xdp_xmit() helpers libeth: xdp: add XDP_TX buffers sending libeth: support native XDP and register memory model libeth: convert to netmem libeth, libie: clean symbol exports up a little ==================== Link: https://patch.msgid.link/20250616201639.710420-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17page_pool: Add page_pool_dev_alloc_netmems helperDragos Tatulea
This is the netmem counterpart of page_pool_dev_alloc_pages() which uses the default GFP flags for RX. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: Allow const args for of page_to_netmem()Dragos Tatulea
This allows calling page_to_netmem() with a const page * argument. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: tcp: tsq: Convert from tasklet to BH workqueueTejun Heo
The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws. To replace tasklets, BH workqueue support was recently added. A BH workqueue behaves similarly to regular workqueues except that the queued work items are executed in the BH context. This patch converts TCP Small Queues implementation from tasklet to BH workqueue. Semantically, this is an equivalent conversion and there shouldn't be any user-visible behavior changes. While workqueue's queueing and execution paths are a bit heavier than tasklet's, unless the work item is being queued every packet, the difference hopefully shouldn't matter. My experience with the networking stack is very limited and this patch definitely needs attention from someone who actually understands networking. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17vxlan: Support MC routing in the underlayPetr Machata
Locally-generated MC packets have so far not been subject to MC routing. Instead an MC-enabled installation would maintain the MC routing tables, and separately from that the list of interfaces to send packets to as part of the VXLAN FDB and MDB. In a previous patch, a ip_mr_output() and ip6_mr_output() routines were added for IPv4 and IPv6. All locally generated MC traffic is now passed through these functions. For reasons of backward compatibility, an SKB (IPCB / IP6CB) flag guards the actual MC routing. This patch adds logic to set the flag, and the UAPI to enable the behavior. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/d899655bb7e9b2521ee8c793e67056b9fd02ba12.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv6: Add a flags argument to ip6tunnel_xmit(), udp_tunnel6_xmit_skb()Petr Machata
ip6tunnel_xmit() erases the contents of the SKB control block. In order to be able to set particular IP6CB flags on the SKB, add a corresponding parameter, and propagate it to udp_tunnel6_xmit_skb() as well. In one of the following patches, VXLAN driver will use this facility to mark packets as subject to IPv6 multicast routing. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/acb4f9f3e40c3a931236c3af08a720b017fbfbfb.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv6: Make udp_tunnel6_xmit_skb() voidPetr Machata
The function always returns zero, thus the return value does not carry any signal. Just make it void. Most callers already ignore the return value. However: - Refold arguments of the call from sctp_v6_xmit() so that they fit into the 80-column limit. - tipc_udp_xmit() initializes err from the return value, but that should already be always zero at that point. So there's no practical change, but elision of the assignment prompts a couple more tweaks to clean up the function. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/7facacf9d8ca3ca9391a4aee88160913671b868d.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv4: Add ip_mr_output()Petr Machata
Multicast routing is today handled in the input path. Locally generated MC packets don't hit the IPMR code today. Thus if a VXLAN remote address is multicast, the driver needs to set an OIF during route lookup. Thus MC routing configuration needs to be kept in sync with the VXLAN FDB and MDB. Ideally, the VXLAN packets would be routed by the MC routing code instead. To that end, this patch adds support to route locally generated multicast packets. The newly-added routines do largely what ip_mr_input() and ip_mr_forward() do: make an MR cache lookup to find where to send the packets, and use ip_mc_output() to send each of them. When no cache entry is found, the packet is punted to the daemon for resolution. However, an installation that uses a VXLAN underlay netdevice for which it also has matching MC routes, would get a different routing with this patch. Previously, the MC packets would be delivered directly to the underlay port, whereas now they would be MC-routed. In order to avoid this change in behavior, introduce an IPCB flag. Only if the flag is set will ip_mr_output() actually engage, otherwise it reverts to ip_mc_output(). This code is based on work by Roopa Prabhu and Nikolay Aleksandrov. Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/0aadbd49330471c0f758d54afb05eb3b6e3a6b65.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv4: Add a flags argument to iptunnel_xmit(), udp_tunnel_xmit_skb()Petr Machata
iptunnel_xmit() erases the contents of the SKB control block. In order to be able to set particular IPCB flags on the SKB, add a corresponding parameter, and propagate it to udp_tunnel_xmit_skb() as well. In one of the following patches, VXLAN driver will use this facility to mark packets as subject to IP multicast routing. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/89c9daf9f2dc088b6b92ccebcc929f51742de91f.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17netmem: fix netmem commentsMina Almasry
Trivial fix to a couple of outdated netmem comments. No code changes, just more accurately describing current code. Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250615203511.591438-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: dsa: tag_brcm: add support for legacy FCS tagsÁlvaro Fernández Rojas
Add support for legacy Broadcom FCS tags, which are similar to DSA_TAG_PROTO_BRCM_LEGACY. BCM5325 and BCM5365 switches require including the original FCS value and length, as opposed to BCM63xx switches. Adding the original FCS value and length to DSA_TAG_PROTO_BRCM_LEGACY would impact performance of BCM63xx switches, so it's better to create a new tag. Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/20250614080000.1884236-3-noltari@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17tcp: remove RFC3517/RFC6675 tcp_clear_retrans_hints_partial()Neal Cardwell
Now that we have removed the RFC3517/RFC6675 hints, tcp_clear_retrans_hints_partial() is empty, and can be removed. Suggested-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250615001435.2390793-4-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17tcp: remove RFC3517/RFC6675 hint state: lost_skb_hint, lost_cnt_hintNeal Cardwell
Now that obsolete RFC3517/RFC6675 TCP loss detection has been removed, we can remove the somewhat complex and intrusive code to maintain its hint state: lost_skb_hint and lost_cnt_hint. This commit makes tcp_clear_retrans_hints_partial() empty. We will remove tcp_clear_retrans_hints_partial() and its call sites in the next commit. Suggested-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250615001435.2390793-3-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17Merge branch 'io_uring-cmd-for-tx-timestamps'Jakub Kicinski
Pavel Begunkov says: ==================== io_uring cmd for tx timestamps (part) Apply the networking helpers for the io_uring timestamp API. ==================== Link: https://patch.msgid.link/cover.1750065793.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: timestamp: add helper returning skb's tx tstampPavel Begunkov
Add a helper function skb_get_tx_timestamp() that returns a tx timestamp associated with an error queue skb. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/702357dd8936ef4c0d3864441e853bfe3224a677.1750065793.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17Merge branch 'shradha_v6.16-rc1' of https://github.com/shradhagupta6/linuxJakub Kicinski
Shradha Gupta says: ==================== Allow dyn MSI-X vector allocation of MANA In this patchset we want to enable the MANA driver to be able to allocate MSI-X vectors in PCI dynamically. The first patch exports pci_msix_prepare_desc() in PCI to be able to correctly prepare descriptors for dynamically added MSI-X vectors. The second patch adds the support of dynamic vector allocation in pci-hyperv PCI controller by enabling the MSI_FLAG_PCI_MSIX_ALLOC_DYN flag and using the pci_msix_prepare_desc() exported in first patch. The third patch adds a detailed description of the irq_setup(), to help understand the function design better. The fourth patch is a preparation patch for mana changes to support dynamic IRQ allocation. It contains changes in irq_setup() to allow skipping first sibling CPU sets, in case certain IRQs are already affinitized to them. The fifth patch has the changes in MANA driver to be able to allocate MSI-X vectors dynamically. If the support does not exist it defaults to older behavior. * 'shradha_v6.16-rc1' of https://github.com/shradhagupta6/linux: net: mana: Allocate MSI-X vectors dynamically net: mana: Allow irq_setup() to skip cpus for affinity net: mana: explain irq_setup() algorithm PCI: hv: Allow dynamic MSI-X vector allocation PCI/MSI: Export pci_msix_prepare_desc() for dynamic MSI-X allocations ==================== Link: https://patch.msgid.link/1749650984-9193-1-git-send-email-shradhagupta@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: mana: Allocate MSI-X vectors dynamicallyShradha Gupta
Currently, the MANA driver allocates MSI-X vectors statically based on MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends up allocating more vectors than it needs. This is because, by this time we do not have a HW channel and do not know how many IRQs should be allocated. To avoid this, we allocate 1 MSI-X vector during the creation of HWC and after getting the value supported by hardware, dynamically add the remaining MSI-X vectors. Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>