summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2026-01-20net: fclone allocation small optimizationEric Dumazet
After skb allocation, initial skb->fclone value is 0 (SKB_FCLONE_UNAVAILABLE) We can replace one RMW sequence with a single OR instruction. movzbl 0x7e(%r13),%eax // skb->fclone = SKB_FCLONE_ORIG; and $0xf3,%al or $0x4,%al mov %al,0x7e(%r13) -> or $0x4,0x7e(%r13) // skb->fclone |= SKB_FCLONE_ORIG; Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260116164402.1872649-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20net: remove legacy way to get/set HW timestamp configVadim Fedorenko
With all drivers converted to use ndo_hwstamp callbacks the legacy way can be removed, marking ioctl interface as deprecated. Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20260116062121.1230184-1-vadim.fedorenko@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20net: split kmalloc_reserve() to allow inliningEric Dumazet
kmalloc_reserve() is too big to be inlined. Put the slow path in a new out-of-line function : kmalloc_pfmemalloc() Then let kmalloc_reserve() set skb->pfmemalloc only when/if the slow path is taken. This makes __alloc_skb() faster : - kmalloc_reserve() is now automatically inlined by both gcc and clang. - No more expensive RMW (skb->pfmemalloc = pfmemalloc). - No more expensive stack canary (for CONFIG_STACKPROTECTOR_STRONG=y). - Removal of two prefetches that were coming too late for modern cpus. Text size increase is quite small compared to the cpu savings (~0.7 %) $ size net/core/skbuff.clang.before.o net/core/skbuff.clang.after.o text data bss dec hex filename 72507 5897 0 78404 13244 net/core/skbuff.clang.before.o 72681 5897 0 78578 132f2 net/core/skbuff.clang.after.o Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260116041359.181104-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20Merge tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linuxJakub Kicinski
Pavel Begunkov says: ==================== Add support for providers with large rx buffer Many modern NICs support configurable receive buffer lengths, and zcrx and memory providers can use buffers larger than 4K to improve performance. When paired with hw-gro larger rx buffer sizes can drastically reduce the number of buffers traversing the stack and save a lot of processing time. It also allows to give to users larger contiguous chunks of data. Single stream benchmarks showed up to ~30% CPU util improvement. E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC: packets=23987040 (MB=2745098), rps=199559 (MB/s=22837) CPU %usr %nice %sys %iowait %irq %soft %idle 0 1.53 0.00 27.78 2.72 1.31 66.45 0.22 packets=24078368 (MB=2755550), rps=200319 (MB/s=22924) CPU %usr %nice %sys %iowait %irq %soft %idle 0 0.69 0.00 8.26 31.65 1.83 57.00 0.57 This series adds net infrastructure for memory providers configuring the size and implements it for bnxt. It's an opt-in feature for drivers, they should advertise support for the parameter in the qops and must check if the hardware supports the given size. It's limited to memory providers as it drastically simplifies implementation. It doesn't affect the fast path zcrx uAPI, and the user exposed parameter is defined in zcrx terms, which allows it to be flexible and adjusted in the future. A liburing example can be found at [2] full branch: [1] https://github.com/isilence/linux.git zcrx/large-buffers-v8 Liburing example: [2] https://github.com/isilence/liburing.git zcrx/rx-buf-len * tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux: io_uring/zcrx: document area chunking parameter selftests: iou-zcrx: test large chunk sizes eth: bnxt: support qcfg provided rx page size eth: bnxt: adjust the fill level of agg queues with larger buffers eth: bnxt: store rx buffer size per queue net: pass queue rx page size from memory provider net: add bare bone queue configs net: reduce indent of struct netdev_queue_mgmt_ops members net: memzero mp params when closing a queue ==================== Link: https://patch.msgid.link/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20Revert "Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'"Jakub Kicinski
This reverts commit 77b9c4a438fc66e2ab004c411056b3fb71a54f2c, reversing changes made to 4515ec4ad58a37e70a9e1256c0b993958c9b7497: 931420a2fc36 ("selftests/net: Add netkit container tests") ab771c938d9a ("selftests/net: Make NetDrvContEnv support queue leasing") 6be87fbb2776 ("selftests/net: Add env for container based tests") 61d99ce3dfc2 ("selftests/net: Add bpf skb forwarding program") 920da3634194 ("netkit: Add xsk support for af_xdp applications") eef51113f8af ("netkit: Add netkit notifier to check for unregistering devices") b5ef109d22d4 ("netkit: Implement rtnl_link_ops->alloc and ndo_queue_create") b5c3fa4a0b16 ("netkit: Add single device mode for netkit") 0073d2fd679d ("xsk: Proxy pool management for leased queues") 1ecea95dd3b5 ("xsk: Extend xsk_rcv_check validation") 804bf334d08a ("net: Proxy netdev_queue_get_dma_dev for leased queues") 0caa9a8ddec3 ("net: Proxy net_mp_{open,close}_rxq for leased queues") ff8889ff9107 ("net, ethtool: Disallow leased real rxqs to be resized") 9e2103f36110 ("net: Add lease info to queue-get response") 31127deddef4 ("net: Implement netdev_nl_queue_create_doit") a5546e18f77c ("net: Add queue-create operation") The series will conflict with io_uring work, and the code needs more polish. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20bpf: Fix memory access flags in helper prototypesZesen Liu
After commit 37cce22dbd51 ("bpf: verifier: Refactor helper access type tracking"), the verifier started relying on the access type flags in helper function prototypes to perform memory access optimizations. Currently, several helper functions utilizing ARG_PTR_TO_MEM lack the corresponding MEM_RDONLY or MEM_WRITE flags. This omission causes the verifier to incorrectly assume that the buffer contents are unchanged across the helper call. Consequently, the verifier may optimize away subsequent reads based on this wrong assumption, leading to correctness issues. For bpf_get_stack_proto_raw_tp, the original MEM_RDONLY was incorrect since the helper writes to the buffer. Change it to ARG_PTR_TO_UNINIT_MEM which correctly indicates write access to potentially uninitialized memory. Similar issues were recently addressed for specific helpers in commit ac44dcc788b9 ("bpf: Fix verifier assumptions of bpf_d_path's output buffer") and commit 2eb7648558a7 ("bpf: Specify access type of bpf_sysctl_get_name args"). Fix these prototypes by adding the correct memory access flags. Fixes: 37cce22dbd51 ("bpf: verifier: Refactor helper access type tracking") Co-developed-by: Shuran Liu <electronlsr@gmail.com> Signed-off-by: Shuran Liu <electronlsr@gmail.com> Co-developed-by: Peili Gao <gplhust955@gmail.com> Signed-off-by: Peili Gao <gplhust955@gmail.com> Co-developed-by: Haoran Ni <haoran.ni.cs@gmail.com> Signed-off-by: Haoran Ni <haoran.ni.cs@gmail.com> Signed-off-by: Zesen Liu <ftyghome@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260120-helper_proto-v3-1-27b0180b4e77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-20Merge tag 'mhi-for-v6.20' of ↵Greg Kroah-Hartman
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mani/mhi into char-misc-next Manivannan writes: MHI Host -------- - Add support for loading dual ELF image format firmware to Qcom Trust Management Engine Lit (TME-L) supported devices like QCC2072, which require separate ELF header for SBL and WLAN firmware segments in a single firmware. - Remove the MHI auto_queue feature support. This feature was added to offload the queuing of buffers from the client drivers to the MHI stack, but it caused a lot of race over the time. So remove this feature from the QRTR client driver and also from the MHI stack/controller drivers. - Move the .probe() and .remove() callbacks from driver level to bus level. MHI Endpoint ------------ - Move the .probe() and .remove() callbacks from driver level to bus level. * tag 'mhi-for-v6.20' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mani/mhi: bus: mhi: ep: Use bus callbacks for .probe() and .remove() bus: mhi: host: Use bus callbacks for .probe() and .remove() bus: mhi: host: Drop the auto_queue support net: qrtr: Drop the MHI auto_queue feature for IPCR DL channels mhi: host: Add support for loading dual ELF image format
2026-01-20netfilter: xt_tcpmss: check remaining length before reading optlenFlorian Westphal
Quoting reporter: In net/netfilter/xt_tcpmss.c (lines 53-68), the TCP option parser reads op[i+1] directly without validating the remaining option length. If the last byte of the option field is not EOL/NOP (0/1), the code attempts to index op[i+1]. In the case where i + 1 == optlen, this causes an out-of-bounds read, accessing memory past the optlen boundary (either reading beyond the stack buffer _opt or the following payload). Reported-by: sungzii <sungzii@pm.me> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20netfilter: nf_conncount: fix tracking of connections from localhostFernando Fernandez Mancera
Since commit be102eb6a0e7 ("netfilter: nf_conncount: rework API to use sk_buff directly"), we skip the adding and trigger a GC when the ct is confirmed. For connections originated from local to local it doesn't work because the connection is confirmed on POSTROUTING, therefore tracking on the INPUT hook is always skipped. In order to fix this, we check whether skb input ifindex is set to loopback ifindex. If it is then we fallback on a GC plus track operation skipping the optimization. This fallback is necessary to avoid duplicated tracking of a packet train e.g 10 UDP datagrams sent on a burst when initiating the connection. Tested with xt_connlimit/nft_connlimit and OVS limit and with a HTTP server and iperf3 on UDP mode. Fixes: be102eb6a0e7 ("netfilter: nf_conncount: rework API to use sk_buff directly") Reported-by: Michal Slabihoudek <michal.slabihoudek@gooddata.com> Closes: https://lore.kernel.org/netfilter/6989BD9F-8C24-4397-9AD7-4613B28BF0DB@gooddata.com/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20netfilter: nft_compat: add more restrictions on netlink attributesFlorian Westphal
As far as I can see nothing bad can happen when NFTA_TARGET/MATCH_NAME are too large because this calls x_tables helpers which check for the length, but it seems better to already reject it during netlink parsing. Rest of the changes avoid silent u8/u16 truncations. For _TYPE, its expected to be only 1 or 0. In x_tables world, this variable is set by kernel, for IPT_SO_GET_REVISION_TARGET its 1, for all others its set to 0. As older versions of nf_tables permitted any value except 1 to mean 'match', keep this as-is but sanitize the value for consistency. Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables") Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20netfilter: nfnetlink_queue: nfqnl_instance GFP_ATOMIC -> GFP_KERNEL_ACCOUNT ↵Scott Mitchell
allocation Currently, instance_create() uses GFP_ATOMIC because it's called while holding instances_lock spinlock. This makes allocation more likely to fail under memory pressure. Refactor nfqnl_recv_config() to drop RCU lock after instance_lookup() and peer_portid verification. A socket cannot simultaneously send a message and close, so the queue owned by the sending socket cannot be destroyed while processing its CONFIG message. This allows instance_create() to allocate with GFP_KERNEL_ACCOUNT before taking the spinlock. Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20netfilter: nf_conntrack: don't rely on implicit includesFlorian Westphal
several netfilter compilation units rely on implicit includes coming from nf_conntrack_proto_gre.h. Clean this up and add the required dependencies where needed. nf_conntrack.h requires net_generic() helper. Place various gre/ppp/vlan includes to where they are needed. Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20netfilter: don't include xt and nftables.h in unrelated subsystemsFlorian Westphal
conntrack, xtables and nftables are distinct subsystems, don't use them in other subystems. Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20netfilter: nf_conntrack: enable icmp clash supportFlorian Westphal
Not strictly required, but should not be harmful either: This isn't a stateful protocol, hence clash resolution should work fine. Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20netfilter: nf_conncount: increase the connection clean up limit to 64Fernando Fernandez Mancera
After the optimization to only perform one GC per jiffy, a new problem was introduced. If more than 8 new connections are tracked per jiffy the list won't be cleaned up fast enough possibly reaching the limit wrongly. In order to prevent this issue, only skip the GC if it was already triggered during the same jiffy and the increment is lower than the clean up limit. In addition, increase the clean up limit to 64 connections to avoid triggering GC too often and do more effective GCs. This has been tested using a HTTP server and several performance tools while having nft_connlimit/xt_connlimit or OVS limit configured. Output of slowhttptest + OVS limit at 52000 connections: slow HTTP test status on 340th second: initializing: 0 pending: 432 connected: 51998 error: 0 closed: 0 service available: YES Fixes: d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC") Reported-by: Aleksandra Rukomoinikova <ARukomoinikova@k2.cloud> Closes: https://lore.kernel.org/netfilter/b2064e7b-0776-4e14-adb6-c68080987471@k2.cloud/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20netfilter: nf_conntrack: Add allow_clash to generic protocol handlerYuto Hamaguchi
The upstream commit, 71d8c47fc653711c41bc3282e5b0e605b3727956 ("netfilter: conntrack: introduce clash resolution on insertion race"), sets allow_clash=true in the UDP/UDPLITE protocol handler but does not set it in the generic protocol handler. As a result, packets composed of connectionless protocols at each layer, such as UDP over IP-in-IP, still drop packets due to conflicts during conntrack insertion. To resolve this, this patch sets allow_clash in the nf_conntrack_l4proto_generic. Signed-off-by: Yuto Hamaguchi <Hamaguchi.Yuto@da.MitsubishiElectric.co.jp> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20netfilter: nf_tables: reset table validation state on abortFlorian Westphal
If a transaction fails the final validation in the commit hook, the table validation state is changed to NFT_VALIDATE_DO and a replay of the batch is performed. Every rule insert will then do a graph validation. This is much slower, but provides better error reporting to the user because we can point at the rule that introduces the validation issue. Without this reset the affected table(s) remain in full validation mode, i.e. on next transaction we start with slow-mode. This makes the next transaction after a failed incremental update very slow: # time iptables-restore < /tmp/ruleset real 0m0.496s [..] # time iptables -A CALLEE -j CALLER iptables v1.8.11 (nf_tables): RULE_APPEND failed (Too many links): rule in chain CALLEE real 0m0.022s [..] # time iptables-restore < /tmp/ruleset real 1m22.355s [..] After this patch, 2nd iptables-restore is back to ~0.5s. Fixes: 9a32e9850686 ("netfilter: nf_tables: don't write table validation state without mutex") Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20xsk: Proxy pool management for leased queuesDaniel Borkmann
Similarly to the net_mp_{open,close}_rxq handling for leased queues, proxy the xsk_{reg,clear}_pool_at_qid via netif_get_rx_queue_lease_locked such that in case a virtual netdev picked a leased rxq, the request gets through to the real rxq in the physical netdev. The proxying is only relevant for queue_id < dev->real_num_rx_queues since right now its only supported for rxqs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-9-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20xsk: Extend xsk_rcv_check validationDaniel Borkmann
xsk_rcv_check tests for inbound packets to see whether they match the bound AF_XDP socket. Refactor the test into a small helper xsk_dev_queue_valid and move the validation against xs->dev and xs->queue_id there. The fast-path case stays in place and allows for quick return in xsk_dev_queue_valid. If it fails, the validation is extended to check whether the AF_XDP socket is bound against a leased queue, and if the case then the test is redone. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-8-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20net: Proxy netdev_queue_get_dma_dev for leased queuesDavid Wei
Extend netdev_queue_get_dma_dev to return the physical device of the real rxq for DMA in case the queue was leased. This allows memory providers like io_uring zero-copy or devmem to bind to the physically leased rxq via virtual devices such as netkit. Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-7-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20net: Proxy net_mp_{open,close}_rxq for leased queuesDavid Wei
When a process in a container wants to setup a memory provider, it will use the virtual netdev and a leased rxq, and call net_mp_{open,close}_rxq to try and restart the queue. At this point, proxy the queue restart on the real rxq in the physical netdev. For memory providers (io_uring zero-copy rx and devmem), it causes the real rxq in the physical netdev to be filled from a memory provider that has DMA mapped memory from a process within a container. Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260115082603.219152-6-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20net, ethtool: Disallow leased real rxqs to be resizedDaniel Borkmann
Similar to AF_XDP, do not allow queues in a physical netdev to be resized by ethtool -L when they are leased. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260115082603.219152-5-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20net: Add lease info to queue-get responseDaniel Borkmann
Populate nested lease info to the queue-get response that returns the ifindex, queue id with type and optionally netns id if the device resides in a different netns. Example with ynl client: # ip a [...] 4: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:24 qdisc mq state UP group default qlen 1000 link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff inet 10.0.0.2/24 scope global enp10s0f0np0 valid_lft forever preferred_lft forever inet6 fe80::eaeb:d3ff:fea3:43f6/64 scope link proto kernel_ll valid_lft forever preferred_lft forever [...] # ethtool -i enp10s0f0np0 driver: mlx5_core [...] # ./pyynl/cli.py \ --spec ~/netlink/specs/netdev.yaml \ --do queue-get \ --json '{"ifindex": 4, "id": 15, "type": "rx"}' {'id': 15, 'ifindex': 4, 'lease': {'ifindex': 8, 'netns-id': 0, 'queue': {'id': 1, 'type': 'rx'}}, 'napi-id': 8227, 'type': 'rx', 'xsk': {}} # ip netns list foo (id: 0) # ip netns exec foo ip a [...] 8: nk@NONE: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff inet6 fe80::200:ff:fe00:0/64 scope link proto kernel_ll valid_lft forever preferred_lft forever [...] # ip netns exec foo ethtool -i nk driver: netkit [...] # ip netns exec foo ls /sys/class/net/nk/queues/ rx-0 rx-1 tx-0 # ip netns exec foo ./pyynl/cli.py \ --spec ~/netlink/specs/netdev.yaml \ --do queue-get \ --json '{"ifindex": 8, "id": 1, "type": "rx"}' {'id': 1, 'ifindex': 8, 'type': 'rx'} Note that the caller of netdev_nl_queue_fill_one() holds the netdevice lock. For the queue-get we do not lock both devices. When queues get {un,}leased, both devices are locked, thus if __netif_get_rx_queue_peer() returns true, the peer pointer points to a valid device. The netns-id is fetched via peernet2id_alloc() similarly as done in OVS. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260115082603.219152-4-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20net: Implement netdev_nl_queue_create_doitDaniel Borkmann
Implement netdev_nl_queue_create_doit which creates a new rx queue in a virtual netdev and then leases it to a rx queue in a physical netdev. Example with ynl client: # ./pyynl/cli.py \ --spec ~/netlink/specs/netdev.yaml \ --do queue-create \ --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}' {'id': 1} Note that the netdevice locking order is always from the virtual to the physical device. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-3-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20net: Add queue-create operationDaniel Borkmann
Add a ynl netdev family operation called queue-create that creates a new queue on a netdevice: name: queue-create attribute-set: queue flags: [admin-perm] do: request: attributes: - ifindex - type - lease reply: &queue-create-op attributes: - id This is a generic operation such that it can be extended for various use cases in future. Right now it is mandatory to specify ifindex, the queue type which is enforced to rx and a lease. The newly created queue id is returned to the caller. A queue from a virtual device can have a lease which refers to another queue from a physical device. This is useful for memory providers and AF_XDP operations which take an ifindex and queue id to allow applications to bind against virtual devices in containers. The lease couples both queues together and allows to proxy the operations from a virtual device in a container to the physical device. In future, the nested lease attribute can be lifted and made optional for other use-cases such as dynamic queue creation for physical netdevs. The lack of lease and the specification of the physical device as an ifindex will imply that we need a real queue to be allocated. Similarly, the queue type enforcement to rx can then be lifted as well to support tx. An early implementation had only driver-specific integration [0], but in order for other virtual devices to reuse, it makes sense to have this as a generic API in core net. For leasing queues, the virtual netdev must have real_num_rx_queue less than num_rx_queues at the time of calling queue-create. The queue-type must be rx as only rx queues are supported for leasing for now. We also enforce that the queue-create ifindex must point to a virtual device, and that the nested lease attribute's ifindex must point to a physical device. The nested lease attribute set contains a netns-id attribute which is currently only intended for dumping as part of the queue-get operation. Also, it is modeled as an s32 type similarly as done elsewhere in the stack. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0] Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-2-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-20wifi: cfg80211: ignore link disabled flag from userspaceBenjamin Berg
When the AP has an advertised TID to Link Mapping (TTLM) it shall include the element in the association response. As such, when this element is present it needs to be used for the currently dormant links. See Draft P802.11REVmf_D1.0 section 35.3.7.2.3 ("Negotiation of TTLM") for the details. The flag is also not usable in case userspace wants to specify a negotiated TTLM during association. Note that for the link reconfiguration case, mac80211 did not use the information. Draft P802.11REVmf_D1.0 states in section 35.3.6.4 ("Link reconfiguration to the setup links) that we "shall operate with all the TIDs mapped to the newly added links ..." All this means that the flag is not needed. The implementation should parse the information from the association response. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260118093904.754e057896a5.Ifd06f5ef839a93bfd54d0593dc932870f95f3242@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-20wifi: mac80211: apply advertised TTLM from association responseBenjamin Berg
When the AP has a disabled link that the station can include in the association, the fact that the link is dormant needs to be advertised in the TID to Link Mapping (TTLM). Section 35.3.7.2.3 ("Negotiation of TTLM") of Draft P802.11REVmf_D1.0 also states that the mapping needs to be included in the association response frame. As such, we can simply rely on the TTLM from the association response. Before this change mac80211 would not properly track that an advertised TTLM was effectively active, resulting in it not enabling the link once it became available again. For the link reconfiguration case, the data was not used at all. This behaviour is actually correct because Draft P802.11REVmf_D1.0 states in section 35.3.6.4 that we "shall operate with all the TIDs mapped to the newly added links ..." Fixes: 6d543b34dbcf ("wifi: mac80211: Support disabled links during association") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260118093904.43c861424543.I067f702ac46b84ac3f8b4ea16fb0db9cbbfae7e2@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-20wifi: mac80211: parse all TTLM entriesBenjamin Berg
For the follow up patch, we need to properly parse TTLM entries that do not have a switch time. Change the logic so that ieee80211_parse_adv_t2l returns usable values in all non-error cases. Before the values filled in were technically incorrect but enough for ieee80211_process_adv_ttlm. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260118093904.ccd324e2dd59.I69f0bee0a22e9b11bb95beef313e305dab17c051@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-20wifi: mac80211: don't increment crypto_tx_tailroom_needed_cnt twiceMiri Korenblit
In reconfig, in case the driver asks to disconnect during the reconfig, all the keys of the interface are marked as tainted. Then ieee80211_reenable_keys will loop over all the interface keys, and for each one it will a) increment crypto_tx_tailroom_needed_cnt b) call ieee80211_key_enable_hw_accel, which in turn will detect that this key is tainted, so it will mark it as "not in hardware", which is paired with crypto_tx_tailroom_needed_cnt incrementation, so we get two incrementations for each tainted key. Then we get a warning in ieee80211_free_keys. To fix it, don't increment the count in ieee80211_reenable_keys for tainted keys Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260118092821.4ca111fddcda.Id6e554f4b1c83760aa02d5a9e4e3080edb197aa2@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-20wifi: mac80211: don't perform DA check on S1G beaconLachlan Hodges
S1G beacons don't contain the DA field as per IEEE80211-2024 9.3.4.3, so the DA broadcast check reads the SA address of the S1G beacon which will subsequently lead to the beacon being dropped. As a result, passive scanning is not possible. Fix this by only performing the check on non-S1G beacons to allow S1G long beacons to be processed during a passive scan. Fixes: ddf82e752f8a ("wifi: mac80211: Allow beacons to update BSS table regardless of scan") Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20260120031122.309942-1-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-19net/sched: qfq: Use cl_is_active to determine whether class is active in ↵Jamal Hadi Salim
qfq_rm_from_ag This is more of a preventive patch to make the code more consistent and to prevent possible exploits that employ child qlen manipulations on qfq. use cl_is_active instead of relying on the child qdisc's qlen to determine class activation. Fixes: 462dbc9101acd ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260114160243.913069-3-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19net/sched: Enforce that teql can only be used as root qdiscJamal Hadi Salim
Design intent of teql is that it is only supposed to be used as root qdisc. We need to check for that constraint. Although not important, I will describe the scenario that unearthed this issue for the curious. GangMin Kim <km.kim1503@gmail.com> managed to concot a scenario as follows: ROOT qdisc 1:0 (QFQ) ├── class 1:1 (weight=15, lmax=16384) netem with delay 6.4s └── class 1:2 (weight=1, lmax=1514) teql GangMin sends a packet which is enqueued to 1:1 (netem). Any invocation of dequeue by QFQ from this class will not return a packet until after 6.4s. In the meantime, a second packet is sent and it lands on 1:2. teql's enqueue will return success and this will activate class 1:2. Main issue is that teql only updates the parent visible qlen (sch->q.qlen) at dequeue. Since QFQ will only call dequeue if peek succeeds (and teql's peek always returns NULL), dequeue will never be called and thus the qlen will remain as 0. With that in mind, when GangMin updates 1:2's lmax value, the qfq_change_class calls qfq_deact_rm_from_agg. Since the child qdisc's qlen was not incremented, qfq fails to deactivate the class, but still frees its pointers from the aggregate. So when the first packet is rescheduled after 6.4 seconds (netem's delay), a dangling pointer is accessed causing GangMin's causing a UAF. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: GangMin Kim <km.kim1503@gmail.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260114160243.913069-2-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19rxrpc: Fix recvmsg() unconditional requeueDavid Howells
If rxrpc_recvmsg() fails because MSG_DONTWAIT was specified but the call at the front of the recvmsg queue already has its mutex locked, it requeues the call - whether or not the call is already queued. The call may be on the queue because MSG_PEEK was also passed and so the call was not dequeued or because the I/O thread requeued it. The unconditional requeue may then corrupt the recvmsg queue, leading to things like UAFs or refcount underruns. Fix this by only requeuing the call if it isn't already on the queue - and moving it to the front if it is already queued. If we don't queue it, we have to put the ref we obtained by dequeuing it. Also, MSG_PEEK doesn't dequeue the call so shouldn't call rxrpc_notify_socket() for the call if we didn't use up all the data on the queue, so fix that also. Fixes: 540b1c48c37a ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg") Reported-by: Faith <faith@zellic.io> Reported-by: Pumpkin Chang <pumpkin@devco.re> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Dionne <marc.dionne@auristor.com> cc: Nir Ohfeld <niro@wiz.io> cc: Willy Tarreau <w@1wt.eu> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/95163.1768428203@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19ipv6: annotate data-races in net/ipv6/route.cEric Dumazet
sysctls are read while their values can change, add READ_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19ipv6: exthdrs: annotate data-race over multiple sysctlEric Dumazet
Following four sysctls can change under us, add missing READ_ONCE(). - ipv6.sysctl.max_dst_opts_len - ipv6.sysctl.max_dst_opts_cnt - ipv6.sysctl.max_hbh_opts_len - ipv6.sysctl.max_hbh_opts_cnt Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19ipv6: annotate data-races around sysctl.ip6_rt_gc_intervalEric Dumazet
Add READ_ONCE() on lockless reads of net->ipv6.sysctl.ip6_rt_gc_interval Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19ipv6: annotate data-races over sysctl.flowlabel_reflectEric Dumazet
Add missing READ_ONCE() when reading ipv6.sysctl.flowlabel_reflect, as its value can be changed under us. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19l2tp: avoid one data-race in l2tp_tunnel_del_work()Eric Dumazet
We should read sk->sk_socket only when dealing with kernel sockets. syzbot reported the following data-race: BUG: KCSAN: data-race in l2tp_tunnel_del_work / sk_common_release write to 0xffff88811c182b20 of 8 bytes by task 5365 on cpu 0: sk_set_socket include/net/sock.h:2092 [inline] sock_orphan include/net/sock.h:2118 [inline] sk_common_release+0xae/0x230 net/core/sock.c:4003 udp_lib_close+0x15/0x20 include/net/udp.h:325 inet_release+0xce/0xf0 net/ipv4/af_inet.c:437 __sock_release net/socket.c:662 [inline] sock_close+0x6b/0x150 net/socket.c:1455 __fput+0x29b/0x650 fs/file_table.c:468 ____fput+0x1c/0x30 fs/file_table.c:496 task_work_run+0x131/0x1a0 kernel/task_work.c:233 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] __exit_to_user_mode_loop kernel/entry/common.c:44 [inline] exit_to_user_mode_loop+0x1fe/0x740 kernel/entry/common.c:75 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline] syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline] do_syscall_64+0x1e1/0x2b0 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff88811c182b20 of 8 bytes by task 827 on cpu 1: l2tp_tunnel_del_work+0x2f/0x1a0 net/l2tp/l2tp_core.c:1418 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0x4ce/0x9d0 kernel/workqueue.c:3340 worker_thread+0x582/0x770 kernel/workqueue.c:3421 kthread+0x489/0x510 kernel/kthread.c:463 ret_from_fork+0x149/0x290 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 value changed: 0xffff88811b818000 -> 0x0000000000000000 Fixes: d00fa9adc528 ("l2tp: fix races with tunnel socket close") Reported-by: syzbot+7312e82745f7fa2526db@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6968b029.050a0220.58bed.0016.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: James Chapman <jchapman@katalix.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20260115092139.3066180-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-19wifi: mac80211: mark iface work SKBs as consumedJohannes Berg
Using kfree_skb() here is misleading when looking at traces, since these frames have been handled. Use consume_skb() instead. Link: https://patch.msgid.link/20260116092115.1db534bdc12c.Ic0adae06684a6871144398d15cf7700c57620baa@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-19wifi: mac80211: remove RX_DROPJohannes Berg
Since it's hard to figure out what RX_DROP means when looking at traces that drop packets in mac80211, add more specific drop reasons and remove RX_DROP entirely. Link: https://patch.msgid.link/20260116092025.79d995e87026.I7cde413988f7a382c551cd1c1e2b05a52ec71755@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-19wifi: nl80211: ignore cluster id after NAN startedMiri Korenblit
After NAN was started, cluster id updates from the user space should not happen, since the device already started a cluster with the previousely provided id. Since NL80211_CMD_CHANGE_NAN_CONFIG requires to set the full NAN configuration, we can't require that NL80211_NAN_CONF_CLUSTER_ID won't be included in this command, and keeping the last confgiured value just to be able to compare it against the new one seems a bit overkill. Therefore, just ignore cluster id in this command and clarify the documentation. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260107142229.fb55e5853269.I10d18c8f69d98b28916596d6da4207c15ea4abb5@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-19wifi: cfg80211: cleanup cluster_id when stopping NANMiri Korenblit
When NAN is stopped, cluster_id should be set to 0 to indicate that we are not part of any cluster. Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260107142229.9ccb700797ec.I890ac852be6ca0093995655d987ca5c28a26ce3d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-19wifi: cfg80211: limit NAN func management APIs to offloaded DEMiri Korenblit
A driver that declared that it has userspace DE should not call NAN func related APIs such as cfg80211_nan_match and cfg80211_nan_func_terminated Check and warn in such a case, as this indicates a driver bug. Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260107141549.86fa96c75211.I8fbb0506377170dd7b41234f20bcba057951dd1e@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-19wifi: cfg80211: stop NAN and P2P in cfg80211_leaveMiri Korenblit
Seems that there is an assumption that this function should be called only for netdev interfaces, but it can also be called in suspend, or from nl80211_netlink_notify (indirectly). Note that the documentation of NL80211_ATTR_SOCKET_OWNER explicitly says that NAN interfaces would be destroyed as well in the nl80211_netlink_notify case. Fix this by also stopping P2P and NAN. Fixes: cb3b7d87652a ("cfg80211: add start / stop NAN commands") Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260107140430.dab142cbef0b.I290cc47836d56dd7e35012ce06bec36c6da688cd@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-19wifi: cfg80211: allow only one NAN interface, also in multi radioMiri Korenblit
According to Wi-Fi Aware (TM) 4.0 specification 2.8, A NAN device can have one NAN management interface. This applies also to multi radio devices. The current code allows a driver to support more than one NAN interface, if those are not in the same radio. Fix it. Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20260107135129.fdaecec0fe8a.I246b5ba6e9da3ec1481ff197e47f6ce0793d7118@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2026-01-17fou: Don't allow 0 for FOU_ATTR_IPPROTO.Kuniyuki Iwashima
fou_udp_recv() has the same problem mentioned in the previous patch. If FOU_ATTR_IPPROTO is set to 0, skb is not freed by fou_udp_recv() nor "resubmit"-ted in ip_protocol_deliver_rcu(). Let's forbid 0 for FOU_ATTR_IPPROTO. Fixes: 23461551c0062 ("fou: Support for foo-over-udp RX path") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260115172533.693652-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-17gue: Fix skb memleak with inner IP protocol 0.Kuniyuki Iwashima
syzbot reported skb memleak below. [0] The repro generated a GUE packet with its inner protocol 0. gue_udp_recv() returns -guehdr->proto_ctype for "resubmit" in ip_protocol_deliver_rcu(), but this only works with non-zero protocol number. Let's drop such packets. Note that 0 is a valid number (IPv6 Hop-by-Hop Option). I think it is not practical to encap HOPOPT in GUE, so once someone starts to complain, we could pass down a resubmit flag pointer to distinguish two zeros from the upper layer: * no error * resubmit HOPOPT [0] BUG: memory leak unreferenced object 0xffff888109695a00 (size 240): comm "syz.0.17", pid 6088, jiffies 4294943096 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 40 c2 10 81 88 ff ff 00 00 00 00 00 00 00 00 .@.............. backtrace (crc a84b336f): kmemleak_alloc_recursive include/linux/kmemleak.h:44 [inline] slab_post_alloc_hook mm/slub.c:4958 [inline] slab_alloc_node mm/slub.c:5263 [inline] kmem_cache_alloc_noprof+0x3b4/0x590 mm/slub.c:5270 __build_skb+0x23/0x60 net/core/skbuff.c:474 build_skb+0x20/0x190 net/core/skbuff.c:490 __tun_build_skb drivers/net/tun.c:1541 [inline] tun_build_skb+0x4a1/0xa40 drivers/net/tun.c:1636 tun_get_user+0xc12/0x2030 drivers/net/tun.c:1770 tun_chr_write_iter+0x71/0x120 drivers/net/tun.c:1999 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x45d/0x710 fs/read_write.c:686 ksys_write+0xa7/0x170 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xa4/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 37dd0247797b1 ("gue: Receive side for Generic UDP Encapsulation") Reported-by: syzbot+4d8c7d16b0e95c0d0f0d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6965534b.050a0220.38aacd.0001.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260115172533.693652-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-17tcp: move tcp_rate_skb_sent() to tcp_output.cEric Dumazet
It is only called from __tcp_transmit_skb() and __tcp_retransmit_skb(). Move it in tcp_output.c and make it static. clang compiler is now able to inline it from __tcp_transmit_skb(). gcc compiler inlines it in the two callers, which is also fine. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260114165109.1747722-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-17net/sched: cake: avoid separate allocation of struct cake_sched_configToke Høiland-Jørgensen
Paolo pointed out that we can avoid separately allocating struct cake_sched_config even in the non-mq case, by embedding it into struct cake_sched_data. This reduces the complexity of the logic that swaps the pointers and frees the old value, at the cost of adding 56 bytes to the latter. Since cake_sched_data is already almost 17k bytes, this seems like a reasonable tradeoff. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Fixes: bc0ce2bad36c ("net/sched: sch_cake: Factor out config variables into separate struct") Link: https://patch.msgid.link/20260113143157.2581680-1-toke@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-17sctp: move SCTP_CMD_ASSOC_SHKEY right after SCTP_CMD_PEER_INITXin Long
A null-ptr-deref was reported in the SCTP transmit path when SCTP-AUTH key initialization fails: ================================================================== KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] CPU: 0 PID: 16 Comm: ksoftirqd/0 Tainted: G W 6.6.0 #2 RIP: 0010:sctp_packet_bundle_auth net/sctp/output.c:264 [inline] RIP: 0010:sctp_packet_append_chunk+0xb36/0x1260 net/sctp/output.c:401 Call Trace: sctp_packet_transmit_chunk+0x31/0x250 net/sctp/output.c:189 sctp_outq_flush_data+0xa29/0x26d0 net/sctp/outqueue.c:1111 sctp_outq_flush+0xc80/0x1240 net/sctp/outqueue.c:1217 sctp_cmd_interpreter.isra.0+0x19a5/0x62c0 net/sctp/sm_sideeffect.c:1787 sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline] sctp_do_sm+0x1a3/0x670 net/sctp/sm_sideeffect.c:1169 sctp_assoc_bh_rcv+0x33e/0x640 net/sctp/associola.c:1052 sctp_inq_push+0x1dd/0x280 net/sctp/inqueue.c:88 sctp_rcv+0x11ae/0x3100 net/sctp/input.c:243 sctp6_rcv+0x3d/0x60 net/sctp/ipv6.c:1127 The issue is triggered when sctp_auth_asoc_init_active_key() fails in sctp_sf_do_5_1C_ack() while processing an INIT_ACK. In this case, the command sequence is currently: - SCTP_CMD_PEER_INIT - SCTP_CMD_TIMER_STOP (T1_INIT) - SCTP_CMD_TIMER_START (T1_COOKIE) - SCTP_CMD_NEW_STATE (COOKIE_ECHOED) - SCTP_CMD_ASSOC_SHKEY - SCTP_CMD_GEN_COOKIE_ECHO If SCTP_CMD_ASSOC_SHKEY fails, asoc->shkey remains NULL, while asoc->peer.auth_capable and asoc->peer.peer_chunks have already been set by SCTP_CMD_PEER_INIT. This allows a DATA chunk with auth = 1 and shkey = NULL to be queued by sctp_datamsg_from_user(). Since command interpretation stops on failure, no COOKIE_ECHO should been sent via SCTP_CMD_GEN_COOKIE_ECHO. However, the T1_COOKIE timer has already been started, and it may enqueue a COOKIE_ECHO into the outqueue later. As a result, the DATA chunk can be transmitted together with the COOKIE_ECHO in sctp_outq_flush_data(), leading to the observed issue. Similar to the other places where it calls sctp_auth_asoc_init_active_key() right after sctp_process_init(), this patch moves the SCTP_CMD_ASSOC_SHKEY immediately after SCTP_CMD_PEER_INIT, before stopping T1_INIT and starting T1_COOKIE. This ensures that if shared key generation fails, authenticated DATA cannot be sent. It also allows the T1_INIT timer to retransmit INIT, giving the client another chance to process INIT_ACK and retry key setup. Fixes: 730fc3d05cd4 ("[SCTP]: Implete SCTP-AUTH parameter processing") Reported-by: Zhen Chen <chenzhen126@huawei.com> Tested-by: Zhen Chen <chenzhen126@huawei.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/44881224b375aa8853f5e19b4055a1a56d895813.1768324226.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>