| Age | Commit message (Collapse) | Author |
|
Before this patch, retransmitted SYN/ACK did not have a specific
synack_type; however, the upcoming patch needs to distinguish between
retransmitted and non-retransmitted SYN/ACK for AccECN negotiation to
transmit the fallback SYN/ACK during AccECN negotiation. Therefore, this
patch introduces a new synack_type (TCP_SYNACK_RETRANS).
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-9-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Based on AccECN spec (RFC9768) Section 3.1.4.1, if the sender of an
AccECN SYN (the TCP Client) times out before receiving the SYN/ACK, it
SHOULD attempt to negotiate the use of AccECN at least one more time
by continuing to set all three TCP ECN flags (AE,CWR,ECE) = (1,1,1) on
the first retransmitted SYN (using the usual retransmission time-outs).
If this first retransmission also fails to be acknowledged, in
deployment scenarios where AccECN path traversal might be problematic,
the TCP Client SHOULD send subsequent retransmissions of the SYN with
the three TCP-ECN flags cleared (AE,CWR,ECE) = (0,0,0).
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-8-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When AccECN is not successfully negociated for a TCP flow, it defaults
fallback to classic ECN (RFC3168). However, L4S service will fallback
to non-ECN.
This patch enables congestion control module to control whether it
should not fallback to classic ECN after unsuccessful AccECN negotiation.
A new CA module flag (TCP_CONG_NO_FALLBACK_RFC3168) identifies this
behavior expected by the CA.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-6-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Two flags for congestion control (CC) module are added in this patch
related to AccECN negotiation. First, a new flag (TCP_CONG_NEEDS_ACCECN)
defines that the CC expects to negotiate AccECN functionality using the
ECE, CWR and AE flags in the TCP header.
Second, during ECN negotiation, ECT(0) in the IP header is used. This
patch enables CC to control whether ECT(0) or ECT(1) should be used on
a per-segment basis. A new flag (TCP_CONG_ECT_1_NEGOTIATION) defines the
expected ECT value in the IP header by the CA when not-yet initialized
for the connection.
The detailed AccECN negotiaotn can be found in IETF RFC9768.
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-5-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
As AccECN may keep CWR bit asserted due to different
interpretation of the bit, flushing with GRO because of
CWR may effectively disable GRO until AccECN counter
field changes such that CWR-bit becomes 0.
There is no harm done from not immediately forwarding the
CWR'ed segment with RFC3168 ECN.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-3-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add newly acked pkts EWMA. When ACK thinning occurs, select
between safer and unsafe cep delta in AccECN processing based
on it. If the packets ACKed per ACK tends to be large, don't
conservatively assume ACE field overflow.
This patch uses the existing 2-byte holes in the rx group for new
u16 variables withtout creating more holes. Below are the pahole
outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ecn_bytes[3]; /* 2744 12 */
/* XXX 4 bytes hole, try to pack */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2816 0 */
[...]
/* size: 3264, cachelines: 51, members: 177 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ecn_bytes[3]; /* 2744 12 */
u16 pkts_acked_ewma; /* 2756 2 */
/* XXX 2 bytes hole, try to pack */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2816 0 */
[...]
/* size: 3264, cachelines: 51, members: 178 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-2-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Required by DCB/QoS support of the switch driver, since the rx packets
will have non-zero priorities.
Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260131021854.3405036-3-mmyangfl@gmail.com
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Packet priority is part of the tag, and the priority and code fields are
used by tx and rx. Make revisions to reflect the facts.
Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260131021854.3405036-2-mmyangfl@gmail.com
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
devlink_rate_nodes_check() was used to verify there are no devlink rate
nodes created when switching the esw mode.
Rate management code is about to become more complex, so refactor this
function:
- remove unused param 'mode'.
- add a new 'rate_filter' param.
- rename to devlink_rates_check().
- expose devlink_rate_is_node() to be used as a rate filter.
This makes it more usable from multiple places, so use it from those
places as well.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260128112544.1661250-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit [1] defined the locking expectations for nested devlink
instances: the nested-in devlink instance lock needs to be acquired
before the nested devlink instance lock. The code handling devlink rels
was architected with that assumption in mind.
There are no actual users of double locking yet but that is about to
change in the upcoming patches in the series.
Code operating on nested devlink instances will require also obtaining
the nested-in instance lock, but such code may already be called from a
variety of places with the nested devlink instance lock. Then, there's
no way to acquire the nested-in lock other than making sure that all
callers acquire it first.
Reversing the nested lock order allows incrementally acquiring the
nested-in instance lock when needed (perhaps even a chain of locks up to
the root) without affecting any caller.
The only affected use of nesting is devlink_nl_nested_fill(), which
iterates over nested devlink instances with the RCU lock, without
locking them, so there's no possibility of deadlock.
So this commit just updates a comment regarding the nested locks.
[1] commit c137743bce02b ("devlink: introduce object and nested devlink
relationship infra")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260128112544.1661250-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch implements .splice_read interface of mptcp struct proto_ops
as mptcp_splice_read() with reference to tcp_splice_read().
Corresponding to __tcp_splice_read(), __mptcp_splice_read() is defined,
invoking mptcp_read_sock() instead of tcp_read_sock().
mptcp_splice_read() is almost the same as tcp_splice_read(), except for
sock_rps_record_flow().
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-4-31332ba70d7f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h
so that they can be used by MPTCP in the next patch.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-3-31332ba70d7f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Current in-kernel TCP sockets -- i.e. from nvme_tcp_try_recv() -- need
to call .read_sock interface of struct proto_ops, but it's not
implemented in MPTCP.
This patch implements it with reference to __tcp_read_sock() and
__mptcp_recvmsg_mskq().
Corresponding to tcp_recv_skb(), a new helper for MPTCP named
mptcp_recv_skb() is added to peek a skb from sk->sk_receive_queue.
Compared with __mptcp_recvmsg_mskq(), mptcp_read_sock() uses
sk->sk_rcvbuf as the max read length. The LISTEN status is checked
before the while loop, and mptcp_recv_skb() and mptcp_cleanup_rbuf()
are invoked after the loop. In the loop, all flags checks for
__mptcp_recvmsg_mskq() are removed.
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-2-31332ba70d7f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch extracts the free skb related code in __mptcp_recvmsg_mskq()
into a new helper mptcp_eat_recv_skb().
This new helper will be used in the next patch.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-1-31332ba70d7f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
All inet6_cork users also use one inet_cork_full.
Reduce number of parameters and increase data locality.
This saves ~275 bytes of code on x86_64.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When we expect an IPv4 dst, use dst4_mtu() instead of dst_mtu()
to save some code space.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When we expect an IPv6 dst, use dst6_mtu() instead of dst_mtu()
to save some code space.
Due to current dst6_mtu() implementation, only convert
users in IPv6 stack.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a too big packet is dropped, use SKB_DROP_REASON_PKT_TOO_BIG.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ip6_xmit() makes sure there is enough headroom in the skb,
it can uses __skb_push() instead of the out-of-line skb_push().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
1) daddr is unlikely a multicast in ip6_finish_output2().
2) ip6_finish_output_gso_slowpath_drop() should not be called often.
3) ip6_fragment() should not be called often.
4) opt is unlikely to be set.
5) ip6_xmit() and ip6_forward() mostly sends not too big packets.
6) Most __ip6_make_skb() calls are for UDP packets,
not ICMPV6 ones.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With CONFIG_STACKPROTECTOR_STRONG=y, it is better to avoid passing
a pointer to an automatic variable.
Change these exported functions to return 'u8 proto'
instead of void.
- ipv6_push_nfrag_opts()
- ipv6_push_frag_opts()
For instance, replace
ipv6_push_frag_opts(skb, opt, &proto);
with:
proto = ipv6_push_frag_opts(skb, opt, proto);
Note that even after this change, ip6_xmit() has to use a stack canary
because of @first_hop variable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The rx->skey field contains a struct tipc_aead_key with GCM-AES
encryption keys used for TIPC cluster communication. Using plain
kfree() leaves this sensitive key material in freed memory pages
where it could potentially be recovered.
Switch to kfree_sensitive() to ensure the key material is zeroed
before the memory is freed.
Fixes: 1ef6f7c9390f ("tipc: add automatic session key exchange")
Signed-off-by: Daniel Hodges <hodgesd@meta.com>
Link: https://patch.msgid.link/20260131180114.2121438-1-hodgesd@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Initializing input_xfrm to RXH_XFRM_NO_CHANGE in RSS contexts is
problematic. I think I did this to make it clear that the context
does not have its own settings applied. But unlike ETH_RSS_HASH_NO_CHANGE
which is zero, RXH_XFRM_NO_CHANGE is 0xff. We need to be careful
when reading the value back, and remember to treat 0xff as 0.
Remove the initialization and switch to storing 0. This lets us
also remove the workaround in ethnl_rss_set(). Get side does not
need any adjustments and context get no longer reports:
RSS input transformation:
symmetric-xor: on
symmetric-or-xor: on
Unknown bits in RSS input transformation: 0xfc
for NICs which don't support input_xfrm.
Remove the init of hfunc to ETH_RSS_HASH_NO_CHANGE while at it.
As already mentioned this is a noop since ETH_RSS_HASH_NO_CHANGE
is 0 and struct is zalloc'd. But as this fix exemplifies storing
NO_CHANGE as state is fragile.
This issue is implicitly caught by running our selftests because
YNL in selftests errors out on unknown bits.
Fixes: d3e2c7bab124 ("ethtool: rss: support setting input-xfrm via Netlink")
Link: https://patch.msgid.link/20260130190311.811129-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After linkwatch_do_dev() calls __dev_put() to release the linkwatch
reference, the device refcount may drop to 1. At this point,
netdev_run_todo() can proceed (since linkwatch_sync_dev() sees an
empty list and returns without blocking), wait for the refcount to
become 1 via netdev_wait_allrefs_any(), and then free the device
via kobject_put().
This creates a use-after-free when __linkwatch_run_queue() tries to
call netdev_unlock_ops() on the already-freed device.
Note that adding netdev_lock_ops()/netdev_unlock_ops() pair in
netdev_run_todo() before kobject_put() would not work, because
netdev_lock_ops() is conditional - it only locks when
netdev_need_ops_lock() returns true. If the device doesn't require
ops_lock, linkwatch won't hold any lock, and netdev_run_todo()
acquiring the lock won't provide synchronization.
Fix this by moving __dev_put() from linkwatch_do_dev() to its
callers. The device reference logically pairs with de-listing the
device, so it's reasonable for the caller that did the de-listing
to release it. This allows placing __dev_put() after all device
accesses are complete, preventing UAF.
The bug can be reproduced by adding mdelay(2000) after
linkwatch_do_dev() in __linkwatch_run_queue(), then running:
ip tuntap add mode tun name tun_test
ip link set tun_test up
ip link set tun_test carrier off
ip link set tun_test carrier on
sleep 0.5
ip tuntap del mode tun name tun_test
KASAN report:
==================================================================
BUG: KASAN: use-after-free in netdev_need_ops_lock include/net/netdev_lock.h:33 [inline]
BUG: KASAN: use-after-free in netdev_unlock_ops include/net/netdev_lock.h:47 [inline]
BUG: KASAN: use-after-free in __linkwatch_run_queue+0x865/0x8a0 net/core/link_watch.c:245
Read of size 8 at addr ffff88804de5c008 by task kworker/u32:10/8123
CPU: 0 UID: 0 PID: 8123 Comm: kworker/u32:10 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: events_unbound linkwatch_event
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x156/0x4c9 mm/kasan/report.c:482
kasan_report+0xdf/0x1a0 mm/kasan/report.c:595
netdev_need_ops_lock include/net/netdev_lock.h:33 [inline]
netdev_unlock_ops include/net/netdev_lock.h:47 [inline]
__linkwatch_run_queue+0x865/0x8a0 net/core/link_watch.c:245
linkwatch_event+0x8f/0xc0 net/core/link_watch.c:304
process_one_work+0x9c2/0x1840 kernel/workqueue.c:3257
process_scheduled_works kernel/workqueue.c:3340 [inline]
worker_thread+0x5da/0xe40 kernel/workqueue.c:3421
kthread+0x3b3/0x730 kernel/kthread.c:463
ret_from_fork+0x754/0xaf0 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
</TASK>
==================================================================
Fixes: 04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE")
Reported-by: syzbot+1ec2f6a450f0b54af8c8@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6824d064.a70a0220.3e9d8.001a.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260201135915.393451-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce support in AP mode for parsing of the Operating Mode Notification
frame sent by the client to enable/disable MLO eMLSR or eMLMR if supported
by both the AP and the client.
Add drv_set_eml_op_mode mac80211 callback in order to configure underlay
driver with eMLSR/eMLMR info.
Tested-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260129-mac80211-emlsr-v4-1-14bdadf57380@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add support for making UHR connections and accepting AP
stations with UHR support.
Link: https://patch.msgid.link/20260130164259.7185980484eb.Ieec940b58dbf8115dab7e1e24cb5513f52c8cb2f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add initial support for making UHR connections (or suppressing
that), adding UHR capable stations on the AP side, encoding
and decoding UHR MCSes (except rate calculation for the new
MCSes 17, 19, 20 and 23) as well as regulatory support.
Link: https://patch.msgid.link/20260130164259.54cc12fbb307.I26126bebd83c7ab17e99827489f946ceabb3521f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Prior initiating communication in a DFS channel, there should be a
monitoring of RADAR in that channel for a minimum of 600 seconds if it
is a Weather RADAR channel and 60 seconds for other DFS channels. This
Channel Availability Check(CAC) is currently implemented by scheduling
a work item for execution with a delay equal to an appropriate timeout.
But this work item is observed to take more delay than specified
(4-5 seconds in regular DFS channels and 25-30 seconds in Weather RADAR
channels). Even though this delay is expected in case of delayed work
queue as there is no guarantee that the work will be scheduled exactly
after the specified delay, a delay of more than 20 seconds is too much
for the AP to be in non-operational state.
Recently commit 7ceba45a6658 ("wifi: cfg80211: add an hrtimer based
delayed work item") added an infrastructure to overcome this issue by
supporting high resolution timers for mac80211 delayed work, which do not
have this timeout latency. Switch the CAC timeout to use this
infrastructure, so the CAC completion handling is triggered with tighter
timing and reduced latency.
Signed-off-by: Amith A <amith.a@oss.qualcomm.com>
Link: https://patch.msgid.link/20260130035511.2328713-1-amith.a@oss.qualcomm.com
[fix delay handling]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
In packet offload, hardware is responsible to check templates. The
result of its operation is forwarded through secpath by relevant
drivers. That secpath is actually removed in __xfrm_policy_check2().
In case packet is forwarded, this secpath is reset in RX, but pushed
again to TX where policy is rechecked again against dummy secpath
in xfrm_policy_ok().
Such situation causes to unexpected XfrmInTmplMismatch increase.
As a solution, simply skip template mismatch check.
Fixes: 600258d555f0 ("xfrm: delete intermediate secpath entry in packet offload mode")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
xfrm6_get_saddr() does not check the return value of
ipv6_dev_get_saddr(). When ipv6_dev_get_saddr() fails to find a suitable
source address (returns -EADDRNOTAVAIL), saddr->in6 is left
uninitialized, but xfrm6_get_saddr() still returns 0 (success).
This causes the caller xfrm_tmpl_resolve_one() to use the uninitialized
address in xfrm_state_find(), triggering KMSAN warning:
=====================================================
BUG: KMSAN: uninit-value in xfrm_state_find+0x2424/0xa940
xfrm_state_find+0x2424/0xa940
xfrm_resolve_and_create_bundle+0x906/0x5a20
xfrm_lookup_with_ifid+0xcc0/0x3770
xfrm_lookup_route+0x63/0x2b0
ip_route_output_flow+0x1ce/0x270
udp_sendmsg+0x2ce1/0x3400
inet_sendmsg+0x1ef/0x2a0
__sock_sendmsg+0x278/0x3d0
__sys_sendto+0x593/0x720
__x64_sys_sendto+0x130/0x200
x64_sys_call+0x332b/0x3e70
do_syscall_64+0xd3/0xf80
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Local variable tmp.i.i created at:
xfrm_resolve_and_create_bundle+0x3e3/0x5a20
xfrm_lookup_with_ifid+0xcc0/0x3770
=====================================================
Fix by checking the return value of ipv6_dev_get_saddr() and propagating
the error.
Fixes: a1e59abf8249 ("[XFRM]: Fix wildcard as tunnel source")
Reported-by: syzbot+e136d86d34b42399a8b1@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68bf1024.a70a0220.7a912.02c2.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
While making some maps in Cilium read-only from the BPF side, we noticed
that the bpf_xdp_store_bytes proto is incorrect. In particular, the
verifier was throwing the following error:
; ret = ctx_store_bytes(ctx, l3_off + offsetof(struct iphdr, saddr),
&nat->address, 4, 0);
635: (79) r1 = *(u64 *)(r10 -144) ; R1=ctx() R10=fp0 fp-144=ctx()
636: (b4) w2 = 26 ; R2=26
637: (b4) w4 = 4 ; R4=4
638: (b4) w5 = 0 ; R5=0
639: (85) call bpf_xdp_store_bytes#190
write into map forbidden, value_size=6 off=0 size=4
nat comes from a BPF_F_RDONLY_PROG map, so R3 is a PTR_TO_MAP_VALUE.
The verifier checks the helper's memory access to R3 in
check_mem_size_reg, as it reaches ARG_CONST_SIZE argument. The third
argument has expected type ARG_PTR_TO_UNINIT_MEM, which includes the
MEM_WRITE flag. The verifier thus checks for a BPF_WRITE access on R3.
Given R3 points to a read-only map, the check fails.
Conversely, ARG_PTR_TO_UNINIT_MEM can also lead to the helper reading
from uninitialized memory.
This patch simply fixes the expected argument type to match that of
bpf_skb_store_bytes.
Fixes: 3f364222d032 ("net: xdp: introduce bpf_xdp_pointer utility routine")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/9fa3c9f72d806e82541071c4df88b8cba28ad6a9.1769875479.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Gal reports that BPF redirect increments dev->stats.tx_errors
on failure. This is not correct, most modern drivers completely
ignore dev->stats so these drops will be invisible to the user.
Core code should use the dedicated core stats which are folded
into device stats in dev_get_stats().
Note that we're switching from tx_errors to tx_dropped.
Core only has tx_dropped, hence presumably users already expect
that counter to increment for "stack" Tx issues.
Reported-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/c5df3b60-246a-4030-9c9a-0a35cd1ca924@nvidia.com
Fixes: b4ab31414970 ("bpf: Add redirect_neigh helper as redirect drop-in")
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260130033827.698841-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This reverts commit d7cd421da9da2cc7b4d25b8537f66db5c8331c40.
As reported by Al Viro, the TCP ULP support for SMC is fundamentally
broken. The implementation attempts to convert an active TCP socket
into an SMC socket by modifying the underlying `struct file`, dentry,
and inode in-place, which violates core VFS invariants that assume
these structures are immutable for an open file, creating a risk of
use after free errors and general system instability.
Given the severity of this design flaw and the fact that cleaner
alternatives (e.g., LD_PRELOAD, BPF) exist for legacy application
transparency, the correct course of action is to remove this feature
entirely.
Fixes: d7cd421da9da ("net/smc: Introduce TCP ULP support")
Link: https://lore.kernel.org/netdev/Yus1SycZxcd+wHwz@ZenIV/
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260128055452.98251-1-alibuda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The AX25_DAMA_MASTER option has been unimplemented and marked broken
ever since it was introduced in 2007 in commit 954b2e7f4c37 ("[NET]
AX.25 Kconfig and docs updates and fixes"). At this point, it is very
unlikely it will be implemented. Remove it.
Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://patch.msgid.link/20260129080908.44710-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This commit adds SNMP drop count increment for the packets in
per NUMA queues which were introduced in commit b650bf0977d3
("udp: remove busylock and add per NUMA queues"). note that SNMP
counters are incremented currently by the caller for skb. And
that these skbs on the intermediate queue cannot be counted
there so need similar logic in their error path.
Signed-off-by: Mahdi Faramarzpour <mahdifrmx@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260129083806.204752-1-mahdifrmx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
By default, when a kmem_cache is created with SLAB_TYPESAFE_BY_RCU,
slub has to use extra storage for the freelist pointer after each
object, because slub assumes that any bit in the object
can be used by RCU readers.
Because proto_register() is also using SLAB_HWCACHE_ALIGN,
this forces slub to use one extra cache line per object.
We can instead put the slub freelist anywhere in the object,
granted the concurrent RCU readers are not supposed to
use the pointer value.
Add a new (struct sock)sk_freeptr field, in an union
with sk_rcu: No RCU readers would need to look at sk_rcu,
which is only used at free phase.
Tested:
grep . /sys/kernel/slab/TCP/{object_size,slab_size,objs_per_slab}
grep . /sys/kernel/slab/TCPv6/{object_size,slab_size,objs_per_slab}
Before:
/sys/kernel/slab/TCP/object_size:2368
/sys/kernel/slab/TCP/slab_size:2432
/sys/kernel/slab/TCP/objs_per_slab:13
/sys/kernel/slab/TCPv6/object_size:2496
/sys/kernel/slab/TCPv6/slab_size:2560
/sys/kernel/slab/TCPv6/objs_per_slab:12
After this patch, we can pack one more TCPv6 object per slab,
and object_size == slab_size.
/sys/kernel/slab/TCP/object_size:2368
/sys/kernel/slab/TCP/slab_size:2368
/sys/kernel/slab/TCP/objs_per_slab:13
/sys/kernel/slab/TCPv6/object_size:2496
/sys/kernel/slab/TCPv6/slab_size:2496
/sys/kernel/slab/TCPv6/objs_per_slab:13
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260129153458.4163797-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Another fairly large set of changes, notably:
- cfg80211/mac80211
- most of EPPKE/802.1X over auth frames support
- additional FTM capabilities
- split up drop reasons better, removing generic RX_DROP
- NAN cleanups/fixes
- ath11k:
- support for Channel Frequency Response measurement
- ath12k:
- support for the QCC2072 chipset
- iwlwifi:
- partial NAN support
- UNII-9 support
- some UHR/802.11bn FW APIs
- remove most of MLO/EHT from iwlmvm
(such devices use iwlmld)
- rtw89:
- preparations for RTL8922DE support
* tag 'wireless-next-2026-01-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (184 commits)
wifi: iwlegacy: add missing mutex protection in il4965_store_tx_power()
wifi: iwlegacy: add missing mutex protection in il3945_store_measurement()
wifi: mac80211: use u64_stats_t with u64_stats_sync properly
wifi: p54: Fix memory leak in p54_beacon_update()
wifi: cfg80211: treat deprecated INDOOR_SP_AP_OLD control value as LPI mode
wifi: rtw88: sdio: Migrate to use sdio specific shutdown function
wifi: rsi: sdio: Migrate to use sdio specific shutdown function
sdio: Provide a bustype shutdown function
wifi: nl80211/cfg80211: support operating as RSTA in PMSR FTM request
wifi: nl80211/cfg80211: add negotiated burst period to FTM result
wifi: nl80211/cfg80211: clarify periodic FTM parameters for non-EDCA based ranging
wifi: nl80211/cfg80211: add new FTM capabilities
wifi: iwlwifi: rename struct iwl_mcc_allowed_ap_type_cmd::offset_map
wifi: iwlwifi: mvm: Remove link_id from time_events
wifi: iwlwifi: mld: change cluster_id type to u8 array
wifi: iwlwifi: support V13 of iwl_lari_config_change_cmd
wifi: iwlwifi: split bios_value_u32 to separate the header
wifi: iwlwifi: uefi: cache the DSM functions
wifi: iwlwifi: acpi: cache the DSM functions
wifi: iwlwifi: mvm: Cleanup MLO code
...
====================
Link: https://patch.msgid.link/20260129110136.176980-39-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
fl6_update_dst() is called for every TCP (and others) transmit,
and is a nop for common cases.
Split it in two parts :
1) fl6_update_dst() inline helper, small and fast.
2) __fl6_update_dst() for the exception, out of line.
Small size increase to get better TX performance.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/2 grow/shrink: 8/0 up/down: 296/-125 (171)
Function old new delta
__fl6_update_dst - 104 +104
rawv6_sendmsg 2244 2284 +40
udpv6_sendmsg 3013 3043 +30
tcp_v6_connect 1514 1534 +20
cookie_v6_check 1501 1519 +18
ip6_datagram_dst_update 673 690 +17
inet6_sk_rebuild_header 499 516 +17
inet6_csk_route_socket 507 524 +17
inet6_csk_route_req 343 360 +17
__pfx___fl6_update_dst - 16 +16
__pfx_fl6_update_dst 16 - -16
fl6_update_dst 109 - -109
Total: Before=22570304, After=22570475, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260128185548.3738781-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
skb_header_pointer() does not fully validate negative @offset values.
Use skb_header_pointer_careful() instead.
GangMin Kim provided a report and a repro fooling u32_classify():
BUG: KASAN: slab-out-of-bounds in u32_classify+0x1180/0x11b0
net/sched/cls_u32.c:221
Fixes: fbc2e7d9cf49 ("cls_u32: use skb_header_pointer() to dereference data safely")
Reported-by: GangMin Kim <km.kim1503@gmail.com>
Closes: https://lore.kernel.org/netdev/CANn89iJkyUZ=mAzLzC4GdcAgLuPnUoivdLaOs6B9rq5_erj76w@mail.gmail.com/T/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260128141539.3404400-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.19-rc8).
No adjacent changes, conflicts:
drivers/net/ethernet/spacemit/k1_emac.c
2c84959167d64 ("net: spacemit: Check for netif_carrier_ok() in emac_stats_update()")
f66086798f91f ("net: spacemit: Remove broken flow control support")
https://lore.kernel.org/aXjAqZA3iEWD_DGM@sirena.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The xenwatch thread can race with other back-end change notifications
and call xen_9pfs_front_free() twice, hitting the observed general
protection fault due to a double-free. Guard the teardown path so only
one caller can release the front-end state at a time, preventing the
crash.
This is a fix for the following double-free:
[ 27.052347] Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI
[ 27.052357] CPU: 0 UID: 0 PID: 32 Comm: xenwatch Not tainted 6.18.0-02087-g51ab33fc0a8b-dirty #60 PREEMPT(none)
[ 27.052363] RIP: e030:xen_9pfs_front_free+0x1d/0x150
[ 27.052368] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 41 55 41 54 55 48 89 fd 48 c7 c7 48 d0 92 85 53 e8 cb cb 05 00 48 8b 45 08 48 8b 55 00 <48> 3b 28 0f 85 f9 28 35 fe 48 3b 6a 08 0f 85 ef 28 35 fe 48 89 42
[ 27.052377] RSP: e02b:ffffc9004016fdd0 EFLAGS: 00010246
[ 27.052381] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88800d66e400 RCX: 0000000000000000
[ 27.052385] RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000000 RDI: 0000000000000000
[ 27.052389] RBP: ffff88800a887040 R08: 0000000000000000 R09: 0000000000000000
[ 27.052393] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888009e46b68
[ 27.052397] R13: 0000000000000200 R14: 0000000000000000 R15: ffff88800a887040
[ 27.052404] FS: 0000000000000000(0000) GS:ffff88808ca57000(0000) knlGS:0000000000000000
[ 27.052408] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 27.052412] CR2: 00007f9714004360 CR3: 0000000004834000 CR4: 0000000000050660
[ 27.052418] Call Trace:
[ 27.052420] <TASK>
[ 27.052422] xen_9pfs_front_changed+0x5d5/0x720
[ 27.052426] ? xenbus_otherend_changed+0x72/0x140
[ 27.052430] ? __pfx_xenwatch_thread+0x10/0x10
[ 27.052434] xenwatch_thread+0x94/0x1c0
[ 27.052438] ? __pfx_autoremove_wake_function+0x10/0x10
[ 27.052442] kthread+0xf8/0x240
[ 27.052445] ? __pfx_kthread+0x10/0x10
[ 27.052449] ? __pfx_kthread+0x10/0x10
[ 27.052452] ret_from_fork+0x16b/0x1a0
[ 27.052456] ? __pfx_kthread+0x10/0x10
[ 27.052459] ret_from_fork_asm+0x1a/0x30
[ 27.052463] </TASK>
[ 27.052465] Modules linked in:
[ 27.052471] ---[ end trace 0000000000000000 ]---
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
Message-ID: <20260129230348.2390470-1-stefano.stabellini@amd.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
This attempts to proper track outstanding request by using struct ida
and allocating from it in l2cap_get_ident using ida_alloc_range which
would reuse ids as they are free, then upon completion release
the id using ida_free.
This fixes the qualification test case L2CAP/COS/CED/BI-29-C which
attempts to check if the host stack is able to work after 256 attempts
to connect which requires Ident field to use the full range of possible
values in order to pass the test.
Link: https://github.com/bluez/bluez/issues/1829
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
|
|
Fix memory leak in set_ssp_complete() where mgmt_pending_cmd structures
are not freed after being removed from the pending list.
Commit 302a1f674c00 ("Bluetooth: MGMT: Fix possible UAFs") replaced
mgmt_pending_foreach() calls with individual command handling but missed
adding mgmt_pending_free() calls in both error and success paths of
set_ssp_complete(). Other completion functions like set_le_complete()
were fixed correctly in the same commit.
This causes a memory leak of the mgmt_pending_cmd structure and its
associated parameter data for each SSP command that completes.
Add the missing mgmt_pending_free(cmd) calls in both code paths to fix
the memory leak. Also fix the same issue in set_advertising_complete().
Fixes: 302a1f674c00 ("Bluetooth: MGMT: Fix possible UAFs")
Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Export hci_discovery_active() so it can be used by bluetooth
drivers built as modules.
Signed-off-by: Linmao Li <lilinmao@kylinos.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This renames the PHY fields in bt_iso_io_qos to PHYs (plural) since it
represents a bitfield where multiple PHYs can be set and make the same
change also to HCI_OP_LE_SET_CIG_PARAMS since both c_phy and p_phy
fields are bitfields.
This also fixes the assumption that hci_evt_le_cis_established PHYs
fields are compatible with bt_iso_io_qos, they are not, the fields in
hci_evt_le_cis_established represent just a single PHY value so they
need to be converted to bitfield when set in bt_iso_io_qos.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This enables client to use setsockopt(BT_PHY) to set the connection
packet type/PHY:
Example setting BT_PHY_BR_1M_1SLOT:
< HCI Command: Change Conne.. (0x01|0x000f) plen 4
Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation)
Packet type: 0x331e
2-DH1 may not be used
3-DH1 may not be used
DM1 may be used
DH1 may be used
2-DH3 may not be used
3-DH3 may not be used
2-DH5 may not be used
3-DH5 may not be used
> HCI Event: Command Status (0x0f) plen 4
Change Connection Packet Type (0x01|0x000f) ncmd 1
Status: Success (0x00)
> HCI Event: Connection Packet Typ.. (0x1d) plen 5
Status: Success (0x00)
Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation)
Packet type: 0x331e
2-DH1 may not be used
3-DH1 may not be used
DM1 may be used
DH1 may be used
2-DH3 may not be used
3-DH3 may not be used
2-DH5 may not be used
Example setting BT_PHY_LE_1M_TX and BT_PHY_LE_1M_RX:
< HCI Command: LE Set PHY (0x08|0x0032) plen 7
Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation)
All PHYs preference: 0x00
TX PHYs preference: 0x01
LE 1M
RX PHYs preference: 0x01
LE 1M
PHY options preference: Reserved (0x0000)
> HCI Event: Command Status (0x0f) plen 4
LE Set PHY (0x08|0x0032) ncmd 1
Status: Success (0x00)
> HCI Event: LE Meta Event (0x3e) plen 6
LE PHY Update Complete (0x0c)
Status: Success (0x00)
Handle: 1 Address: 00:AA:01:01:00:00 (Intel Corporation)
TX PHY: LE 1M (0x01)
RX PHY: LE 1M (0x01)
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
1. Implement LE Event Mask to include events required for
LE Channel Sounding
2. Enable Channel Sounding feature bit in the
LE Host Supported Features command
3. Define HCI command and event structures necessary for
LE Channel Sounding functionality
Signed-off-by: Naga Bhavani Akella <naga.akella@oss.qualcomm.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
While the configurable system parameters allow controlling the SNIFF
mode parameters, they do not include the idle_timeout parameter
responsible for enabling SNIFF mode.
Add the idle_timeout parameter to allow controlling the idle timeout
of BR/EDR connections.
Signed-off-by: Stefan Sørensen <ssorensen@roku.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
The connection link policy is only set when establishing an outgoing
ACL connection causing connection idle modes not to be available on
incoming connections. Move the setting of the link policy to the
creation of the connection so all ACL connection will use the link
policy set on the HCI device.
Signed-off-by: Stefan Sørensen <ssorensen@roku.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
hci_conn_enter_active_mode() uses queue_delayed_work() with the
intention that the work will run after the given timeout. However,
queue_delayed_work() does nothing if the work is already queued, so
depending on the link policy we may end up putting the connection
into idle mode every hdev->idle_timeout ms.
Use mod_delayed_work() instead so the work is queued if not already
queued, and the timeout is updated otherwise.
Signed-off-by: Stefan Sørensen <ssorensen@roku.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|