| Age | Commit message | Author |
|
llc_shdlc_deinit() purges SHDLC skb queues and frees the llc_shdlc
structure while its timers and state machine work may still be active.
Timer callbacks can schedule sm_work, and sm_work accesses SHDLC state
and the skb queues. If teardown happens in parallel with a queued/running
work item, it can lead to a use-after-free (UAF) and other shutdown races.
Stop all SHDLC timers and cancel sm_work synchronously before purging the
queues and freeing the context.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 4a61cd6687fc ("NFC: Add an shdlc llc module to llc core")
Signed-off-by: Votokina Victoria <Victoria.Votokina@kaspersky.com>
Link: https://patch.msgid.link/20260203113158.2008723-1-Victoria.Votokina@kaspersky.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Yizhou Zhao reported that simply having one RAW socket on protocol
IPPROTO_RAW (255) was dangerous.
    socket(AF_INET, SOCK_RAW, 255);
A malicious incoming ICMP packet can set the protocol field to 255
and match this socket, leading to FNHE cache changes.
    inner = IP(src="192.168.2.1", dst="8.8.8.8", proto=255)/Raw("TEST")
    pkt = IP(src="192.168.1.1", dst="192.168.2.1")/ICMP(type=3, code=4, nexthopmtu=576)/inner
"man 7 raw" states:
    A protocol of IPPROTO_RAW implies enabled IP_HDRINCL and is able
    to send any IP protocol that is specified in the passed header.
    Receiving of all IP protocols via IPPROTO_RAW is not possible
    using raw sockets.
Make sure we drop these malicious packets.
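The rule above can be sketched in userspace C. This is only an illustration of the check, not the kernel patch: the helper name raw_icmp_err_allowed() is hypothetical, and only the IPPROTO_RAW (255) value and the raw(7) behaviour come from the text.

```c
#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h>   /* IPPROTO_RAW */

/*
 * Hypothetical helper: decide whether an ICMP error whose embedded (inner)
 * IP header carries 'inner_proto' may be matched against raw sockets.
 * Per raw(7), IPPROTO_RAW sockets cannot receive, so an inner protocol
 * of 255 is never a legitimate match and the packet is dropped.
 */
bool raw_icmp_err_allowed(uint8_t inner_proto)
{
	return inner_proto != IPPROTO_RAW;
}
```

With this model, the crafted packet shown earlier (inner proto=255) is rejected, while ICMP errors quoting ordinary protocols still reach their sockets.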
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/netdev/20251109134600.292125-1-zhaoyz24@mails.tsinghua.edu.cn/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260203192509.682208-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tipc_aead_users_dec() calls rcu_dereference(aead) twice: once to store
in 'tmp' for the NULL check, and again inside the atomic_add_unless()
call.
Use the already-dereferenced 'tmp' pointer consistently, matching the
correct pattern used in tipc_aead_users_inc() and tipc_aead_users_set().
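A userspace sketch of the "load once, then use the same pointer" pattern, using C11 atomics in place of RCU. The names struct aead, add_unless() and users_dec() are stand-ins invented for this illustration; they only mirror the shape of the kernel helpers.

```c
#include <stdatomic.h>
#include <stddef.h>

struct aead {
	atomic_int users;
};

/* Stand-in for atomic_add_unless(): add 'a' to *v unless *v == 'u'.
 * Returns 1 if the addition happened, 0 otherwise. */
static int add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure 'old' is refreshed and the limit is re-checked. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return 1;
	}
	return 0;
}

/* Decrement the user count unless it already equals 'lim'. */
void users_dec(_Atomic(struct aead *) *slot, int lim)
{
	/* Single load: the NULL check and the update see the same object. */
	struct aead *tmp = atomic_load(slot);

	if (tmp)
		add_unless(&tmp->users, -1, lim);
}
```

Loading the pointer a second time would allow the update to operate on a different object than the one that was NULL-checked.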
Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication")
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Hodges <hodgesd@meta.com>
Link: https://patch.msgid.link/20260203145621.17399-1-git@danielhodges.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.19-rc9).
No adjacent changes, conflicts:
drivers/net/ethernet/spacemit/k1_emac.c
3125fc1701694 ("net: spacemit: k1-emac: fix jumbo frame support")
f66086798f91f ("net: spacemit: Remove broken flow control support")
https://lore.kernel.org/aYIysFIE9ooavWia@sirena.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently we are registering one dynamic lockdep key for each allocated
qdisc, to avoid false deadlock reports when mirred (or TC eBPF) redirects
packets to another device while the root lock is acquired [1].
Since dynamic keys are a limited resource, we can save them at least for
qdiscs that are not meant to acquire the root lock in the traffic path,
or to carry traffic at all, like:
- clsact
- ingress
- noqueue
Don't register dynamic keys for the above schedulers, so that we hit
MAX_LOCKDEP_KEYS later in our tests.
[1] https://github.com/multipath-tcp/mptcp_net-next/issues/451
Changes in v2:
- change ordering of spin_lock_init() vs. lockdep_register_key()
(Jakub Kicinski)
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/94448f7fa7c4f52d2ce416a4895ec87d456d7417.1770220576.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
TCP v6 spends a good amount of time rebuilding a fresh fl6 at each
transmit in inet6_csk_xmit()/inet6_csk_route_socket().
TCP v4 caches the information in inet->cork.fl.u.ip4 instead.
This patch is a first step converting IPv6 to the same strategy:
Before this patch inet6_sk_rebuild_header() only validated/rebuilt
a dst. Automatic variable @fl6 content was lost.
After this patch inet6_sk_rebuild_header() also initializes
inet->cork.fl.u.ip6, which can be reused in the future.
This makes inet6_sk_rebuild_header() very similar to
inet_sk_rebuild_header().
Also remove the EXPORT_SYMBOL_GPL(), inet6_sk_rebuild_header()
is not called from any module.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260204163035.4123817-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Inlining __reqsk_free() is overkill, let's reclaim 2 Kbytes of text.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/4 grow/shrink: 2/14 up/down: 225/-2338 (-2113)
Function                           old     new   delta
__reqsk_free                         -     114    +114
sock_edemux                         18      82     +64
inet_csk_listen_start              233     264     +31
__pfx___reqsk_free                   -      16     +16
__pfx_reqsk_queue_alloc             16       -     -16
__pfx_reqsk_free                    16       -     -16
reqsk_queue_alloc                   46       -     -46
tcp_req_err                        272     177     -95
reqsk_fastopen_remove              348     253     -95
cookie_bpf_check                   157      62     -95
cookie_tcp_reqsk_alloc             387     290     -97
cookie_v4_check                   1568    1465    -103
reqsk_free                         105       -    -105
cookie_v6_check                   1519    1412    -107
sock_gen_put                       187      78    -109
sock_pfree                         212      82    -130
tcp_try_fastopen                  1818    1683    -135
tcp_v4_rcv                        3478    3294    -184
reqsk_put                          306      90    -216
tcp_get_cookie_sock                551     318    -233
tcp_v6_rcv                        3404    3141    -263
tcp_conn_request                  2677    2384    -293
Total: Before=24887415, After=24885302, chg -0.01%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After DCCP removal, this file was not needed any more.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This function belongs to the TCP stack, not to net/core/request_sock.c.
We get rid of the now empty request_sock.c in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Only called once from inet_csk_listen_start(), it can be static.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot reported a kernel BUG in fib6_add_rt2node() when adding an IPv6
route. [0]
Commit f72514b3c569 ("ipv6: clear RA flags when adding a static
route") introduced logic to clear RTF_ADDRCONF from existing routes
when a static route with the same nexthop is added. However, this
causes a problem when the existing route has a gateway.
When RTF_ADDRCONF is cleared from a route that has a gateway, that
route becomes eligible for ECMP, i.e. rt6_qualify_for_ecmp() returns
true. The issue is that this route was never added to the
fib6_siblings list.
This leads to a mismatch between the following counts:
- The sibling count computed by iterating fib6_next chain, which
includes the newly ECMP-eligible route
- The actual siblings in fib6_siblings list, which does not include
that route
When a subsequent ECMP route is added, fib6_add_rt2node() hits
BUG_ON(sibling->fib6_nsiblings != rt->fib6_nsiblings) because the
counts don't match.
Fix this by only clearing RTF_ADDRCONF when the existing route does
not have a gateway. Routes without a gateway cannot qualify for ECMP
anyway (rt6_qualify_for_ecmp() requires fib_nh_gw_family), so clearing
RTF_ADDRCONF on them is safe and matches the original intent of the
commit.
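The reasoning can be modelled in a few lines of userspace C. This is a simplified sketch: struct route, RT_FLAG_ADDRCONF and both helpers are hypothetical stand-ins, and qualifies_for_ecmp() keeps only the two conditions discussed here (no RA flag, gateway present).

```c
#include <stdbool.h>
#include <stdint.h>

#define RT_FLAG_ADDRCONF 0x1u   /* hypothetical stand-in for RTF_ADDRCONF */

struct route {
	uint32_t flags;
	bool has_gateway;       /* stand-in for a non-zero fib_nh_gw_family */
};

/* Simplified model of rt6_qualify_for_ecmp(). */
bool qualifies_for_ecmp(const struct route *rt)
{
	return !(rt->flags & RT_FLAG_ADDRCONF) && rt->has_gateway;
}

/* Clear the RA flag only when that cannot change ECMP eligibility. */
void clear_ra_flag_if_safe(struct route *rt)
{
	if (!rt->has_gateway)
		rt->flags &= ~RT_FLAG_ADDRCONF;
}
```

In this model a gateway route keeps its flag and stays out of ECMP, so the sibling count derived from eligibility cannot drift away from the sibling list.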
[0]:
kernel BUG at net/ipv6/ip6_fib.c:1217!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6010 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
RIP: 0010:fib6_add_rt2node+0x3433/0x3470 net/ipv6/ip6_fib.c:1217
[...]
Call Trace:
<TASK>
fib6_add+0x8da/0x18a0 net/ipv6/ip6_fib.c:1532
__ip6_ins_rt net/ipv6/route.c:1351 [inline]
ip6_route_add+0xde/0x1b0 net/ipv6/route.c:3946
ipv6_route_ioctl+0x35c/0x480 net/ipv6/route.c:4571
inet6_ioctl+0x219/0x280 net/ipv6/af_inet6.c:577
sock_do_ioctl+0xdc/0x300 net/socket.c:1245
sock_ioctl+0x576/0x790 net/socket.c:1366
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: f72514b3c569 ("ipv6: clear RA flags when adding a static route")
Reported-by: syzbot+cb809def1baaac68ab92@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cb809def1baaac68ab92
Tested-by: syzbot+cb809def1baaac68ab92@syzkaller.appspotmail.com
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260204095837.1285552-1-syoshida@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
skb_protocol() is bloated, and forces slow stack canaries in many
fast paths.
Add vlan_get_protocol_offset_inline() which deals with the non-vlan
common cases.
__vlan_get_protocol_offset() is now out of line.
It returns a vlan_type_depth struct to avoid stack canaries in callers.
    struct vlan_type_depth {
        __be16 type;
        u16 depth;
    };
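A userspace sketch of the split between an inline fast path and an out-of-line slow path that returns a small struct by value. All names here (struct type_depth, parse_proto(), parse_vlan_slow(), the constants) are hypothetical; the type is kept in host byte order for simplicity, unlike the __be16 above, and the noinline attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

#define L2_HLEN      14       /* Ethernet header length */
#define TAG_HLEN      4       /* one VLAN tag */
#define TYPE_8021Q   0x8100
#define TYPE_8021AD  0x88A8

struct type_depth {
	uint16_t type;    /* ethertype, host byte order in this sketch */
	uint16_t depth;   /* L2 bytes consumed; 0 means a truncated frame */
};

static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static int is_vlan(uint16_t t)
{
	return t == TYPE_8021Q || t == TYPE_8021AD;
}

/* Slow path, kept out of line: walk stacked VLAN tags.
 * Callers guarantee len >= L2_HLEN. */
__attribute__((noinline))
struct type_depth parse_vlan_slow(const uint8_t *frame, size_t len)
{
	struct type_depth td = { 0, 0 };
	size_t off = L2_HLEN;
	uint16_t type = rd16(frame + 12);

	while (is_vlan(type)) {
		if (off + TAG_HLEN > len)
			return td;                 /* truncated */
		type = rd16(frame + off + 2);      /* encapsulated type */
		off += TAG_HLEN;
	}
	td.type = type;
	td.depth = (uint16_t)off;
	return td;
}

/* Fast path, inlined: handles the common non-VLAN case directly. */
static inline struct type_depth parse_proto(const uint8_t *frame, size_t len)
{
	struct type_depth td = { 0, 0 };
	uint16_t type;

	if (len < L2_HLEN)
		return td;
	type = rd16(frame + 12);
	if (!is_vlan(type)) {
		td.type = type;
		td.depth = L2_HLEN;
		return td;
	}
	return parse_vlan_slow(frame, len);
}
```

Returning both values in one small struct means the caller does not have to pass the address of a local variable to receive the depth.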
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 0/22 up/down: 0/-6320 (-6320)
Function                           old     new   delta
vlan_get_protocol_dgram             61      59      -2
__pfx_skb_protocol                  16       -     -16
__vlan_get_protocol_offset         307     273     -34
tap_get_user                      1374    1207    -167
ip_md_tunnel_xmit                 1625    1452    -173
tap_sendmsg                        940     753    -187
netif_skb_features                1079     866    -213
netem_enqueue                     3017    2800    -217
vlan_parse_protocol                271      50    -221
tso_start                          567     344    -223
fq_dequeue                        1908    1685    -223
skb_network_protocol               434     205    -229
ip6_tnl_xmit                      2639    2409    -230
br_dev_queue_push_xmit             474     236    -238
skb_protocol                       258       -    -258
packet_parse_headers               621     357    -264
__ip6_tnl_rcv                     1306    1039    -267
skb_csum_hwoffload_help            515     224    -291
ip_tunnel_xmit                    2635    2339    -296
sch_frag_xmit_hook                1582    1233    -349
bpf_skb_ecn_set_ce                 868     457    -411
IP6_ECN_decapsulate               1297     768    -529
ip_tunnel_rcv                     2121    1489    -632
ipip6_rcv                         2572    1922    -650
Total: Before=24892803, After=24886483, chg -0.03%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260204053023.1622775-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
As CAN skbs don't use IP checksums the skb->csum_start variable was used to
store the can-gw CAN frame time-to-live counter together with
skb->ip_summed set to CHECKSUM_UNNECESSARY.
Remove the 'hack' using the skb->csum_start variable and move the content
to can_skb_ext::can_gw_hops of the CAN skb extensions.
The module parameter 'max_hops' has been reduced to a single byte to fit
can_skb_ext::can_gw_hops as the maximum value to be stored is 6.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-6-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This patch removes struct can_skb_priv which was stored at skb->head and
the can_skb_reserve() helper which was used to shift skb->head.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-5-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When routing CAN frames over different CAN interfaces the interface index
skb->iif is overwritten with every single hop. To prevent sending a CAN
frame back to its originating (first) incoming CAN interface another
ifindex variable is needed, which was stored in can_skb_priv::ifindex.
Move the can_skb_priv::ifindex content to can_skb_ext::can_iif.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-3-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
To remove the private CAN bus skb headroom infrastructure 8 bytes need to
be stored in the skb. The skb extensions are a common pattern and an easy
and efficient way to hold private data travelling along with the skb. We
only need the skb_ext_add() and skb_ext_find() functions to allocate and
access CAN specific content as the skb helpers to copy/clone/free skbs
automatically take care of skb extensions and their final removal.
This patch introduces the complete CAN skb extensions infrastructure:
- add struct can_skb_ext in new file include/net/can.h
- add include/net/can.h in MAINTAINERS
- add SKB_EXT_CAN to skbuff.c and skbuff.h
- select SKB_EXTENSIONS in Kconfig when CONFIG_CAN is enabled
- check for existing CAN skb extensions in can_rcv() in af_can.c
- add CAN skb extensions allocation at every skb_alloc() location
- duplicate the skb extensions if cloning outgoing skbs (framelen/gw_hops)
- introduce can_skb_ext_add() and can_skb_ext_find() helpers
The patch also corrects an indentation issue in the original code from 2018:
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602010426.PnGrYAk3-lkp@intel.com/
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-2-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The can_skb_priv::skbcnt variable is used to identify CAN skbs in the RX
path, analogous to skb->hash.
As the skb hash is not filled in CAN skbs, move the private skbcnt value to
skb->hash and set skb->sw_hash accordingly. The skb->hash is a value used
for RPS to identify skbs. Use it as intended.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-1-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
nft_map_catchall_activate() has an inverted element activity check
compared to its non-catchall counterpart nft_mapelem_activate() and
compared to what is logically required.
nft_map_catchall_activate() is called from the abort path to re-activate
catchall map elements that were deactivated during a failed transaction.
It should skip elements that are already active (they don't need
re-activation) and process elements that are inactive (they need to be
restored). Instead, the current code does the opposite: it skips inactive
elements and processes active ones.
Compare the non-catchall activate callback, which is correct:
    nft_mapelem_activate():
        if (nft_set_elem_active(ext, iter->genmask))
            return 0; /* skip active, process inactive */
With the buggy catchall version:
    nft_map_catchall_activate():
        if (!nft_set_elem_active(ext, genmask))
            continue; /* skip inactive, process active */
The consequence is that when a DELSET operation is aborted,
nft_setelem_data_activate() is never called for the catchall element.
For NFT_GOTO verdict elements, this means nft_data_hold() is never
called to restore the chain->use reference count. Each abort cycle
permanently decrements chain->use. Once chain->use reaches zero,
DELCHAIN succeeds and frees the chain while catchall verdict elements
still reference it, resulting in a use-after-free.
This is exploitable for local privilege escalation from an unprivileged
user via user namespaces + nftables on distributions that enable
CONFIG_USER_NS and CONFIG_NF_TABLES.
Fix by removing the negation so the check matches nft_mapelem_activate():
skip active elements, process inactive ones.
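The corrected loop can be modelled in userspace C. This is a toy model under stated assumptions: struct elem and activate_on_abort() are hypothetical, the 'active' flag stands in for nft_set_elem_active(), and 'refs_held' stands in for the chain->use reference that activation restores.

```c
#include <stdbool.h>
#include <stddef.h>

struct elem {
	bool active;     /* stand-in for nft_set_elem_active(ext, genmask) */
	int  refs_held;  /* stand-in for the chain->use reference */
};

/* Abort path: re-activate only elements that were deactivated.
 * Returns how many elements were restored. */
size_t activate_on_abort(struct elem *elems, size_t n)
{
	size_t restored = 0;

	for (size_t i = 0; i < n; i++) {
		if (elems[i].active)
			continue;            /* already active: skip */
		elems[i].active = true;      /* restore the element */
		elems[i].refs_held++;        /* and the reference it holds */
		restored++;
	}
	return restored;
}
```

With the inverted test, the inactive element would be skipped and its reference never restored, which is the leak of chain->use described above.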
Fixes: 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase")
Signed-off-by: Andrew Fasano <andrew.fasano@nist.gov>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Even though a peer may have already received a
non-zero value for "RDS_EXTHDR_NPATHS" from a node in the past,
the current peer may not.
Therefore it is important to initiate another rds_send_ping()
after a re-connect to any peer:
It is unknown at that time if we're still talking to the same
instance of RDS kernel modules on the other side.
Otherwise, the peer may just operate on a single lane
("c_npaths == 0"), not knowing that more lanes are available.
However, if "c_with_sport_idx" is supported,
we also need to check that the connection we accepted on lane#0
meets the proper source port modulo requirement, as we fan out:
Since the exchange of "RDS_EXTHDR_NPATHS" and "RDS_EXTHDR_SPORT_IDX"
is asynchronous, initially we have no choice but to accept an incoming
connection (via "accept") in the first slot ("cp_index == 0")
for backwards compatibility.
But that very connection may have come from a different lane
with "cp_index != 0", since the peer thought that we already understood
and handled "c_with_sport_idx" properly, as indicated by a previous
exchange before a module was reloaded.
In short:
If a module gets reloaded, we recover from that, but do *not*
allow a downgrade to support fewer lanes.
Downgrades would require us to merge messages from separate lanes,
which is rather tricky with the current RDS design.
Each lane has its own sequence number space and all messages
would need to be re-sequenced as we merge, all while
handling "RDS_FLAG_RETRANSMITTED" and "cp_retrans" properly.
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-9-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Instead of just blocking the sender until "c_npaths" is known
(it gets updated upon the receipt of a MPRDS PONG message),
simply use the first lane (cp_index#0).
But just using the first lane isn't enough.
As soon as we enqueue messages on a different lane, we'd run the risk
of out-of-order delivery of RDS messages.
Earlier messages enqueued on "cp_index == 0" could be delivered later
than more recent messages enqueued on "cp_index > 0", mostly because of
possible head of line blocking issues causing the first lane to be
slower.
To avoid that, we simply take a snapshot of "cp_next_tx_seq" at the
time we're about to fan-out to more lanes.
Then we delay the transmission of messages enqueued on other lanes
with "cp_index > 0" until cp_index#0 has caught up with the delivery of
new messages (from "cp_send_queue") as well as in-flight
messages (from "cp_retrans") that haven't been acknowledged yet
by the receiver.
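The gating rule can be sketched as a small predicate in userspace C. This is an assumption-laden model, not the RDS implementation: struct lane0_state, its fields and lane_may_transmit() are hypothetical, and sequence-number wraparound is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

struct lane0_state {
	uint64_t fanout_snapshot;  /* cp_next_tx_seq captured at fan-out time */
	uint64_t oldest_pending;   /* lowest seq still queued or unacked on lane 0 */
	bool     has_pending;      /* false if lane 0 has nothing outstanding */
};

/* May a lane with the given index transmit right now? */
bool lane_may_transmit(unsigned int cp_index, const struct lane0_state *l0)
{
	if (cp_index == 0)
		return true;   /* the first lane is never held back */

	/* Lane 0 has caught up once nothing older than the snapshot remains,
	 * neither unsent nor sent-but-unacknowledged. */
	return !l0->has_pending || l0->oldest_pending >= l0->fanout_snapshot;
}
```

A sender loop would retry later whenever this returns false for a lane with "cp_index > 0".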
We also add a new counter "mprds_catchup_tx0_retries" to keep track
of how many times "rds_send_xmit" had to suspend activities,
because it was waiting for the first lane to catch up.
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-8-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Quick clean up to avoid checkpatch errors when adding members to
this struct (Prefer kernel type 'u64' over 'uint64_t').
No functional changes added.
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-7-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When canceling the reconnect worker, care must be taken to reset the
reconnect-pending bit. If the reconnect worker has not yet been
scheduled before it is canceled, the reconnect-pending bit will stay
on forever.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-6-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In cases where the server (the node with the higher IP-address)
in an RDS/TCP connection is overwhelmed it is possible that the
socket that was just accepted is chock-full of messages, up to
the limit of what the socket receive buffer permits.
Subsequently, "rds_tcp_data_ready" won't be called anymore,
because there is no more space to receive additional messages.
Nor was it called prior to the point of calling "rds_tcp_set_callbacks",
because the "sk_data_ready" pointer didn't even point to
"rds_tcp_data_ready" yet.
We fix this by simply kick-starting the receive-worker
for all cases where the socket state is neither
"TCP_CLOSE_WAIT" nor "TCP_CLOSE".
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-5-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
RDS/TCP differs from RDS/RDMA in that message acknowledgment
is done based on TCP sequence numbers:
As soon as the last byte of a message has been acknowledged by the
TCP stack of a peer, rds_tcp_write_space() goes on to discard
prior messages from the send queue.
Which is fine, for as long as the receiver never throws any messages
away.
The dequeuing of messages in RDS/TCP is done either from the
"sk_data_ready" callback pointing to rds_tcp_data_ready()
(the most common case), or from the receive worker pointing
to rds_tcp_recv_path() which is called for as long as the
connection is "RDS_CONN_UP".
However, as soon as rds_conn_path_drop() is called for whatever reason,
including "DR_USER_RESET", "cp_state" transitions to "RDS_CONN_ERROR",
and rds_tcp_restore_callbacks() ends up restoring the callbacks
and thereby disabling message receipt.
So messages already acknowledged to the sender were dropped.
Furthermore, the "->shutdown" callback was always called
with an invalid parameter ("RCV_SHUTDOWN | SEND_SHUTDOWN == 3"),
instead of the correct pre-increment value ("SHUT_RDWR == 2").
inet_shutdown() returns "-EINVAL" in such cases, rendering
this call a NOOP.
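The argument check can be reproduced in userspace C. The function below is a sketch modelled on the pre-increment described above, with hypothetical names; it returns the resulting shutdown flag bits, or -EINVAL for an invalid argument.

```c
#include <errno.h>

#define RCV_SHUTDOWN_BIT   1
#define SEND_SHUTDOWN_BIT  2
#define SHUTDOWN_MASK_ALL  (RCV_SHUTDOWN_BIT | SEND_SHUTDOWN_BIT)

/*
 * 'how' uses the userspace encoding: SHUT_RD == 0, SHUT_WR == 1,
 * SHUT_RDWR == 2. The increment maps it onto the flag bits
 * (1 = receive, 2 = send, 3 = both).
 */
int shutdown_how_to_flags(int how)
{
	how++;
	if ((how & ~SHUTDOWN_MASK_ALL) || !how)
		return -EINVAL;
	return how;
}
```

Passing the flag value 3 directly becomes 4 after the increment and is rejected, while SHUT_WR (1) maps to the send bit and SHUT_RDWR (2) maps to both bits.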
So we change rds_tcp_conn_path_shutdown() to do the proper
"->shutdown(SHUT_WR)" call in order to signal EOF to the peer
and make it transition to "TCP_CLOSE_WAIT" (RFC 793).
This should make the peer also enter rds_tcp_conn_path_shutdown()
and do the same.
This allows us to dequeue all messages already received
and acknowledged to the peer.
We do so, until we know that the receive queue no longer has data
(skb_queue_empty()) and that we couldn't have any data
in flight anymore, because the socket transitioned to
any of the states "CLOSING", "TIME_WAIT", "CLOSE_WAIT",
"LAST_ACK", or "CLOSE" (RFC 793).
However, if we do just that, we suddenly see duplicate RDS
messages being delivered to the application.
So what gives?
Turns out that with MPRDS and its multitude of backend connections,
retransmitted messages ("RDS_FLAG_RETRANSMITTED") can outrace
the dequeuing of their original counterparts.
And the duplicate check implemented in rds_recv_local() only
discards duplicates if flag "RDS_FLAG_RETRANSMITTED" is set.
Rather curious, because a duplicate is a duplicate; it shouldn't
matter which copy is looked at and delivered first.
To avoid this entire situation, we simply make the sender discard
messages from the send-queue right from within
rds_tcp_conn_path_shutdown(). Just like rds_tcp_write_space() would
have done, were it called in time or still called.
This makes sure that we no longer have messages that we know
the receiver already dequeued sitting in our send-queue,
and therefore avoid the entire "RDS_FLAG_RETRANSMITTED" fiasco.
Now we got rid of the duplicate RDS message delivery, but we
still run into cases where RDS messages are dropped.
This time it is due to the delayed setting of the socket-callbacks
in rds_tcp_accept_one() via either rds_tcp_reset_callbacks()
or rds_tcp_set_callbacks().
By the time rds_tcp_accept_one() gets there, the socket
may already have transitioned into state "TCP_CLOSE_WAIT",
but rds_tcp_state_change() was never called.
Subsequently, "->shutdown(SHUT_WR)" did not happen either.
So the peer ends up getting stuck in state "TCP_FIN_WAIT2".
We fix that by checking for states "TCP_CLOSE_WAIT", "TCP_LAST_ACK",
or "TCP_CLOSE" and drop the freshly accepted socket in that case.
This problem is observable by running "rds-stress --reset"
frequently on either of the two sides of a RDS connection,
or both while other "rds-stress" processes are exchanging data.
Those "rds-stress" processes reported out-of-sequence
errors, with the expected sequence number being smaller
than the one actually received (due to the dropped messages).
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-4-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Upon "sendmsg", RDS/TCP selects a backend connection based
on a hash calculated from the source-port ("RDS_MPATH_HASH").
However, "rds_tcp_accept_one" accepts connections
in the order they arrive, which is non-deterministic.
Therefore the mapping of the sender's "cp->cp_index"
to that of the receiver changes if the backend
connections are dropped and reconnected.
However, connection state that's preserved across reconnects
(e.g. "cp_next_rx_seq") relies on that sender<->receiver
mapping to never change.
So we make sure that client and server of the TCP connection
have the exact same "cp->cp_index" across reconnects by
encoding "cp->cp_index" in the lower three bits of the
client's TCP source port.
A new extension "RDS_EXTHDR_SPORT_IDX" is introduced,
that allows the server to tell the difference between
clients that do the "cp->cp_index" encoding, and
legacy clients that pick source ports randomly.
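The encoding can be sketched as two small helpers in userspace C. Only "lower three bits of the source port" comes from the text; the helper names and the way a base port is chosen are assumptions, and ports are treated in host byte order.

```c
#include <stdint.h>

#define LANE_BITS 3u
#define LANE_MASK ((1u << LANE_BITS) - 1u)   /* 0x7 */

/* Client side: fold cp_index into the low bits of a candidate port. */
uint16_t sport_with_index(uint16_t base_port, unsigned int cp_index)
{
	return (uint16_t)((base_port & ~LANE_MASK) | (cp_index & LANE_MASK));
}

/* Server side: recover cp_index from the peer's source port. */
unsigned int index_from_sport(uint16_t sport)
{
	return sport & LANE_MASK;
}
```

For example, base port 40001 with index 5 yields source port 40005, from which the server recovers index 5 regardless of the order in which connections are accepted.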
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce a new extension header type RDSV3_EXTHDR_RDMA_BYTES for
an RDMA initiator to send RDMA byte counts to its target.
Currently, RDMA operations cannot precisely account how many bytes a
peer just transferred via RDMA, which limits per-connection statistics
and future policy (e.g., monitoring or rate/cgroup accounting of RDMA
traffic).
In this patch we expand rds_message_add_extension() to accept multiple
extensions, and add a new flag to the RDS header, RDS_FLAG_EXTHDR_EXTENSION,
along with a new extension to the RDS header, rds_ext_header_rdma_bytes.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After 076433bd78d7 ("net_sched: sch_fq: add fast path
for mostly idle qdisc") we need to remove one unlikely()
because q->internal holds all the fast path packets.
    skb = fq_peek(&q->internal);
    if (unlikely(skb)) {
        q->internal.qlen--;
Calling INET_ECN_set_ce() is very unlikely.
These changes allow fq_dequeue_skb() to be (auto)inlined,
thus making fq_dequeue() faster.
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux
add/remove: 2/2 grow/shrink: 0/1 up/down: 283/-269 (14)
Function                           old     new   delta
INET_ECN_set_ce                      -     267    +267
__pfx_INET_ECN_set_ce                -      16     +16
__pfx_fq_dequeue_skb                16       -     -16
fq_dequeue_skb                     103       -    -103
fq_dequeue                        1685    1535    -150
Total: Before=24886569, After=24886583, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260203214716.880853-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix numerous kernel-doc warnings in iucv.[ch]:
- convert function documentation comments to a common (kernel-doc) look,
even for static functions (without "/**")
- use matching parameter and parameter description names
- use better wording in function descriptions (Jakub & AI)
- remove duplicate kernel-doc comments from the header file (Jakub)
Examples:
Warning: include/net/iucv/iucv.h:210 missing initial short description
on line: * iucv_unregister
Warning: include/net/iucv/iucv.h:216 function parameter 'handle' not
described in 'iucv_unregister'
Warning: include/net/iucv/iucv.h:467 function parameter 'answer' not
described in 'iucv_message_send2way'
Warning: net/iucv/iucv.c:727 missing initial short description on line:
* iucv_cleanup_queue
Build-tested with both "make htmldocs" and "make ARCH=s390 defconfig all".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20260203075248.1177869-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_check_space() is fat and not inlined.
Move its slow path into the (out of line) __tcp_check_space()
and make tcp_check_space() an inline function for better TCP performance.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/2 grow/shrink: 4/0 up/down: 708/-582 (126)
Function                           old     new   delta
__tcp_check_space                    -     521    +521
tcp_rcv_established               1860    1916     +56
tcp_rcv_state_process             3342    3384     +42
tcp_event_new_data_sent            248     286     +38
tcp_data_snd_check                  71     106     +35
__pfx___tcp_check_space              -      16     +16
__pfx_tcp_check_space               16       -     -16
tcp_check_space                    566       -    -566
Total: Before=24896373, After=24896499, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203050932.3522221-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_rbtree_insert() is primarily used from tcp_output.c
In tcp_input.c, only (slow path) tcp_collapse() uses it.
Move it to tcp_output.c to allow its (auto)inlining to improve
TCP tx fast path.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 4/1 up/down: 445/-115 (330)
Function                           old     new   delta
tcp_connect                       4277    4478    +201
tcp_event_new_data_sent            162     248     +86
tcp_send_synack                    780     862     +82
tcp_fragment                      1185    1261     +76
tcp_collapse                      1524    1409    -115
Total: Before=24896043, After=24896373, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203045110.3499713-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We trust MAX_TCP_HEADER to be large enough.
Using the inlined version of skb_push() trades 8 bytes
of text for better performance of TCP TX fast path.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/0 up/down: 8/0 (8)
Function                           old     new   delta
__tcp_transmit_skb                3181    3189      +8
Total: Before=24896035, After=24896043, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203044226.3489941-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Some more changes, including pulls from drivers:
- ath drivers: small features/cleanups
- rtw drivers: mostly refactoring for rtw89 RTL8922DE support
- mac80211: use hrtimers for CAC to avoid too long delays
- cfg80211/mac80211: some initial UHR (Wi-Fi 8) support
* tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (59 commits)
wifi: brcmsmac: phy: Remove unreachable error handling code
wifi: mac80211: Add eMLSR/eMLMR action frame parsing support
wifi: mac80211: add initial UHR support
wifi: cfg80211: add initial UHR support
wifi: ieee80211: add some initial UHR definitions
wifi: mac80211: use wiphy_hrtimer_work for CAC timeout
wifi: mac80211: correct ieee80211-{s1g/eht}.h include guard comments
wifi: ath12k: clear stale link mapping of ahvif->links_map
wifi: ath12k: Add support TX hardware queue stats
wifi: ath12k: Add support RX PDEV stats
wifi: ath12k: Fix index decrement when array_len is zero
wifi: ath12k: support OBSS PD configuration for AP mode
wifi: ath12k: add WMI support for spatial reuse parameter configuration
dt-bindings: net: wireless: ath11k-pci: deprecate 'firmware-name' property
wifi: ath11k: add usecase firmware handling based on device compatible
wifi: ath10k: sdio: add missing lock protection in ath10k_sdio_fw_crashed_dump()
wifi: ath10k: fix lock protection in ath10k_wmi_event_peer_sta_ps_state_chg()
wifi: ath10k: snoc: support powering on the device via pwrseq
wifi: rtw89: pci: warn if SPS OCP happens for RTL8922DE
wifi: rtw89: pci: restore LDO setting after device resume
...
====================
Link: https://patch.msgid.link/20260204121143.181112-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are two:
min_t(int, xxx, mptcp_wnd_end(msk) - msk->snd_nxt);
Both mptcp_wnd_end(msk) and msk->snd_nxt are u64, their difference
(aka the window size) might be limited to 32 bits - but that isn't
knowable from this code.
So checks being added to min_t() detect the potential discard of
significant bits.
Provided the 'avail_size' and return of mptcp_check_allowed_size()
are changed to an unsigned type (size_t matches the type the caller
uses) both min_t() can be changed to min().
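To illustrate the truncation described above, here is a minimal userspace sketch. SKETCH_MIN_T, allowed_size_int() and allowed_size_unsigned() are hypothetical stand-ins written for this example; they are not the kernel's min_t() or mptcp_check_allowed_size().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for min_t(): cast both operands, then compare. */
#define SKETCH_MIN_T(type, a, b) \
	((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Old shape: the u64 window is cast to int, so its upper bits are dropped. */
static int allowed_size_int(int avail_size, uint64_t wnd_end, uint64_t snd_nxt)
{
	return SKETCH_MIN_T(int, avail_size, wnd_end - snd_nxt);
}

/* New shape: compare in the wide unsigned type, narrow only the result. */
static size_t allowed_size_unsigned(size_t avail_size, uint64_t wnd_end,
				    uint64_t snd_nxt)
{
	uint64_t window = wnd_end - snd_nxt;

	assert(wnd_end >= snd_nxt);	/* the sketch assumes a non-negative window */
	return window < avail_size ? (size_t)window : avail_size;
}
```

With a window of exactly 2^32 bytes and avail_size = 1000, the int variant no longer returns 1000 on common compilers because the cast discards the only set bit, while the unsigned variant still returns 1000.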
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
[ wrapped too long lines when declaring mptcp_check_allowed_size() ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-6-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The MPTCP Netlink specs describe the 'flags' as a u32 type. Internally,
a u8 type was used.
Using a u8 is currently fine, because only the first 5 bits are used.
But there is also no reason not to be aligned with the specs, or
to stick to a u8. Especially because there is a hole of 3 bytes after
it in both mptcp_pm_local and mptcp_pm_addr_entry structures.
Also, setting it to a u32 will allow future flags, just in case.
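The padding argument can be checked with a small sketch. Both structures below are hypothetical and only mimic a u8 followed by a 4-byte-aligned member; they are not the real mptcp_pm_local or mptcp_pm_addr_entry layouts, and the result assumes a common ABI where int32_t is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a u8 flags field followed by a 4-byte-aligned member. */
struct entry_u8_flags {
	uint32_t id;
	uint8_t  flags;		/* 3 bytes of padding follow on common ABIs */
	int32_t  ifindex;
};

/* Same layout with flags widened to u32: the padding is simply reused. */
struct entry_u32_flags {
	uint32_t id;
	uint32_t flags;
	int32_t  ifindex;
};

static_assert(sizeof(struct entry_u8_flags) == sizeof(struct entry_u32_flags),
	      "widening flags must not grow the structure");

/* Number of padding bytes between flags and the next member. */
static size_t flags_hole_bytes(void)
{
	return offsetof(struct entry_u8_flags, ifindex) -
	       offsetof(struct entry_u8_flags, flags) - sizeof(uint8_t);
}
```

Under these assumptions flags_hole_bytes() reports 3, and the static assertion confirms that widening the field costs no memory.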
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-5-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Similar to tcp, provide a new tracepoint to better understand
mptcp_rcv_space_adjust() behavior, which presents many artifacts.
Note that the format string used is so long that I preferred
to wrap it, contrary to the guidance for quoted strings.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-4-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
MPTCP uses several calls of the mptcp_rcv_space_init() helper to
initialize the receive space, with a catch-up call in
mptcp_rcv_space_adjust().
Drop all the other invocations that are not strictly needed and move the
initialization of constant fields to socket init/reset time.
This removes a bit of complexity from mptcp DRS code. No functional
changes intended.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-3-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
MPTCP initializes the receive buffer stamp in mptcp_rcv_space_init(),
using the provided subflow stamp. This helper is invoked in several
places; for passive sockets, space init happens at clone time.
In that scenario, MPTCP ends up accessing the subflow stamp before
its initialization, leading to quite random timing for the first
receive buffer auto-tune event, as the timestamp for the newly created
subflow is not refreshed there.
Fix the issue by moving the stamp initialization out of the mentioned
helper, to the data transfer start, and always using a fresh timestamp.
Fixes: 013e3179dbd2 ("mptcp: fix rcv space initialization")
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-2-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
MPTCP-level OoOs are physiological when multiple subflows are active
concurrently; they neither cause retransmissions nor are caused by
drops.
Accounting for them in mptcp_rcvbuf_grow() causes the rcvbuf to slowly
drift towards tcp_rmem[2].
Remove such accounting. Note that subflows will still account for TCP-level
OoO when the MPTCP-level rcvbuf is propagated.
This also closes a subtle and very unlikely race condition with rcvspace
init: active sockets with user-space holding the msk-level socket lock
could complete such initialization in the receive callback, after the
first OoO data has reached the rcvbuf, potentially triggering a divide by
zero Oops.
Fixes: e118cdc34dd1 ("mptcp: rcvbuf auto-tuning improvement")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-1-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The gcc-16.0.1 snapshot produces a false-positive warning that turns
into a build failure with CONFIG_WERROR:
In file included from arch/x86/include/asm/string.h:6,
from net/vmw_vsock/vmci_transport.c:10:
In function 'vmci_transport_packet_init',
inlined from '__vmci_transport_send_control_pkt.constprop' at net/vmw_vsock/vmci_transport.c:198:2:
arch/x86/include/asm/string_32.h:150:25: error: argument 2 null where non-null expected because argument 3 is nonzero [-Werror=nonnull]
150 | #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy'
164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
| ^~~~~~
arch/x86/include/asm/string_32.h:150:25: note: in a call to built-in function '__builtin_memcpy'
net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy'
164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
| ^~~~~~
This seems relatively harmless, and it is so far the only instance of this
warning I have found. The __vmci_transport_send_control_pkt function
is called either with wait=NULL or with one of the type values that
pass 'wait' into memcpy() here, but not from the same caller.
Replacing the memcpy with a struct assignment is otherwise the same
but avoids the warning.
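A minimal sketch of the replacement, assuming simplified stand-in types: wait_info_sketch and packet_sketch are hypothetical and much smaller than the real vmci_transport structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the packet and its 'wait' payload. */
struct wait_info_sketch {
	uint64_t generation;
	uint64_t offset;
};

struct packet_sketch {
	int type;
	union {
		uint64_t size;
		struct wait_info_sketch wait;
	} u;
};

/*
 * A struct assignment copies the same bytes as
 * memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait)), but there is no
 * memcpy() call whose pointer argument the compiler can flag as NULL.
 */
static void packet_set_wait(struct packet_sketch *pkt,
			    const struct wait_info_sketch *wait)
{
	assert(pkt != NULL && wait != NULL);
	pkt->u.wait = *wait;
}
```

The assignment is also type-checked, so a mismatch between the source and destination types would be caught at compile time rather than hidden behind a byte count.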
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Bryan Tan <bryan-bt.tan@broadcom.com>
Link: https://patch.msgid.link/20260203163406.2636463-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
AF_UNIX does not use TCP_NEW_SYN_RECV nor TCP_TIME_WAIT and
checking sk->sk_family is sufficient.
Let's remove sk_fullsock() and use sk_is_unix() in
bpf_skc_to_unix_sock().
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260203213442.682838-3-kuniyu@google.com
|
|
The udp GRO complete stage assumes that all the packets inserted in the RX
path have the `encapsulation` flag zeroed. Such assumption is not true: a
few H/W NICs can set such flag when H/W offloading the checksum for
UDP encapsulated traffic, the tun driver can inject GSO packets with
UDP encapsulation, and the problematic layout can also be created via
a veth based setup.
Due to the above, in the problematic scenarios, udp4_gro_complete() uses
the wrong network offset (inner instead of outer) to compute the outer
UDP header pseudo checksum, leading to csum validation errors later on
in packet processing.
Address the issue by always clearing the encapsulation flag at GRO
completion time. Such flag will be set again as needed for encapsulated
packets by udp_gro_complete().
Fixes: 5ef31ea5d053 ("net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb")
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/562638dbebb3b15424220e26a180274b387e2a88.1770032084.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Yin Fengwei reported an RCU stall in ptype_seq_show() and provided
a patch.
The real issue is that ptype_seq_next() and ptype_seq_show() violate
RCU rules.
ptype_seq_show() runs under rcu_read_lock(), and reads pt->dev
to get device name without any barrier.
At the same time, concurrent writers can remove a packet_type structure
(which is correctly freed after an RCU grace period) and clear pt->dev
without an RCU grace period.
Define ptype_iter_state to carry a dev pointer along seq_net_private:
struct ptype_iter_state {
struct seq_net_private p;
struct net_device *dev; // added in this patch
};
We need to record the device pointer in ptype_get_idx() and
ptype_seq_next() so that ptype_seq_show() is safe against
concurrent pt->dev changes.
We also need to add full RCU protection in ptype_seq_next().
(Missing READ_ONCE() when reading list.next values)
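The snapshot idea can be sketched in userspace as follows. This is a single-threaded illustration under stated assumptions: dev_sketch, ptype_sketch and iter_sketch are hypothetical types, and a relaxed C11 atomic load stands in for READ_ONCE(). It does not model RCU grace periods or device lifetime.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct dev_sketch {
	const char *name;
};

/* Hypothetical packet type whose dev pointer a writer may clear at any time. */
struct ptype_sketch {
	_Atomic(struct dev_sketch *) dev;
};

/* Iterator state: carries the dev pointer observed when the entry was found. */
struct iter_sketch {
	struct dev_sketch *dev;
};

/* Read the pointer exactly once and keep the snapshot in the iterator. */
static void iter_record(struct iter_sketch *it, struct ptype_sketch *pt)
{
	assert(it != NULL && pt != NULL);
	it->dev = atomic_load_explicit(&pt->dev, memory_order_relaxed);
}

/* The show step uses only the snapshot and never re-reads pt->dev. */
static const char *iter_show(const struct iter_sketch *it)
{
	return it->dev ? it->dev->name : "";
}
```

Because the show step relies on the recorded pointer, a later clearing of the shared field cannot turn into a NULL dereference between a check and a use.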
Many thanks to Dong Chenchen for providing a repro.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1d10f8a1f40b ("net-procfs: show net devices bound packet types")
Fixes: c353e8983e0d ("net: introduce per netns packet chains")
Reported-by: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Reported-by: Dong Chenchen <dongchenchen2@huawei.com>
Closes: https://lore.kernel.org/netdev/CANn89iKRRKPnWjJmb-_3a=sq+9h6DvTQM4DBZHT5ZRGPMzQaiA@mail.gmail.com/T/#m7b80b9fc9b9267f90e0b7aad557595f686f9c50d
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Tested-by: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Link: https://patch.msgid.link/20260202205217.2881198-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replace sprintf with sysfs_emit in sysfs show() methods as outlined in
Documentation/filesystems/sysfs.rst.
sysfs_emit is preferred to sprintf in sysfs show() methods as it is safer
with buffer handling.
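As a rough userspace analogy of why a bounded emit helper is safer, consider the sketch below. emit_bounded() is a hypothetical function written for this example; it is not sysfs_emit(), which is kernel-only and bounds its output to the page-sized sysfs buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format into buf without ever writing past 'size' bytes, and return the
 * number of characters actually stored (excluding the trailing NUL).
 * sprintf() has no such bound and returns the untruncated length.
 */
static int emit_bounded(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int len;

	assert(fmt != NULL);
	va_start(args, fmt);
	len = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;			/* encoding error: nothing usable */
	if ((size_t)len >= size)
		return size ? (int)(size - 1) : 0;	/* output was truncated */
	return len;
}
```

For an 8-byte buffer, emit_bounded(buf, sizeof(buf), "%d\n", 42) stores "42\n" and returns 3, whereas a 10-character string is cut to 7 characters plus the terminator instead of overflowing.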
Signed-off-by: David Corvaglia <david@corvaglia.dev>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/0100019c1fc2bcc3-bc9ca2f1-22d7-4250-8441-91e4af57117b-000000@email.amazonses.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cleanup and preparation to simplify planned future changes.
Link: https://lkml.kernel.org/r/aXY_4NSP094-Cf-2@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
RPS cpumask can be overridden through sysfs/sysctl. The boot defined
isolated CPUs are then excluded from that cpumask.
However HK_TYPE_DOMAIN will soon integrate cpuset isolated
CPUs updates and the RPS infrastructure needs more thought to be able
to propagate such changes and synchronize against them.
Keep handling only what was passed through "isolcpus=" for now.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: netdev@vger.kernel.org
|
|
Enable Accurate ECN negotiation and requests for incoming and
outgoing connections by setting sysctl_tcp_ecn:
+==============+===========================================+
|              |  Highest ECN variant (Accurate ECN, ECN,  |
|   tcp_ecn    |  or no ECN) to be negotiated & requested  |
|              +---------------------+---------------------+
|              | Incoming connection | Outgoing connection |
+==============+=====================+=====================+
|      0       |        No ECN       |        No ECN       |
|      1       |         ECN         |         ECN         |
|      2       |         ECN         |        No ECN       |
+--------------+---------------------+---------------------+
|      3       |     Accurate ECN    |     Accurate ECN    |
|      4       |     Accurate ECN    |         ECN         |
|      5       |     Accurate ECN    |        No ECN       |
+==============+=====================+=====================+
Refer to Documentation/networking/ip-sysctl.rst for more details.
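The table can be read as two small lookup functions. The sketch below only restates the table; the enum and function names are hypothetical and do not correspond to kernel identifiers.

```c
#include <assert.h>

/* Highest ECN variant to negotiate or request, as listed in the table. */
enum ecn_variant_sketch {
	SKETCH_NO_ECN,
	SKETCH_ECN,
	SKETCH_ACCURATE_ECN,
};

/* Incoming connections: 0 disables, 1-2 allow ECN, 3-5 allow Accurate ECN. */
static enum ecn_variant_sketch ecn_incoming(int tcp_ecn)
{
	assert(tcp_ecn >= 0 && tcp_ecn <= 5);
	if (tcp_ecn == 0)
		return SKETCH_NO_ECN;
	return tcp_ecn >= 3 ? SKETCH_ACCURATE_ECN : SKETCH_ECN;
}

/* Outgoing connections: only 1 and 4 request ECN, only 3 requests AccECN. */
static enum ecn_variant_sketch ecn_outgoing(int tcp_ecn)
{
	assert(tcp_ecn >= 0 && tcp_ecn <= 5);
	switch (tcp_ecn) {
	case 1:
	case 4:
		return SKETCH_ECN;
	case 3:
		return SKETCH_ACCURATE_ECN;
	default:	/* 0, 2 and 5 */
		return SKETCH_NO_ECN;
	}
}
```

For example, tcp_ecn = 4 maps to Accurate ECN for incoming connections and classic ECN for outgoing ones.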
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-15-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add 2-bit tcpi_ecn_mode field within tcp_info to indicate which ECN
mode is negotiated: ECN_MODE_DISABLED, ECN_MODE_RFC3168, ECN_MODE_ACCECN,
or ECN_MODE_PENDING. This is done by utilizing available bits from
tcpi_accecn_opt_seen (reduced from 16 bits to 2 bits) and
tcpi_accecn_fail_mode (reduced from 16 bits to 4 bits).
Also, an extra 24-bit tcpi_options2 field is identified to represent
newer options and connection features, as all 8 bits of tcpi_options
field have been used.
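The bit budget works out as 2 + 2 + 4 + 24 = 32 bits, the same space as the two former 16-bit fields. The sketch below illustrates that arithmetic only: the field order and the numeric values of the modes are assumptions, the names are hypothetical, and the layout relies on the usual GCC/Clang bit-field packing with a 32-bit unsigned int.

```c
#include <assert.h>

/* Assumed numbering, following the order in which the modes are listed. */
enum ecn_mode_sketch {
	SKETCH_ECN_MODE_DISABLED = 0,
	SKETCH_ECN_MODE_RFC3168  = 1,
	SKETCH_ECN_MODE_ACCECN   = 2,
	SKETCH_ECN_MODE_PENDING  = 3,
};

/* Hypothetical repacking of two former 16-bit fields into 32 bits. */
struct tcp_info_tail_sketch {
	unsigned int ecn_mode:2;		/* new: negotiated ECN mode */
	unsigned int accecn_opt_seen:2;		/* was 16 bits */
	unsigned int accecn_fail_mode:4;	/* was 16 bits */
	unsigned int options2:24;		/* new: room for future options */
};

static_assert(sizeof(struct tcp_info_tail_sketch) == 4,
	      "2 + 2 + 4 + 24 bits must fit the original 32 bits");
```

A 2-bit field is exactly wide enough for the four listed modes, so SKETCH_ECN_MODE_PENDING (3) is the largest value it can hold.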
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-14-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Detect spurious retransmission of a previously sent ACK carrying the
AccECN option after the second retransmission. Since this might be caused
by the middlebox dropping ACK with options it does not recognize, disable
the sending of the AccECN option in all subsequent ACKs. This patch
follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field
(accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was
sent with duplicate SACK info.
Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl:
(TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and
persistently sends AccECN option once it fits into TCP option space.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-13-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Based on Section 3.1.5 of AccECN spec (RFC9768), a TCP Server in
AccECN mode MUST NOT set ECT on any packet for the rest of the connection,
if it has received or sent at least one valid SYN or Acceptable SYN/ACK
with (AE,CWR,ECE) = (0,0,0) during the handshake.
In addition, a host in AccECN mode that is feeding back the IP-ECN
field on a SYN or SYN/ACK MUST feed back the IP-ECN field on the
latest valid SYN or acceptable SYN/ACK to arrive.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-11-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
For Accurate ECN, the first SYN/ACK sent by the TCP server shall set
the ACE flag (Table 1 of RFC9768) and the AccECN option to complete the
capability negotiation. However, if the TCP server needs to retransmit
such a SYN/ACK (for example, because it did not receive an ACK
acknowledging its SYN/ACK, or received a second SYN requesting AccECN
support), the TCP server retransmits the SYN/ACK without the AccECN
option. This is because the SYN/ACK may be lost due to congestion, or a
middlebox may block the AccECN option. Furthermore, if this retransmission
also times out, to expedite connection establishment, the TCP server
should retransmit the SYN/ACK with (AE,CWR,ECE) = (0,0,0) and without the
AccECN option, while maintaining AccECN feedback mode.
This complies with Section 3.2.3.2.2 of the AccECN spec RFC9768.
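One way to read the retransmission policy described above is as a function of the SYN/ACK attempt number. The sketch below is an interpretation of this commit message only; synack_plan_sketch and plan_synack() are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* What a given SYN/ACK transmission carries. */
struct synack_plan_sketch {
	bool ace_flags;		/* false means (AE,CWR,ECE) = (0,0,0) */
	bool accecn_option;	/* true means the AccECN option is included */
};

/*
 * attempt 0:  first SYN/ACK, ACE flags plus AccECN option.
 * attempt 1:  first retransmission, ACE flags kept, option dropped.
 * attempt 2+: the first retransmission also timed out, so both are dropped.
 * AccECN feedback mode itself is maintained throughout; that state is
 * deliberately not modeled here.
 */
static struct synack_plan_sketch plan_synack(int attempt)
{
	struct synack_plan_sketch plan;

	assert(attempt >= 0);
	plan.ace_flags = attempt < 2;
	plan.accecn_option = attempt == 0;
	return plan;
}
```

Each step removes one possible cause of loss: first the option a middlebox might block, then the flag combination itself.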
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-10-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|