| Age | Commit message (Collapse) | Author |
|
Similarly to the issue from the previous patch, neigh_timer_handler() also
updates the neighbor separately from formatting and sending the netlink
notification message. We have not seen reports to the effect of this
causing trouble, but in theory, the same sort of issues could have come up:
neigh_timer_handler() would make changes as necessary, but before
formatting and sending a notification, is interrupted before sending by
another thread, which makes a parallel change and sends its own message.
The message send that is prompted by an earlier change thus contains
information that does not reflect the change having been made.
To solve this, the netlink notification needs to be in the same critical
section that updates the neighbor. The critical section is ended by the
neigh_probe() call which drops the lock before calling solicit. Stretching
the critical section over the solicit call is problematic, because that can
then involved all sorts of forwarding callbacks. Therefore, like in the
previous patch, split the netlink notification away from the internal one
and move it ahead of the probe call.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/e440118511cbdbe1d88eb0d71c9047116feb96e0.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As noted in a previous patch, one race remains in the current code. A
kernel thread might interrupt a userspace thread after the change is done,
but before formatting and sending the message. Then what we would see is
two messages with the same contents:
userspace thread kernel thread
================ =============
neigh_update
write_lock_bh(n->lock)
n->nud_state = STALE
write_unlock_bh(n->lock)
-------------------------->
neigh:update
write_lock_bh(n->lock)
n->nud_state = REACHABLE
write_unlock_bh(n->lock)
neigh_notify
read_lock_bh(n->lock)
__neigh_fill_info
ndm->nud_state = REACHABLE
rtnl_notify
read_unlock_bh(n->lock)
RTNL REACHABLE sent
<--------
neigh_notify
read_lock_bh(n->lock)
__neigh_fill_info
ndm->nud_state = REACHABLE
rtnl_notify
read_unlock_bh(n->lock)
RTNL REACHABLE sent again
The solution is to send the netlink message inside the critical section
where the neighbor is changed, so that it reflects the notified-upon
neighbor state.
To that end, in __neigh_update(), move the current neigh_notify() call up
to said critical section, and convert it to __neigh_notify(), because the
lock is held. This motion crosses calls to neigh_update_managed_list(),
neigh_update_gc_list() and neigh_update_process_arp_queue(), all of which
potentially unlock and give an opportunity for the above race.
This also crosses a call to neigh_update_process_arp_queue() which calls
neigh->output(), which might be neigh_resolve_output() calls
neigh_event_send() calls neigh_event_send_probe() calls
__neigh_event_send() calls neigh_probe(), which touches neigh->probes,
an update which will now not be visible in the notification.
However, there is indication that there is no promise that these changes
will be accurately projected to notifications: fib6_table_lookup()
indirectly calls route.c's find_match() calls rt6_probe(), which looks up a
neighbor and call __neigh_set_probe_once(), which sets neigh->probes to 0,
but neither this nor the caller seems to send a notification.
Additionally, the neighbor object that the neigh_probe() mentioned above is
called on, might be the alternative neighbor looked up for the ARP queue
packet destination. If that is the case, the changed value of n1->probes is
not notified anywhere.
So at least in some circumstances, the reported number of probes needs to
be assumed to change without notification.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ceb44995498eb52375cb2d46c3245bdb9e74b355.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The netlink message needs to be send inside the critical section where the
neighbor is changed, so that it reflects the notified-upon neighbor state.
On the other hand, there is no such need in case of notifier chain: the
listeners do not assume lock, and often in fact just schedule a delayed
work to act on the neighbor later. At least one in fact also takes the
neighbor lock.
This requires that the netlink notification be done before the internal
notifier chain message is sent. That is safe to do, because the current
listeners, as well as __neigh_notify(), only read the updated neighbor
fields, and never modify them. (Apart from locking.)
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/f3ef74d5460f14c4d102b8a5857d4a6624da9a5a.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The obvious idea behind the helper is to keep together the two bits that
should be done either both or neither: the internal notifier chain message,
and the netlink notification.
To make sure that the notification sent reflects the change being made, the
netlink message needs to be send inside the critical section where the
neighbor is changed. But for the notifier chain, there is no such need: the
listeners do not assume lock, and often in fact just schedule a delayed
work to act on the neighbor later. At least one in fact also takes the
neighbor lock. Therefore these two items have each different locking needs.
Now we could unlock inside the helper, but I find that error prone, and the
fact that the notification is conditional in the first place does not help
to make the call site obvious.
So in this patch, the helper is instead removed and the body, which is just
these two calls, inlined. That way we can use each notifier independently.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/e65dce5882bc6f4aa2530b8a4877d0e003071a1a.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ARP queue processing unlocks the neighbor lock, which can allow another
thread to asynchronously perform a neighbor update and send an out of order
notification. Therefore this needs to be done after the notification is
sent.
Move it just before the end of the critical section. Since
neigh_update_process_arp_queue() unlocks, it does not form a part of the
critical section anymore but it can benefit from the lock being taken. The
intention is to eventually do the RTNL notification before this call.
This motion crosses a call to neigh_update_is_router(), which should not
influence processing of the ARP queue.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/9ea7159e71430ebdc837ebcc880a76b7e82e52a4.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In order to make manipulation with this bit of code clearer, extract it
to a helper function, neigh_update_process_arp_queue().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/8b0fa0abe2cf0e24484903f5436fe0ac64163057.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Andy Roulin has described an issue with the current neighbor notification
scheme as follows. This was also presented publicly at the link below.
neigh_update sends a rtnl notification if an update, e.g.,
nud_state change, was done but there is no guarantee of
ordering of the rtnl notifications. Consider the following
scenario:
userspace thread kernel thread
================ =============
neigh_update
write_lock_bh(n->lock)
n->nud_state = STALE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = STALE
read_unlock_bh(n->lock)
-------------------------->
neigh:update
write_lock_bh(n->lock)
n->nud_state = REACHABLE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = REACHABLE
read_unlock_bh(n->lock)
rtnl_nofify
RTNL REACHABLE sent
<--------
rtnl_notify
RTNL STALE sent
In this scenario, the kernel neigh is updated first to STALE and
then REACHABLE but the netlink notifications are sent out of order,
first REACHABLE and then STALE.
The solution is to send the netlink message inside the same critical
section that formats the message. That way both the contents and ordering
of the message reflect the same state, and we cannot see the abovementioned
out-of-order delivery.
Even with this patch, a remaining issue that the contents of the message
may not reflect the changes made to the neighbor. A kernel thread might
still interrupt a userspace thread after the change is done, but before
formatting and sending the message. Then what we would see is two messages
with the same contents. The following patches will attempt to address that
issue.
To support those future patches, convert __neigh_notify() to a helper that
assumes that the neighbor lock is already taken by having it call
__neigh_fill_info() instead of neigh_fill_info(). Add a new helper,
neigh_notify(), which takes the lock before calling __neigh_notify().
Migrate all callers to use the latter.
Link: https://lore.kernel.org/netdev/ed6768c1-80b8-aee2-e545-b51661d49336@nvidia.com/
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/4b4368dcc5f5a7e407009cb6c36b69cfb5282864.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The netlink message needs to be formatted and sent inside the critical
section where the neighbor is changed, so that it reflects the
notified-upon neighbor state. Because it will happen inside an already
existing critical section, it has to assume that the neighbor lock is held.
Add a helper __neigh_fill_info(), which is like neigh_fill_info(), but
makes this assumption. Convert neigh_fill_info() to a wrapper around this
new helper.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/7ec20113d5d809200e3534d3ed8f0004514914b8.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
idev->mr_maxdelay is read and written locklessly,
add READ_ONCE()/WRITE_ONCE() annotations.
While we are at it, make this field an u32.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260122172247.2429403-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
NETDEV_RSS_KEY_LEN has been set to 52 bytes in 2014, until now.
Jakub suggested we bump the size to 128 bytes or more.
Some drivers (like idpf) were already working around the core limit.
Since this change might cause some issues in admin scripts,
bump it directly to 256 in one go.
tjbp26:~# cat /proc/sys/net/core/netdev_rss_key | wc -c
768
tjbp26:~# ethtool -x eth1
RX flow hash indirection table for eth1 with 32 RX ring(s):
...
RSS hash key:
fe:16:5b:2f:93:85:c2:c9:c1:ef:bd:60:c6:e0:2b:99:4d:bf:b7:14:c8:1e:8d:cb:31:17:51:da:55:eb:91:d9:9e:f9:89:9b:44:a1:dc:08:72:3a:b3:d6:31:86:9a:fe:02:3a:0d:eb:a1:7c:f5:a3:51:3b:08:56:c9:3f:71:69:01:ba:70:38
RSS hash function:
toeplitz: on
xor: off
crc32: off
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20260122075206.504ec591@kernel.org/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260122190349.2771064-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
These helpers are used in network fast paths.
Only call out-of-line helpers for netmem case.
We might consider inlining __get_netmem() and __put_netmem()
in the future.
$ scripts/bloat-o-meter -t vmlinux.3 vmlinux.4
add/remove: 6/6 grow/shrink: 22/1 up/down: 2614/-646 (1968)
Function old new delta
pskb_carve 1669 1894 +225
gro_pull_from_frag0 - 206 +206
get_page 190 380 +190
skb_segment 3561 3747 +186
put_page 595 765 +170
skb_copy_ubufs 1683 1822 +139
__pskb_trim_head 276 401 +125
__pskb_copy_fclone 734 858 +124
skb_zerocopy 1092 1215 +123
pskb_expand_head 892 1008 +116
skb_split 828 940 +112
skb_release_data 297 409 +112
___pskb_trim 829 941 +112
__skb_zcopy_downgrade_managed 120 226 +106
tcp_clone_payload 530 634 +104
esp_ssg_unref 191 294 +103
dev_gro_receive 1464 1514 +50
__put_netmem - 41 +41
__get_netmem - 41 +41
skb_shift 1139 1175 +36
skb_try_coalesce 681 714 +33
__pfx_put_page 112 144 +32
__pfx_get_page 32 64 +32
__pskb_pull_tail 1137 1168 +31
veth_xdp_get 250 267 +17
__pfx_gro_pull_from_frag0 - 16 +16
__pfx___put_netmem - 16 +16
__pfx___get_netmem - 16 +16
__pfx_put_netmem 16 - -16
__pfx_gro_try_pull_from_frag0 16 - -16
__pfx_get_netmem 16 - -16
put_netmem 114 - -114
get_netmem 130 - -130
napi_gro_frags 929 771 -158
gro_try_pull_from_frag0 196 - -196
Total: Before=22565857, After=22567825, chg +0.01%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
1) Inline this small helper to reduce code size and decrease cpu costs.
2) Constify its argument.
3) Move it to include/net/netmem.h, as a prereq for the following patch.
$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/4 up/down: 0/-158 (-158)
Function old new delta
validate_xmit_skb 866 857 -9
__pfx_net_is_devmem_iov 16 - -16
net_is_devmem_iov 22 - -22
get_netmem 152 130 -22
put_netmem 140 114 -26
tcp_recvmsg_locked 3860 3797 -63
Total: Before=22566015, After=22565857, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replace the BUG_ON() which never fired with a DEBUG_NET_WARN_ON_ONCE()
$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 2/2 grow/shrink: 1/1 up/down: 370/-254 (116)
Function old new delta
gro_try_pull_from_frag0 - 196 +196
napi_gro_frags 771 929 +158
__pfx_gro_try_pull_from_frag0 - 16 +16
__pfx_gro_pull_from_frag0 16 - -16
dev_gro_receive 1514 1464 -50
gro_pull_from_frag0 188 - -188
Total: Before=22565899, After=22566015, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When replying to a ICMPv6 echo request that comes from localhost address
the right output ifindex is 1 (lo) and not rt6i_idev dev index. Use the
skb device ifindex instead. This fixes pinging to a local address from
localhost source address.
$ ping6 -I ::1 2001:1:1::2 -c 3
PING 2001:1:1::2 (2001:1:1::2) from ::1 : 56 data bytes
64 bytes from 2001:1:1::2: icmp_seq=1 ttl=64 time=0.037 ms
64 bytes from 2001:1:1::2: icmp_seq=2 ttl=64 time=0.069 ms
64 bytes from 2001:1:1::2: icmp_seq=3 ttl=64 time=0.122 ms
2001:1:1::2 ping statistics
3 packets transmitted, 3 received, 0% packet loss, time 2032ms
rtt min/avg/max/mdev = 0.037/0.076/0.122/0.035 ms
Fixes: 1b70d792cf67 ("ipv6: Use rt6i_idev index for echo replies to a local address")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260121194409.6749-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.
Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The fsession is something that similar to kprobe session. It allow to
attach a single BPF program to both the entry and the exit of the target
functions.
Introduce the struct bpf_fsession_link, which allows to add the link to
both the fentry and fexit progs_hlist of the trampoline.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Co-developed-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260124062008.8657-2-dongml2@chinatelecom.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
RDS/TCP differs from RDS/RDMA in that message acknowledgment
is done based on TCP sequence numbers:
As soon as the last byte of a message has been acknowledged
by the TCP stack of a peer, "rds_tcp_write_space()" goes on
to discard prior messages from the send queue.
Which is fine, for as long as the receiver never throws any messages away.
Unfortunately, that is *not* the case since the introduction of MPRDS:
commit 1a0e100fb2c96 "RDS: TCP: Enable multipath RDS for TCP"
A new function "rds_tcp_accept_one_path" was introduced,
which is entitled to return "NULL", if no connection path is currently
available.
Unfortunately, this happens after the "->accept()" call, and the new socket
often already contains messages, since the peer already transitioned
to "RDS_CONN_UP" on behalf of "TCP_ESTABLISHED".
That's also the case after this [1]:
commit 1a0e100fb2c96 "RDS: TCP: Force every connection to be initiated by
numerically smaller IP address"
which tried to address the situation of pending data by only transitioning
connections from a smaller IP address to "RDS_CONN_UP".
But even in those cases, and in particular if the "RDS_EXTHDR_NPATHS"
handshake has not occurred yet, and therefore we're working with
"c_npaths <= 1", "c_conn[0]" may be in a state distinct from
"RDS_CONN_DOWN", and therefore all messages on the just accepted socket
will be tossed away.
This fix changes "rds_tcp_accept_one":
* If connected from a peer with a larger IP address, the new socket
will continue to get closed right away.
With commit [1] above, there should not be any messages
in the socket receive buffer, since the peer never transitioned
to "RDS_CONN_UP".
Therefore it should be okay to not make any efforts to dispatch
the socket receive buffer.
* If connected from a peer with a smaller IP address,
we call "rds_tcp_accept_one_path" to find a free slot/"path".
If found, business goes on as usual.
If none was found, we save/stash the newly accepted socket
into "rds_tcp_accepted_sock", in order to not lose any
messages that may have arrived already.
We then return from "rds_tcp_accept_one" with "-ENOBUFS".
Later on, when a slot/"path" does become available again
(e.g. state transitioned to "RDS_CONN_DOWN",
or HS extension header was received with "c_npaths > 1")
we call "rds_tcp_conn_slots_available" that simply re-issues
a "rds_tcp_accept_one_path" worker-callback and picks
up the new socket from "rds_tcp_accepted_sock", and thereby
continuing where it left with "-ENOBUFS" last time.
Since a new slot has become available, those messages
won't be lost, since processing proceeds as if that slot
had been available the first time around.
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260122055213.83608-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
RDS connections carry a state "rds_conn_path::cp_state"
and transitions from one state to another and are conditional
upon an expected state: "rds_conn_path_transition."
There is one exception to this conditionality, which is
"RDS_CONN_ERROR" that can be enforced by "rds_conn_path_drop"
regardless of what state the condition is currently in.
But as soon as a connection enters state "RDS_CONN_ERROR",
the connection handling code expects it to go through the
shutdown-path.
The RDS/TCP multipath changes added a shortcut out of
"RDS_CONN_ERROR" straight back to "RDS_CONN_CONNECTING"
via "rds_tcp_accept_one_path" (e.g. after "rds_tcp_state_change").
A subsequent "rds_tcp_reset_callbacks" can then transition
the state to "RDS_CONN_RESETTING" with a shutdown-worker queued.
That'll trip up "rds_conn_init_shutdown", which was
never adjusted to handle "RDS_CONN_RESETTING" and subsequently
drops the connection with the dreaded "DR_INV_CONN_STATE",
which leaves "RDS_SHUTDOWN_WORK_QUEUED" on forever.
So we do two things here:
a) Don't shortcut "RDS_CONN_ERROR", but take the longer
path through the shutdown code.
b) Add "RDS_CONN_RESETTING" to the expected states in
"rds_conn_init_shutdown" so that we won't error out
and get stuck, if we ever hit weird state transitions
like this again."
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260122055213.83608-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
I imagine (tm) that as the number of per-queue configuration
options grows some of them may conflict for certain drivers.
While the drivers can obviously do all the validation locally
doing so is fairly inconvenient as the config is fed to drivers
piecemeal via different ops (for different params and NIC-wide
vs per-queue).
Add a centralized callback for validating the queue config
in queue ops. The callback gets invoked before memory provider
is installed, and in the future should also be called when ring
params are modified.
The validation is done after each layer of configuration.
Since we can't fail MP un-binding we must make sure that
the config is valid both before and after MP overrides are
applied. This is moot for now since the set of MP and device
configs are disjoint. It will matter significantly in the future,
so adding it now so that we don't forget..
Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We should follow the prepare/commit approach for queue configuration.
The qcfg struct should be added to dev->cfg rather than directly to
queue objects so that we can clone and discard the pending config
easily.
Remove the qcfg in struct netdev_rx_queue, and switch remaining callers
to netdev_queue_config(). netdev_queue_config() will construct the qcfg
on the fly based on device defaults and state of the queue.
ndo_default_qcfg becomes optional because having the callback itself
does not have any meaningful semantics to us.
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260122005113.2476634-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move mp->rx_page_size validation where the rest of MP input
validation lives. No other caller is modifying mp params so
validation logic in queue restarts is out of place.
Link: https://patch.msgid.link/20260122005113.2476634-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We may choose to extend or reimplement the logic which renders
the per-queue config. The drivers should not poke directly into
the queue state. Add a helper for drivers to use when they want
to query the config for a specific queue.
Link: https://patch.msgid.link/20260122005113.2476634-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some/most devices implementing gso_partial need to disable the GSO partial
features when the IP ID can't be mangled; to that extend each of them
implements something alike the following[1]:
if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID))
features &= ~NETIF_F_TSO;
in the ndo_features_check() op, which leads to a bit of duplicate code.
Later patch in the series will implement GSO partial support for virtual
devices, and the current status quo will require more duplicate code and
a new indirect call in the TX path for them.
Introduce the mangleid_features mask, allowing the core to disable NIC
features based on/requiring MANGLEID, without any further intervention
from the driver.
The same functionality could be alternatively implemented adding a single
boolean flag to the struct net_device, but would require an additional
checks in ndo_features_check().
Also note that [1] is incorrect if the NIC additionally implements
NETIF_F_GSO_UDP_L4, mangleid_features transparently handle even such a
case.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/5a7cdaeea40b0a29b88e525b6c942d73ed3b8ce7.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Moving tcp_stream_memory_free() to tcp.c allows the compiler
to (auto)inline it from tcp_poll() and tcp_sendmsg_locked()
for better performance.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 2/0 up/down: 118/0 (118)
Function old new delta
tcp_poll 840 923 +83
tcp_sendmsg_locked 4217 4252 +35
Total: Before=22573095, After=22573213, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260122090228.1678207-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.19-rc7).
Conflicts:
drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
b35a6fd37a00 ("hinic3: Add adaptive IRQ coalescing with DIM")
fb2bb2a1ebf7 ("hinic3: Fix netif_queue_set_napi queue_index input parameter error")
https://lore.kernel.org/fc0a7fdf08789a52653e8ad05281a0a849e79206.1768915707.git.zhuyikai1@h-partners.com
drivers/net/wireless/ath/ath12k/mac.c
drivers/net/wireless/ath/ath12k/wifi7/hw.c
31707572108d ("wifi: ath12k: Fix wrong P2P device link id issue")
c26f294fef2a ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module")
https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au
Adjacent changes:
drivers/net/wireless/ath/ath12k/mac.c
8b8d6ee53dfd ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel")
914c890d3b90 ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_rate_check_app_limited() is used from tcp_sendmsg_locked()
fast path and from other callers.
Move it to tcp.c so that it can be inlined in tcp_sendmsg_locked().
Small increase of code, for better TCP performance.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/0 up/down: 87/0 (87)
Function old new delta
tcp_sendmsg_locked 4217 4304 +87
Total: Before=22566462, After=22566549, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260121095923.3134639-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This function is called from one caller only, in TCP fast path.
Move it to tcp_input.c so that compiler can inline it.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 226/-300 (-74)
Function old new delta
tcp_ack 5405 5631 +226
__pfx_tcp_rate_gen 16 - -16
tcp_rate_gen 284 - -284
Total: Before=22566536, After=22566462, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260121095923.3134639-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix memory leak in set_ssp_complete() where mgmt_pending_cmd structures
are not freed after being removed from the pending list.
Commit 302a1f674c00 ("Bluetooth: MGMT: Fix possible UAFs") replaced
mgmt_pending_foreach() calls with individual command handling but missed
adding mgmt_pending_free() calls in both error and success paths of
set_ssp_complete(). Other completion functions like set_le_complete()
were fixed correctly in the same commit.
This causes a memory leak of the mgmt_pending_cmd structure and its
associated parameter data for each SSP command that completes.
Add the missing mgmt_pending_free(cmd) calls in both code paths to fix
the memory leak. Also fix the same issue in set_advertising_complete().
Fixes: 302a1f674c00 ("Bluetooth: MGMT: Fix possible UAFs")
Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
After the conversion to binary search array, this is not required anymore.
Remove it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Rework .get interface to use the binary search array, this needs a specific
lookup function to match on end intervals (<=). Packet path lookup is slight
different because match is on lesser value, not equal (ie. <).
After this patch, seqcount can be removed in a follow up patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
The rbtree can temporarily store overlapping inactive elements during
the transaction processing, leading to false negative lookups.
To address this issue, this patch adds a .commit function that walks the
the rbtree to build a array of intervals of ordered elements. This
conversion compacts the two singleton elements that represent the start
and the end of the interval into a single interval object for space
efficient.
Binary search is O(log n), similar to rbtree lookup time, therefore,
performance number should be similar, and there is an implementation
available under lib/bsearch.c and include/linux/bsearch.h that is used
for this purpose.
This slightly increases memory consumption for this new array that
stores pointers to the start and the end of the interval.
With this patch:
# time nft -f 100k-intervals-set.nft
real 0m4.218s
user 0m3.544s
sys 0m0.400s
Without this patch:
# time nft -f 100k-intervals-set.nft
real 0m3.920s
user 0m3.547s
sys 0m0.276s
With this patch, with IPv4 intervals:
baseline rbtree (match on first field only): 15254954pps
Without this patch:
baseline rbtree (match on first field only): 10256119pps
This provides a ~50% improvement in matching intervals from packet path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
The pipapo set backend is the only user of the .abort interface so far.
To speed up pipapo abort path, removals are skipped.
The follow up patch updates the rbtree to use to build an array of
ordered elements, then use binary search. This needs a new .abort
interface but, unlike pipapo, it also need to undo/remove elements.
Add a flag and use it from the pipapo set backend.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
tcf_ife_encode() must make sure ife_encode() does not return NULL.
syzbot reported:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:ife_tlv_meta_encode+0x41/0xa0 net/ife/ife.c:166
CPU: 3 UID: 0 PID: 8990 Comm: syz.0.696 Not tainted syzkaller #0 PREEMPT(full)
Call Trace:
<TASK>
ife_encode_meta_u32+0x153/0x180 net/sched/act_ife.c:101
tcf_ife_encode net/sched/act_ife.c:841 [inline]
tcf_ife_act+0x1022/0x1de0 net/sched/act_ife.c:877
tc_act include/net/tc_wrapper.h:130 [inline]
tcf_action_exec+0x1c0/0xa20 net/sched/act_api.c:1152
tcf_exts_exec include/net/pkt_cls.h:349 [inline]
mall_classify+0x1a0/0x2a0 net/sched/cls_matchall.c:42
tc_classify include/net/tc_wrapper.h:197 [inline]
__tcf_classify net/sched/cls_api.c:1764 [inline]
tcf_classify+0x7f2/0x1380 net/sched/cls_api.c:1860
multiq_classify net/sched/sch_multiq.c:39 [inline]
multiq_enqueue+0xe0/0x510 net/sched/sch_multiq.c:66
dev_qdisc_enqueue+0x45/0x250 net/core/dev.c:4147
__dev_xmit_skb net/core/dev.c:4262 [inline]
__dev_queue_xmit+0x2998/0x46c0 net/core/dev.c:4798
Fixes: 295a6e06d21e ("net/sched: act_ife: Change to use ife module")
Reported-by: syzbot+5cf914f193dffde3bd3c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6970d61d.050a0220.706b.0010.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yotam Gigi <yotam.gi@gmail.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260121133724.3400020-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Another set of updates:
- various small fixes for ath10k/ath12k/mwifiex/rsi
- cfg80211 fix for HE bitrate overflow
- mac80211 fixes
- S1G beacon handling in scan
- skb tailroom handling for HW encryption
- CSA fix for multi-link
- handling of disabled links during association
* tag 'wireless-2026-11-22' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: cfg80211: ignore link disabled flag from userspace
wifi: mac80211: apply advertised TTLM from association response
wifi: mac80211: parse all TTLM entries
wifi: mac80211: don't increment crypto_tx_tailroom_needed_cnt twice
wifi: mac80211: don't perform DA check on S1G beacon
wifi: ath12k: Fix wrong P2P device link id issue
wifi: ath12k: fix dead lock while flushing management frames
wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel
wifi: ath12k: cancel scan only on active scan vdev
wifi: mwifiex: Fix a loop in mwifiex_update_ampdu_rxwinsize()
wifi: mac80211: correctly check if CSA is active
wifi: cfg80211: Fix bitrate calculation overflow for HE rates
wifi: rsi: Fix memory corruption due to not set vif driver data size
wifi: ath12k: don't force radio frequency check in freq_to_idx()
wifi: ath12k: fix dma_free_coherent() pointer
wifi: ath10k: fix dma_free_coherent() pointer
====================
Link: https://patch.msgid.link/20260122110248.15450-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The virtio transports derives its TX credit directly from peer_buf_alloc,
which is set from the remote endpoint's SO_VM_SOCKETS_BUFFER_SIZE value.
On the host side this means that the amount of data we are willing to
queue for a connection is scaled by a guest-chosen buffer size, rather
than the host's own vsock configuration. A malicious guest can advertise
a large buffer and read slowly, causing the host to allocate a
correspondingly large amount of sk_buff memory.
The same thing would happen in the guest with a malicious host, since
virtio transports share the same code base.
Introduce a small helper, virtio_transport_tx_buf_size(), that
returns min(peer_buf_alloc, buf_alloc), and use it wherever we consume
peer_buf_alloc.
This ensures the effective TX window is bounded by both the peer's
advertised buffer and our own buf_alloc (already clamped to
buffer_max_size via SO_VM_SOCKETS_BUFFER_MAX_SIZE), so a remote peer
cannot force the other to queue more data than allowed by its own
vsock settings.
On an unpatched Ubuntu 22.04 host (~64 GiB RAM), running a PoC with
32 guest vsock connections advertising 2 GiB each and reading slowly
drove Slab/SUnreclaim from ~0.5 GiB to ~57 GiB; the system only
recovered after killing the QEMU process. That said, if QEMU memory is
limited with cgroups, the maximum memory used will be limited.
With this patch applied:
Before:
MemFree: ~61.6 GiB
Slab: ~142 MiB
SUnreclaim: ~117 MiB
After 32 high-credit connections:
MemFree: ~61.5 GiB
Slab: ~178 MiB
SUnreclaim: ~152 MiB
Only ~35 MiB increase in Slab/SUnreclaim, no host OOM, and the guest
remains responsive.
Compatibility with non-virtio transports:
- VMCI uses the AF_VSOCK buffer knobs to size its queue pairs per
socket based on the local vsk->buffer_* values; the remote side
cannot enlarge those queues beyond what the local endpoint
configured.
- Hyper-V's vsock transport uses fixed-size VMBus ring buffers and
an MTU bound; there is no peer-controlled credit field comparable
to peer_buf_alloc, and the remote endpoint cannot drive in-flight
kernel memory above those ring sizes.
- The loopback path reuses virtio_transport_common.c, so it
naturally follows the same semantics as the virtio transport.
This change is limited to virtio_transport_common.c and thus affects
virtio-vsock, vhost-vsock, and loopback, bringing them in line with the
"remote window intersected with local policy" behaviour that VMCI and
Hyper-V already effectively have.
Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko")
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Melbin K Mathew <mlbnkm1@gmail.com>
[Stefano: small adjustments after changing the previous patch]
[Stefano: tweak the commit message]
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20260121093628.9941-4-sgarzare@redhat.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The credit calculation in virtio_transport_get_credit() uses unsigned
arithmetic:
ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);
If the peer shrinks its advertised buffer (peer_buf_alloc) while bytes
are in flight, the subtraction can underflow and produce a large
positive value, potentially allowing more data to be queued than the
peer can handle.
Reuse virtio_transport_has_space() which already handles this case and
add a comment to make it clear why we are doing that.
Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko")
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Melbin K Mathew <mlbnkm1@gmail.com>
[Stefano: use virtio_transport_has_space() instead of duplicating the code]
[Stefano: tweak the commit message]
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20260121093628.9941-2-sgarzare@redhat.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In ovs_vport_get_upcall_stats(), some statistics protected by
u64_stats_sync, are read and accumulated in ignorance of possible
u64_stats_fetch_retry() events. These statistics are already accumulated
by u64_stats_inc(). Fix this by reading them into temporary variables
first.
Fixes: 1933ea365aa7 ("net: openvswitch: Add support to count upcall packets")
Signed-off-by: David Yang <mmyangfl@gmail.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260121072932.2360971-1-mmyangfl@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
If skb_cow() is passed a headroom <= -NET_SKB_PAD, it will trigger a
BUG. As a result, use cases should avoid calling with a headroom that
is negative to prevent triggering this issue.
This is the same code pattern fixed in Commit 58fc7342b529 ("ipv6:
BUG() in pskb_expand_head() as part of calipso_skbuff_setattr()").
In cipso_v4_skbuff_setattr(), len_delta can become negative, leading to
a negative headroom passed to skb_cow(). However, the BUG is not
triggerable because the condition headroom <= -NET_SKB_PAD cannot be
satisfied due to limits on the IPv4 options header size.
Avoid potential problems in the future by only using skb_cow() to grow
the skb headroom.
Signed-off-by: Will Rosenberg <whrosenb@asu.edu>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20260120155738.982771-1-whrosenb@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
Subject: netfilter: updates for net-next
1) Speed up nftables transactions after earlier transaction failed.
Due to a (harmeless) bug we remained in slow paranoia mode until
a successful transaction completes.
2) Allow generic tracker to resolve clashes, this avoids very rare
packet drops. From Yuto Hamaguchi.
3) Increase the cleanup budget to 64 entries in nf_conncount to reap
more entries in one go, from Fernando Fernandez Mancera.
4) Allow icmp trackers to resolve clashes, this avoids very rare
initial packet drop with test cases that have high-frequency pings.
After this all trackers except tcp and sctp allow clash resolution.
5) Disentangle netfilter headers, don't include nftables/xtables headers
in subsystems that are unrelated.
6) Don't rely on implicit includes coming from nf_conntrack_proto_gre.h.
7) Allow nfnetlink_queue nfq instance struct to get accounted via memcg,
from Scott Mitchell.
8) Reject bogus xt target/match data upfront via netlink policiy in
nft_compat interface rather than relying on x_tables API to do it.
9) Fix nf_conncount breakage when trying to limit loopback flows via
prerouting rule, from Fernando Fernandez Mancera.
This is a recent breakage but not seen as urgent enough to rush this
via net tree at this late stage in development cycle.
10) Fix a possible off-by-one when parsing tcp option in xtables tcpmss
match. Also handled via -next due to late stage in development
cycle.
* tag 'nf-next-26-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: xt_tcpmss: check remaining length before reading optlen
netfilter: nf_conncount: fix tracking of connections from localhost
netfilter: nft_compat: add more restrictions on netlink attributes
netfilter: nfnetlink_queue: nfqnl_instance GFP_ATOMIC -> GFP_KERNEL_ACCOUNT allocation
netfilter: nf_conntrack: don't rely on implicit includes
netfilter: don't include xt and nftables.h in unrelated subsystems
netfilter: nf_conntrack: enable icmp clash support
netfilter: nf_conncount: increase the connection clean up limit to 64
netfilter: nf_conntrack: Add allow_clash to generic protocol handler
netfilter: nf_tables: reset table validation state on abort
====================
Link: https://patch.msgid.link/20260120191803.22208-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix the following:
BUG: KCSAN: data-race in rxrpc_peer_keepalive_worker / rxrpc_send_data_packet
which is reporting an issue with the reads and writes to ->last_tx_at in:
conn->peer->last_tx_at = ktime_get_seconds();
and:
keepalive_at = peer->last_tx_at + RXRPC_KEEPALIVE_TIME;
The lockless accesses to these to values aren't actually a problem as the
read only needs an approximate time of last transmission for the purposes
of deciding whether or not the transmission of a keepalive packet is
warranted yet.
Also, as ->last_tx_at is a 64-bit value, tearing can occur on a 32-bit
arch.
Fix both of these by switching to an unsigned int for ->last_tx_at and only
storing the LSW of the time64_t. It can then be reconstructed at need
provided no more than 68 years has elapsed since the last transmission.
Fixes: ace45bec6d77 ("rxrpc: Fix firewall route keepalive")
Reported-by: syzbot+6182afad5045e6703b3d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/695e7cfb.050a0220.1c677c.036b.GAE@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/1107124.1768903985@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Prior to the blamed commit, the bridge_num range was from
0 to ds->max_num_bridges - 1. After the commit, it is from
1 to ds->max_num_bridges.
So this check:
if (bridge_num >= max)
return 0;
must be updated to:
if (bridge_num > max)
return 0;
in order to allow the last bridge_num value (==max) to be used.
This is easiest visible when a driver sets ds->max_num_bridges=1.
The observed behaviour is that even the first created bridge triggers
the netlink extack "Range of offloadable bridges exceeded" warning, and
is handled in software rather than being offloaded.
Fixes: 3f9bb0301d50 ("net: dsa: make dp->bridge_num one-based")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260120211039.3228999-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove one function call from GRO stack for native IPv6 + TCP packets.
$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/0 grow/shrink: 1/1 up/down: 298/-5 (293)
Function old new delta
ipv6_gro_complete 435 733 +298
tcp6_gro_complete 311 306 -5
Total: Before=22593532, After=22593825, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260120164903.1912995-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
FDO/LTO are unable to inline tcp6_gro_receive() from ipv6_gro_receive()
Make sure tcp6_check_fraglist_gro() is only called only when needed,
so that compiler can leave it out-of-line.
$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 2/0 grow/shrink: 3/1 up/down: 1123/-253 (870)
Function old new delta
ipv6_gro_receive 1069 1846 +777
tcp6_check_fraglist_gro - 272 +272
ipv6_offload_init 218 274 +56
__pfx_tcp6_check_fraglist_gro - 16 +16
ipv6_gro_complete 433 435 +2
tcp6_gro_receive 959 706 -253
Total: Before=22592662, After=22593532, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260120164903.1912995-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We can change tcp_rsk() and inet_rsk() to propagate their argument
const qualifier thanks to container_of_const().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260120125353.1470456-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of
hex.h interfaces to directly #include <linux/hex.h> as part of the process
of putting kernel.h on a diet.
Removing hex.h from kernel.h means that 36K C source files don't have to
pay the price of parsing hex.h for the roughly 120 C source files that
need it.
This change has been build-tested with allmodconfig on most ARCHes. Also,
all users/callers of <linux/hex.h> in the entire source tree have been
updated if needed (if not already #included).
Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In nr_route_frame(), old_skb is immediately freed without checking if
nr_neigh->ax25 pointer is NULL. Therefore, if nr_neigh->ax25 is NULL,
the caller function will free old_skb again, causing a double-free bug.
Therefore, to prevent this, we need to modify it to check whether
nr_neigh->ax25 is NULL before freeing old_skb.
Cc: <stable@vger.kernel.org>
Reported-by: syzbot+999115c3bf275797dc27@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69694d6f.050a0220.58bed.0029.GAE@google.com/
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Link: https://patch.msgid.link/20260119063359.10604-1-aha310510@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
HIPPI has not been relevant for over two decades. It was rapidly
eclipsed by Fibre Channel, and even when it was new, it was
confined to very high-end hardware. The HIPPI code has only
received tree-wide changes and fixes by inspection in the entire
Git history. Remove HIPPI support and the rrunner HIPPI driver,
and move the former maintainer to the CREDITS file. Keep the
include/uapi/linux/if_hippi.h header because it is used by the TUN
code, and to avoid breaking userspace, however unlikely that may be.
Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://patch.msgid.link/20260119022451.22344-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_rate_skb_delivered() is only called from tcp_input.c.
Move it there and make it static.
Both gcc and clang are (auto)inlining it, TCP performance
is increased at a small space cost.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 3/0 up/down: 509/-187 (322)
Function old new delta
tcp_sacktag_walk 1682 1867 +185
tcp_ack 5230 5405 +175
tcp_shifted_skb 437 586 +149
__pfx_tcp_rate_skb_delivered 16 - -16
tcp_rate_skb_delivered 171 - -171
Total: Before=22566192, After=22566514, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260118123204.2315993-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot found that ndisc_router_discovery() could read and write
in6_dev->ra_mtu without holding a lock [1]
This looks fine, IFLA_INET6_RA_MTU is best effort.
Add READ_ONCE()/WRITE_ONCE() to document the race.
Note that we might also reject illegal MTU values
(mtu < IPV6_MIN_MTU || mtu > skb->dev->mtu) in a future patch.
[1]
BUG: KCSAN: data-race in ndisc_router_discovery / ndisc_router_discovery
read to 0xffff888119809c20 of 4 bytes by task 25817 on cpu 1:
ndisc_router_discovery+0x151d/0x1c90 net/ipv6/ndisc.c:1558
ndisc_rcv+0x2ad/0x3d0 net/ipv6/ndisc.c:1841
icmpv6_rcv+0xe5a/0x12f0 net/ipv6/icmp.c:989
ip6_protocol_deliver_rcu+0xb2a/0x10d0 net/ipv6/ip6_input.c:438
ip6_input_finish+0xf0/0x1d0 net/ipv6/ip6_input.c:489
NF_HOOK include/linux/netfilter.h:318 [inline]
ip6_input+0x5e/0x140 net/ipv6/ip6_input.c:500
ip6_mc_input+0x27c/0x470 net/ipv6/ip6_input.c:590
dst_input include/net/dst.h:474 [inline]
ip6_rcv_finish+0x336/0x340 net/ipv6/ip6_input.c:79
...
write to 0xffff888119809c20 of 4 bytes by task 25816 on cpu 0:
ndisc_router_discovery+0x155a/0x1c90 net/ipv6/ndisc.c:1559
ndisc_rcv+0x2ad/0x3d0 net/ipv6/ndisc.c:1841
icmpv6_rcv+0xe5a/0x12f0 net/ipv6/icmp.c:989
ip6_protocol_deliver_rcu+0xb2a/0x10d0 net/ipv6/ip6_input.c:438
ip6_input_finish+0xf0/0x1d0 net/ipv6/ip6_input.c:489
NF_HOOK include/linux/netfilter.h:318 [inline]
ip6_input+0x5e/0x140 net/ipv6/ip6_input.c:500
ip6_mc_input+0x27c/0x470 net/ipv6/ip6_input.c:590
dst_input include/net/dst.h:474 [inline]
ip6_rcv_finish+0x336/0x340 net/ipv6/ip6_input.c:79
...
value changed: 0x00000000 -> 0xe5400659
Fixes: 49b99da2c9ce ("ipv6: add IFLA_INET6_RA_MTU to expose mtu value")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rocco Yue <rocco.yue@mediatek.com>
Link: https://patch.msgid.link/20260118152941.2563857-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Looks like AI reviewers miss that napi_consume_skb() must have
a real budget passed to it. Let's see if adding a real kdoc will
help them figure this out.
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260119224140.1362729-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|