<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/net/ipv6/inet6_hashtables.c, branch linux-6.2.y</title>
<subtitle>Hosts the 0x221E linux distro kernel.</subtitle>
<id>https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.2.y</id>
<link rel='self' href='https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.2.y'/>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/'/>
<updated>2022-09-20T17:21:49Z</updated>
<entry>
<title>tcp: Access &amp;tcp_hashinfo via net.</title>
<updated>2022-09-20T17:21:49Z</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-09-08T01:10:20Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=4461568aa4e565de2c336f4875ddf912f26da8a5'/>
<id>urn:sha1:4461568aa4e565de2c336f4875ddf912f26da8a5</id>
<content type='text'>
We will soon introduce an optional per-netns ehash.

This means we cannot use tcp_hashinfo directly in most places.

Instead, access it via net-&gt;ipv4.tcp_death_row.hashinfo.

The access will be valid only while initialising tcp_hashinfo
itself and creating/destroying each netns.

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>ipv6: add READ_ONCE(sk-&gt;sk_bound_dev_if) in INET6_MATCH()</title>
<updated>2022-05-16T09:31:06Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-05-13T18:55:49Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=5d368f03280d3678433a7f119efe15dfbbb87bc8'/>
<id>urn:sha1:5d368f03280d3678433a7f119efe15dfbbb87bc8</id>
<content type='text'>
INET6_MATCH() runs without holding a lock on the socket.

We probably need to annotate most reads.

This patch makes INET6_MATCH() an inline function
to ease our changes.

v2: inline function only defined if IS_ENABLED(CONFIG_IPV6)
    Change the name to inet6_match(), this is no longer a macro.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: inet: Retire port only listening_hash</title>
<updated>2022-05-12T23:52:18Z</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2022-05-12T00:06:05Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120'/>
<id>urn:sha1:cae3873c5b3a4fcd9706fb461ff4e91bdf1f0120</id>
<content type='text'>
The listen sk is currently stored in two hash tables,
listening_hash (hashed by port) and lhash2 (hashed by port and address).

After commit 0ee58dad5b06 ("net: tcp6: prefer listeners bound to an address")
and commit d9fbc7f6431f ("net: tcp: prefer listeners bound to an address"),
the TCP-SYN lookup fast path does not use listening_hash.

The commit 05c0b35709c5 ("tcp: seq_file: Replace listening_hash with lhash2")
also moved the seq_file (/proc/net/tcp) iteration usage from
listening_hash to lhash2.

There are still a few listening_hash usages left.
One of them is inet_reuseport_add_sock() which uses the listening_hash
to search a listen sk during the listen() system call.  This turns
out to be very slow on use cases that listen on many different
VIPs at a popular port (e.g. 443).  [ On top of the slowness in
adding to the tail in the IPv6 case ].  The latter patch has a
selftest to demonstrate this case.

This patch takes this chance to move all remaining listening_hash
usages to lhash2 and then retire listening_hash.

Since most changes need to be done together, it is hard to cut
the listening_hash to lhash2 switch into small patches.  The
changes in this patch is highlighted here for the review
purpose.

1. Because of the listening_hash removal, lhash2 can use the
   sk-&gt;sk_nulls_node instead of the icsk-&gt;icsk_listen_portaddr_node.
   This will also keep the sk_unhashed() check to work as is
   after stop adding sk to listening_hash.

   The union is removed from inet_listen_hashbucket because
   only nulls_head is needed.

2. icsk-&gt;icsk_listen_portaddr_node and its helpers are removed.

3. The current lhash2 users needs to iterate with sk_nulls_node
   instead of icsk_listen_portaddr_node.

   One case is in the inet[6]_lhash2_lookup().

   Another case is the seq_file iterator in tcp_ipv4.c.
   One thing to note is sk_nulls_next() is needed
   because the old inet_lhash2_for_each_icsk_continue()
   does a "next" first before iterating.

4. Move the remaining listening_hash usage to lhash2

   inet_reuseport_add_sock() which this series is
   trying to improve.

   inet_diag.c and mptcp_diag.c are the final two
   remaining use cases and is moved to lhash2 now also.

Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>secure_seq: use the 64 bits of the siphash for port offset calculation</title>
<updated>2022-05-05T02:22:20Z</updated>
<author>
<name>Willy Tarreau</name>
<email>w@1wt.eu</email>
</author>
<published>2022-05-02T08:46:08Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=b2d057560b8107c633b39aabe517ff9d93f285e3'/>
<id>urn:sha1:b2d057560b8107c633b39aabe517ff9d93f285e3</id>
<content type='text'>
SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit
7cd23e5300c1 ("secure_seq: use SipHash in place of MD5"), but the output
remained truncated to 32-bit only. In order to exploit more bits from the
hash, let's make the functions return the full 64-bit of siphash_3u32().
We also make sure the port offset calculation in __inet_hash_connect()
remains done on 32-bit to avoid the need for div_u64_rem() and an extra
cost on 32-bit systems.

Cc: Jason A. Donenfeld &lt;Jason@zx2c4.com&gt;
Cc: Moshe Kol &lt;moshe.kol@mail.huji.ac.il&gt;
Cc: Yossi Gilad &lt;yossi.gilad@mail.huji.ac.il&gt;
Cc: Amit Klein &lt;aksecurity@gmail.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.</title>
<updated>2022-02-10T05:28:36Z</updated>
<author>
<name>Sebastian Andrzej Siewior</name>
<email>bigeasy@linutronix.de</email>
</author>
<published>2022-02-09T18:56:57Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=4f9bf2a2f5aacf988e6d5e56b961ba45c5a25248'/>
<id>urn:sha1:4f9bf2a2f5aacf988e6d5e56b961ba45c5a25248</id>
<content type='text'>
Commit
   9652dc2eb9e40 ("tcp: relax listening_hash operations")

removed the need to disable bottom half while acquiring
listening_hash.lock. There are still two callers left which disable
bottom half before the lock is acquired.

On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
as a lock to ensure that resources, that are protected by disabling
bottom halves, remain protected.
This leads to a circular locking dependency if the lock acquired with
disabled bottom halves is also acquired with enabled bottom halves
followed by disabling bottom halves. This is the reverse locking order.
It has been observed with inet_listen_hashbucket::lock:

local_bh_disable() + spin_lock(&amp;ilb-&gt;lock):
  inet_listen()
    inet_csk_listen_start()
      sk-&gt;sk_prot-&gt;hash() := inet_hash()
	local_bh_disable()
	__inet_hash()
	  spin_lock(&amp;ilb-&gt;lock);
	    acquire(&amp;ilb-&gt;lock);

Reverse order: spin_lock(&amp;ilb2-&gt;lock) + local_bh_disable():
  tcp_seq_next()
    listening_get_next()
      spin_lock(&amp;ilb2-&gt;lock);
	acquire(&amp;ilb2-&gt;lock);

  tcp4_seq_show()
    get_tcp4_sock()
      sock_i_ino()
	read_lock_bh(&amp;sk-&gt;sk_callback_lock);
	  acquire(softirq_ctrl)	// &lt;---- whoops
	  acquire(&amp;sk-&gt;sk_callback_lock)

Drop local_bh_disable() around __inet_hash() which acquires
listening_hash-&gt;lock. Split inet_unhash() and acquire the
listen_hashbucket lock without disabling bottom halves; the inet_ehash
lock with disabled bottom halves.

Reported-by: Mike Galbraith &lt;efault@gmx.de&gt;
Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Link: https://lkml.kernel.org/r/12d6f9879a97cd56c09fb53dee343cbb14f7f1f7.camel@gmx.de
Link: https://lkml.kernel.org/r/X9CheYjuXWc75Spa@hirez.programming.kicks-ass.net
Link: https://lore.kernel.org/r/YgQOebeZ10eNx1W6@linutronix.de
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Add ingress_ifindex to bpf_sk_lookup</title>
<updated>2021-11-11T00:29:58Z</updated>
<author>
<name>Mark Pashmfouroush</name>
<email>markpash@cloudflare.com</email>
</author>
<published>2021-11-10T11:10:15Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=f89315650ba34ec6c91a8bded72796980bee2a4d'/>
<id>urn:sha1:f89315650ba34ec6c91a8bded72796980bee2a4d</id>
<content type='text'>
It may be helpful to have access to the ifindex during bpf socket
lookup. An example may be to scope certain socket lookup logic to
specific interfaces, i.e. an interface may be made exempt from custom
lookup code.

Add the ifindex of the arriving connection to the bpf_sk_lookup API.

Signed-off-by: Mark Pashmfouroush &lt;markpash@cloudflare.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com
</content>
</entry>
<entry>
<title>net: prefer socket bound to interface when not in VRF</title>
<updated>2021-10-07T14:27:55Z</updated>
<author>
<name>Mike Manning</name>
<email>mvrmanning@gmail.com</email>
</author>
<published>2021-10-05T13:03:42Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=8d6c414cd2fb74aa6812e9bfec6178f8246c4f3a'/>
<id>urn:sha1:8d6c414cd2fb74aa6812e9bfec6178f8246c4f3a</id>
<content type='text'>
The commit 6da5b0f027a8 ("net: ensure unbound datagram socket to be
chosen when not in a VRF") modified compute_score() so that a device
match is always made, not just in the case of an l3mdev skb, then
increments the score also for unbound sockets. This ensures that
sockets bound to an l3mdev are never selected when not in a VRF.
But as unbound and bound sockets are now scored equally, this results
in the last opened socket being selected if there are matches in the
default VRF for an unbound socket and a socket bound to a dev that is
not an l3mdev. However, handling prior to this commit was to always
select the bound socket in this case. Reinstate this handling by
incrementing the score only for bound sockets. The required isolation
due to choosing between an unbound socket and a socket bound to an
l3mdev remains in place due to the device match always being made.
The same approach is taken for compute_score() for stream sockets.

Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF")
Signed-off-by: Mike Manning &lt;mmanning@vyatta.att-mail.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Link: https://lore.kernel.org/r/cf0a8523-b362-1edf-ee78-eef63cbbb428@gmail.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: ipv6: remove unused arg exact_dif in compute_score</title>
<updated>2020-08-31T20:08:10Z</updated>
<author>
<name>Miaohe Lin</name>
<email>linmiaohe@huawei.com</email>
</author>
<published>2020-08-31T06:26:10Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=3f7d820bad6cace3ecb859bf3fa5bc56fe8bb1c6'/>
<id>urn:sha1:3f7d820bad6cace3ecb859bf3fa5bc56fe8bb1c6</id>
<content type='text'>
The arg exact_dif is not used anymore, remove it. inet6_exact_dif_match()
is no longer needed after the above is removed, remove it too.

Signed-off-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>inet6: Run SK_LOOKUP BPF program on socket lookup</title>
<updated>2020-07-18T03:18:17Z</updated>
<author>
<name>Jakub Sitnicki</name>
<email>jakub@cloudflare.com</email>
</author>
<published>2020-07-17T10:35:27Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=1122702f02678597c4f1c7d316365ef502aafe08'/>
<id>urn:sha1:1122702f02678597c4f1c7d316365ef502aafe08</id>
<content type='text'>
Following ipv4 stack changes, run a BPF program attached to netns before
looking up a listening socket. Program can return a listening socket to use
as result of socket lookup, fail the lookup, or take no action.

Suggested-by: Marek Majkowski &lt;marek@cloudflare.com&gt;
Signed-off-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Link: https://lore.kernel.org/bpf/20200717103536.397595-7-jakub@cloudflare.com
</content>
</entry>
<entry>
<title>inet6: Extract helper for selecting socket from reuseport group</title>
<updated>2020-07-18T03:18:17Z</updated>
<author>
<name>Jakub Sitnicki</name>
<email>jakub@cloudflare.com</email>
</author>
<published>2020-07-17T10:35:26Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=5df6531292b5021ac9e4ed261eb7d1fa9ff3bf08'/>
<id>urn:sha1:5df6531292b5021ac9e4ed261eb7d1fa9ff3bf08</id>
<content type='text'>
Prepare for calling into reuseport from inet6_lookup_listener as well.

Signed-off-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: Andrii Nakryiko &lt;andriin@fb.com&gt;
Link: https://lore.kernel.org/bpf/20200717103536.397595-6-jakub@cloudflare.com
</content>
</entry>
</feed>
