kernel/drivers/infiniband/core, branch linux-6.9.y

IB/core: Implement a limit on UMAD receive List

2024-07-11T10:50:57Z

[ Upstream commit ca0b44e20a6f3032224599f02e7c8fb49525c894 ]

The existing behavior of ib_umad, which maintains received MAD packets in an unbounded list, poses a risk of uncontrolled growth. As user-space applications extract packets from this list, the rate of extraction may not match the rate of incoming packets, leading to potential list overflow.

To address this, we introduce a limit to the size of the list. After considering typical scenarios, such as OpenSM processing, which can handle approximately 100k packets per second, and the 1-second retry timeout for most packets, we set the list size limit to 200k. Packets received beyond this limit are dropped, assuming they are likely timed out by the time they are handled by user-space.

Notably, packets queued on the receive list due to reasons like timed-out sends are preserved even when the list is full.

Signed-off-by: Michael Guralnik
Reviewed-by: Mark Zhang
Link: https://lore.kernel.org/r/7197cb58a7d9e78399008f25036205ceab07fbd5.1713268818.git.leon@kernel.org
Signed-off-by: Leon Romanovsky
Signed-off-by: Sasha Levin
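The drop policy described in this commit can be sketched in user-space C. This is a minimal illustration, not the ib_umad code: the names `recv_queue`, `recv_queue_add`, `recv_queue_pop` and `RECV_LIST_LIMIT` are hypothetical. Only the 200k limit and the rule that entries queued for other reasons (such as timed-out sends) bypass it come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical name; the value is the limit quoted in the commit message. */
#define RECV_LIST_LIMIT 200000

struct recv_queue {
    size_t len;     /* entries currently queued */
    size_t dropped; /* received packets refused because the queue was full */
};

/*
 * Queue one entry. `is_recv` is true for a packet that arrived from the
 * wire and false for an entry queued for another reason (for example a
 * timed-out send), which the commit message says is preserved even when
 * the list is full. Returns true if queued, false if dropped.
 */
static bool recv_queue_add(struct recv_queue *q, bool is_recv)
{
    assert(q != NULL);
    if (is_recv && q->len >= RECV_LIST_LIMIT) {
        q->dropped++;
        return false;
    }
    q->len++;
    return true;
}

/* A user-space read removes one entry; returns false if the queue is empty. */
static bool recv_queue_pop(struct recv_queue *q)
{
    assert(q != NULL);
    if (q->len == 0)
        return false;
    q->len--;
    return true;
}
```

Checking the limit only on the receive path keeps the accounting simple: the counter still tracks every entry, but only newly received packets can be refused.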

RDMA/restrack: Fix potential invalid address access

2024-07-05T07:37:59Z

[ Upstream commit ca537a34775c103f7b14d7bbd976403f1d1525d8 ]

The kern_name of struct rdma_restrack_entry is set to KBUILD_MODNAME in ib_create_cq(). If the module exits without deleting this rdma_restrack_entry, rdma_restrack_clean() causes an invalid address access when it prints the owner of the entry.

This code was used to help find a forgotten PD release in one of the ULPs. It is no longer needed, so delete it.

Signed-off-by: Wenchao Hao
Link: https://lore.kernel.org/r/20240318092320.1215235-1-haowenchao2@huawei.com
Signed-off-by: Leon Romanovsky
Signed-off-by: Sasha Levin

inet: introduce dst_rtable() helper

2024-06-12T09:39:55Z

[ Upstream commit 05d6d492097c55f2d153fc3fd33cbe78e1e28e0a ]

I added dst_rt6_info() in commit e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper")

This patch does a similar change for IPv4.

Instead of (struct rtable *)dst casts, we can use:

    #define dst_rtable(_ptr) \
            container_of_const(_ptr, struct rtable, dst)

Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.

Signed-off-by: Eric Dumazet
Reviewed-by: David Ahern
Reviewed-by: Sabrina Dubroca
Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com
Signed-off-by: Jakub Kicinski
Stable-dep-of: 92f1655aa2b2 ("net: fix __dst_negative_advice() race")
Signed-off-by: Sasha Levin
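To show what a container_of_const()-style helper gains over a plain cast, here is a user-space sketch. It assumes GCC or Clang (for `__typeof__`) and C11 `_Generic`. The structs `fake_dst` and `fake_rtable` and the macros `CONTAINER_OF`, `CONTAINER_OF_CONST` and `to_fake_rtable` are simplified stand-ins invented for this example, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the real kernel structs have many more members. */
struct fake_dst {
    int obsolete;
};

struct fake_rtable {
    int rt_flags;
    struct fake_dst dst; /* deliberately not first, so a plain cast would be wrong */
};

/* Step back from a member pointer to the enclosing object. */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Same computation, but a const input pointer yields a const result. */
#define CONTAINER_OF_CONST(ptr, type, member)                     \
    _Generic((ptr),                                               \
        const __typeof__(*(ptr)) *:                               \
            ((const type *)CONTAINER_OF(ptr, type, member)),      \
        default: ((type *)CONTAINER_OF(ptr, type, member)))

#define to_fake_rtable(_ptr) \
    CONTAINER_OF_CONST(_ptr, struct fake_rtable, dst)

static_assert(offsetof(struct fake_rtable, dst) != 0,
              "the example relies on dst not being the first member");

/* A reader that takes a const dst and never gains write access to the rtable. */
static int rt_flags_of(const struct fake_dst *dst)
{
    return to_fake_rtable(dst)->rt_flags;
}
```

Unlike `(struct fake_rtable *)dst`, the macro does not depend on where `dst` sits in the struct, and it does not silently discard a `const` qualifier.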

ipv6: introduce dst_rt6_info() helper

2024-06-12T09:39:54Z

[ Upstream commit e8dfd42c17faf183415323db1ef0c977be0d6489 ]

Instead of (struct rt6_info *)dst casts, we can use:

    #define dst_rt6_info(_ptr) \
            container_of_const(_ptr, struct rt6_info, dst)

Some places needed missing const qualifiers:

ip6_confirm_neigh(), ipv6_anycast_destination(), ipv6_unicast_destination(), has_gateway()

v2: added missing parts (David Ahern)

Signed-off-by: Eric Dumazet
Reviewed-by: David Ahern
Signed-off-by: David S. Miller
Stable-dep-of: 92f1655aa2b2 ("net: fix __dst_negative_advice() race")
Signed-off-by: Sasha Levin

RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw

2024-05-30T07:45:01Z

[ Upstream commit 9c0731832d3b7420cbadba6a7f334363bc8dfb15 ]

When running blktests nvme/rdma, the following kmemleak issue will appear.

    kmemleak: Kernel memory leak detector initialized (mempool available:36041)
    kmemleak: Automatic memory scanning thread started
    kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
    kmemleak: 8 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
    kmemleak: 17 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
    kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

    unreferenced object 0xffff88855da53400 (size 192):
      comm "rdma", pid 10630, jiffies 4296575922
      hex dump (first 32 bytes):
        37 00 00 00 00 00 00 00 c0 ff ff ff 1f 00 00 00  7...............
        10 34 a5 5d 85 88 ff ff 10 34 a5 5d 85 88 ff ff  .4.].....4.]....
      backtrace (crc 47f66721):
        [] kmalloc_trace+0x30d/0x3b0
        [] alloc_gid_entry+0x47/0x380 [ib_core]
        [] add_modify_gid+0x166/0x930 [ib_core]
        [] ib_cache_update.part.0+0x6d8/0x910 [ib_core]
        [] ib_cache_setup_one+0x24a/0x350 [ib_core]
        [] ib_register_device+0x9e/0x3a0 [ib_core]
        [] 0xffffffffc2a3d389
        [] nldev_newlink+0x2b8/0x520 [ib_core]
        [] rdma_nl_rcv_msg+0x2c3/0x520 [ib_core]
        [] rdma_nl_rcv_skb.constprop.0.isra.0+0x23c/0x3a0 [ib_core]
        [] netlink_unicast+0x445/0x710
        [] netlink_sendmsg+0x761/0xc40
        [] __sys_sendto+0x3a9/0x420
        [] __x64_sys_sendto+0xdc/0x1b0
        [] do_syscall_64+0x93/0x180
        [] entry_SYSCALL_64_after_hwframe+0x71/0x79

The root cause: rdma_put_gid_attr is not called when sgid_attr is set to ERR_PTR(-ENODEV).

Reported-and-tested-by: Yi Zhang
Closes: https://lore.kernel.org/all/19bf5745-1b3b-4b8a-81c2-20d945943aaf@linux.dev/T/
Fixes: f8ef1be816bf ("RDMA/cma: Avoid GID lookups on iWARP devices")
Reviewed-by: Chuck Lever
Signed-off-by: Zhu Yanjun
Link: https://lore.kernel.org/r/20240510211247.31345-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky
Signed-off-by: Sasha Levin

RDMA/cm: Print the old state when cm_destroy_id gets timeout

2024-04-01T12:16:36Z

The old state is helpful for debugging, as the current state is always IB_CM_IDLE when timeout happens.

Fixes: 96d9cbe2f2ff ("RDMA/cm: add timeout to cm_destroy_id wait")
Signed-off-by: Mark Zhang
Link: https://lore.kernel.org/r/20240322112049.2022994-1-markzhang@nvidia.com
Signed-off-by: Leon Romanovsky

RDMA/cm: add timeout to cm_destroy_id wait

2024-03-10T11:17:54Z

Add timeout to cm_destroy_id, so that userspace can trigger any data collection that would help in analyzing the cause of delay in destroying the cm_id.

New noinline function helps dtrace/ebpf programs to hook on to it. Existing functionality isn't changed except triggering a probe-able new function at every timeout interval.

We have seen cases where CM messages stuck with MAD layer (either due to software bug or faulty HCA), leading to cm_id getting stuck in the following call stack. This patch helps in resolving such issues faster.

    kernel: ... INFO: task XXXX:56778 blocked for more than 120 seconds.
    ...
    Call Trace:
    __schedule+0x2bc/0x895
    schedule+0x36/0x7c
    schedule_timeout+0x1f6/0x31f
    ? __slab_free+0x19c/0x2ba
    wait_for_completion+0x12b/0x18a
    ? wake_up_q+0x80/0x73
    cm_destroy_id+0x345/0x610 [ib_cm]
    ib_destroy_cm_id+0x10/0x20 [ib_cm]
    rdma_destroy_id+0xa8/0x300 [rdma_cm]
    ucma_destroy_id+0x13e/0x190 [rdma_ucm]
    ucma_write+0xe0/0x160 [rdma_ucm]
    __vfs_write+0x3a/0x16d
    vfs_write+0xb2/0x1a1
    ? syscall_trace_enter+0x1ce/0x2b8
    SyS_write+0x5c/0xd3
    do_syscall_64+0x79/0x1b9
    entry_SYSCALL_64_after_hwframe+0x16d/0x0

Signed-off-by: Manjunath Patil
Link: https://lore.kernel.org/r/20240309063323.458102-1-manjunath.b.patil@oracle.com
Signed-off-by: Leon Romanovsky
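The pattern of waiting with a timeout and calling a probe-able function on each expiry can be sketched in user-space C. It assumes a simulated completion rather than the kernel's completion API, and every name here (`fake_completion`, `fake_wait_timeout`, `destroy_wait_timeout_probe`, `destroy_wait`) is hypothetical. The actual patch is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a completion that is signalled after N timeout intervals. */
struct fake_completion {
    unsigned int intervals_left;
};

/* Counts probe invocations so the behaviour can be observed in a test. */
static unsigned int probe_hits;

/* Simulates one timed wait: true if completed, false if the interval expired. */
static bool fake_wait_timeout(struct fake_completion *c)
{
    if (c->intervals_left == 0)
        return true;
    c->intervals_left--;
    return false;
}

/*
 * Kept out of line so that it exists as a distinct function for a tracer
 * to attach to. It does no work beyond marking that a timeout happened.
 */
__attribute__((noinline)) void destroy_wait_timeout_probe(unsigned int id)
{
    (void)id;
    probe_hits++;
}

/*
 * Wait until the completion fires, calling the probe once per expired
 * interval. Returns the number of intervals that expired.
 */
static unsigned int destroy_wait(struct fake_completion *c, unsigned int id)
{
    unsigned int timeouts = 0;

    assert(c != NULL);
    while (!fake_wait_timeout(c)) {
        destroy_wait_timeout_probe(id);
        timeouts++;
    }
    return timeouts;
}
```

The loop still blocks until completion, consistent with the commit message's statement that existing functionality is unchanged. The only addition is a hook point that fires on every timeout.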

RDMA/uverbs: Avoid -Wflex-array-member-not-at-end warnings

2024-03-03T13:38:44Z

-Wflex-array-member-not-at-end is coming in GCC-14, and we are getting ready to enable it globally.

There are currently a couple of objects (`alloc_head` and `bundle`) in `struct bundle_priv` that contain a couple of flexible structures:

    struct bundle_priv {
            /* Must be first */
            struct bundle_alloc_head alloc_head;

            ...

            /*
             * Must be last. bundle ends in a flex array which overlaps
             * internal_buffer.
             */
            struct uverbs_attr_bundle bundle;
            u64 internal_buffer[32];
    };

So, in order to avoid ending up with a couple of flexible-array members in the middle of a struct, we use the `struct_group_tagged()` helper to separate the flexible array from the rest of the members in the flexible structures:

    struct uverbs_attr_bundle {
            struct_group_tagged(uverbs_attr_bundle_hdr, hdr,
                    ... the rest of the members
            );
            struct uverbs_attr attrs[];
    };

With the change described above, we now declare objects of the type of the tagged struct without embedding flexible arrays in the middle of another struct:

    struct bundle_priv {
            /* Must be first */
            struct bundle_alloc_head_hdr alloc_head;

            ...

            struct uverbs_attr_bundle_hdr bundle;
            u64 internal_buffer[32];
    };

We also use `container_of()` whenever we need to retrieve a pointer to the flexible structures.

Notice that the `bundle_size` computed in `uapi_compute_bundle_size()` remains the same.

So, with these changes, fix the following warnings:

    drivers/infiniband/core/uverbs_ioctl.c:45:34: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
       45 |         struct bundle_alloc_head alloc_head;
          |                                  ^~~~~~~~~~
    drivers/infiniband/core/uverbs_ioctl.c:67:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
       67 |         struct uverbs_attr_bundle bundle;
          |                                   ^~~~~~

Signed-off-by: Gustavo A. R. Silva
Link: https://lore.kernel.org/r/ZeIgeZ5Sb0IZTOyt@neat
Reviewed-by: Kees Cook
Signed-off-by: Leon Romanovsky
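The header-splitting technique described above can be tried outside the kernel with a simplified macro. This is a sketch under stated assumptions: `STRUCT_GROUP_TAGGED` is a cut-down imitation of the kernel's `struct_group_tagged()`, and `demo_attr`, `demo_bundle`, `demo_priv` and `to_bundle` are hypothetical names with made-up members. It relies on C11 anonymous structs and unions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified imitation of struct_group_tagged(): the members are reachable
 * directly through the anonymous struct and also as one named object whose
 * type, struct TAG, can be declared on its own elsewhere.
 */
#define STRUCT_GROUP_TAGGED(TAG, NAME, ...) \
    union {                                 \
        struct { __VA_ARGS__ };             \
        struct TAG { __VA_ARGS__ } NAME;    \
    }

struct demo_attr {
    uint64_t data;
};

/* The flexible structure: a fixed header followed by a flexible array. */
struct demo_bundle {
    STRUCT_GROUP_TAGGED(demo_bundle_hdr, hdr,
        void *driver_data;
        uint32_t attr_count;
    );
    struct demo_attr attrs[];
};

/* Only the header is embedded, so no flexible array sits mid-struct. */
struct demo_priv {
    struct demo_bundle_hdr bundle;
    uint64_t internal_buffer[32]; /* storage that attrs[] is meant to overlay */
};

/* Recover the flexible structure from a pointer to its embedded header. */
static struct demo_bundle *to_bundle(struct demo_bundle_hdr *hdr)
{
    return (struct demo_bundle *)((char *)hdr -
                                  offsetof(struct demo_bundle, hdr));
}

/* The layout is unchanged: attrs[] starts exactly where the header ends. */
static_assert(sizeof(struct demo_bundle_hdr) ==
              offsetof(struct demo_bundle, attrs),
              "header size must match the flexible array offset");
static_assert(offsetof(struct demo_priv, internal_buffer) ==
              sizeof(struct demo_bundle_hdr),
              "internal_buffer must directly follow the header");
```

The two static assertions capture the property the commit message relies on: splitting off the header changes the declared types but not where the bytes live.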

RDMA/uverbs: Remove flexible arrays from struct *_filter

2024-02-21T17:28:52Z

When a struct containing a flexible array is included in another struct, and there is a member after the struct-with-flex-array, there is a possibility of memory overlap. These cases must be audited [1]. See:

    struct inner {
            ...
            int flex[];
    };

    struct outer {
            ...
            struct inner header;
            int overlap;
            ...
    };

This is the scenario for all the "struct *_filter" structures that are included in the following "struct ib_flow_spec_*" structures:

    struct ib_flow_spec_eth
    struct ib_flow_spec_ib
    struct ib_flow_spec_ipv4
    struct ib_flow_spec_ipv6
    struct ib_flow_spec_tcp_udp
    struct ib_flow_spec_tunnel
    struct ib_flow_spec_esp
    struct ib_flow_spec_gre
    struct ib_flow_spec_mpls

The pattern is like the one shown below:

    struct *_filter {
            ...
            u8 real_sz[];
    };

    struct ib_flow_spec_* {
            ...
            struct *_filter val;
            struct *_filter mask;
    };

In this case, the trailing flexible array "real_sz" is never allocated and is only used to calculate the size of the structures. Here the use of the "offsetof" helper can be replaced by the "sizeof" operator because the goal is to get the size of these structures. Therefore, the trailing flexible arrays can also be removed.

However, due to the trailing padding that can be induced in structs, it is possible that:

    offsetof(struct *_filter, real_sz) != sizeof(struct *_filter)

This situation happens with the "struct ib_flow_ipv6_filter", and to avoid it the "__packed" macro is used in this structure. But now, the "sizeof(struct ib_flow_ipv6_filter)" has changed. This is not a problem since this size is not used in the code.

The situation now is that "sizeof(struct ib_flow_spec_ipv6)" has also changed (this struct contains the struct ib_flow_ipv6_filter). This is also not a problem since it is only used to set the size of the "union ib_flow_spec", which can store all the "ib_flow_spec_*" structures.

Link: https://lore.kernel.org/r/20240217142913.4285-1-erick.archer@gmx.com
Signed-off-by: Erick Archer
Signed-off-by: Jason Gunthorpe
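The padding issue behind the "__packed" change can be reproduced with a small hypothetical struct. The three `demo_filter*` types below are invented for illustration and do not mirror the layout of `struct ib_flow_ipv6_filter`. The sketch assumes GCC or Clang for `__attribute__((packed))` and a platform where `uint32_t` has an alignment greater than 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old style: a zero-size marker records where the real members end. */
struct demo_filter_flex {
    uint32_t flow_label;
    uint8_t  hop_limit;
    uint8_t  real_sz[]; /* never allocated, used only with offsetof() */
};

/* Marker removed: sizeof() now includes trailing padding. */
struct demo_filter {
    uint32_t flow_label;
    uint8_t  hop_limit;
};

/* Marker removed and packed: sizeof() matches the old offsetof() again. */
struct demo_filter_packed {
    uint32_t flow_label;
    uint8_t  hop_limit;
} __attribute__((packed));

/* The members occupy 4 + 1 bytes, so the marker sits at offset 5. */
static_assert(offsetof(struct demo_filter_flex, real_sz) == 5,
              "marker follows the last real member");

/* Without packing, the struct is rounded up to its alignment. */
static_assert(sizeof(struct demo_filter) >
              offsetof(struct demo_filter_flex, real_sz),
              "trailing padding makes sizeof larger than the marker offset");

/* With packing, sizeof can stand in for the old offsetof expression. */
static_assert(sizeof(struct demo_filter_packed) ==
              offsetof(struct demo_filter_flex, real_sz),
              "packed size equals the marker offset");
```

Structs whose members already end on an alignment boundary need no packing, which is consistent with the commit message naming only `struct ib_flow_ipv6_filter` as affected.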

RDMA/device: Fix a race between mad_client and cm_client init

2024-02-21T17:15:50Z

The mad_client will be initialized in enable_device_and_get(), while the devices_rwsem will be downgraded to a read semaphore. There is a window that leads to the failed initialization for cm_client, since it can not get matched mad port from ib_mad_port_list, and the matched mad port will be added to the list after that.

        mad_client    |                       cm_client
    ------------------|--------------------------------------------------------
    ib_register_device|
    enable_device_and_get
    down_write(&devices_rwsem)
    xa_set_mark(&devices, DEVICE_REGISTERED)
    downgrade_write(&devices_rwsem)
                      |
                      |ib_cm_init
                      |ib_register_client(&cm_client)
                      |down_read(&devices_rwsem)
                      |xa_for_each_marked (&devices, DEVICE_REGISTERED)
                      |add_client_context
                      |cm_add_one
                      |ib_register_mad_agent
                      |ib_get_mad_port
                      |__ib_get_mad_port
                      |list_for_each_entry(entry, &ib_mad_port_list, port_list)
                      |return NULL
                      |up_read(&devices_rwsem)
                      |
    add_client_context|
    ib_mad_init_device|
    ib_mad_port_open  |
    list_add_tail(&port_priv->port_list, &ib_mad_port_list)
    up_read(&devices_rwsem)
                      |

Fix it by using down_write(&devices_rwsem) in ib_register_client().

Fixes: d0899892edd0 ("RDMA/device: Provide APIs from the core code to help unregistration")
Link: https://lore.kernel.org/r/20240203035313.98991-1-lishifeng@sangfor.com.cn
Suggested-by: Jason Gunthorpe
Signed-off-by: Shifeng Li
Signed-off-by: Jason Gunthorpe