kernel/include/linux/mlx5, branch linux-6.17.y

mlx5: Fix default values in create CQ

2025-11-24T09:37:31Z

[ Upstream commit e5eba42f01340f73888dfe560be2806057c25913 ] Currently, CQs without a completion function are assigned the mlx5_add_cq_to_tasklet function by default. This is problematic since only user CQs created through the mlx5_ib driver are intended to use this function. Additionally, all CQs that will use doorbells instead of polling for completions must call mlx5_cq_arm. However, the default CQ creation flow leaves a valid value in the CQ's arm_db field, allowing FW to send interrupts to polling-only CQs in certain corner cases. These two factors would allow a polling-only kernel CQ to be triggered by an EQ interrupt and call a completion function intended only for user CQs, causing a null pointer exception. Some areas in the driver have prevented this issue with one-off fixes but did not address the root cause. This patch fixes the described issue by adding defaults to the create CQ flow. It adds a default dummy completion function to protect against null pointer exceptions, and it sets an invalid command sequence number by default in kernel CQs to prevent the FW from sending an interrupt to the CQ until it is armed. User CQs are responsible for their own initialization values. Callers of mlx5_core_create_cq are responsible for changing the completion function and arming the CQ per their needs. Fixes: cdd04f4d4d71 ("net/mlx5: Add support to create SQ and CQ for ASO") Signed-off-by: Akiva Goldberger Reviewed-by: Moshe Shemesh Signed-off-by: Tariq Toukan Acked-by: Leon Romanovsky Link: https://patch.msgid.link/1762681743-1084694-1-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni Signed-off-by: Sasha Levin

net/mlx5e: Prepare for using different CQ doorbells

2025-11-24T09:37:31Z

[ Upstream commit a315b723e87ba4e4573e1e5c759d512f38bdc0b3 ] Completion queues (CQs) in mlx5 use the same global doorbell, which may become contended when accessed concurrently from many cores. This patch prepares the CQ management code for supporting different doorbells per CQ. This will be used in downstream patches to allow separate doorbells to be used by channels CQs. The main change is moving the 'uar' pointer from struct mlx5_core_cq to struct mlx5e_cq, as the uar page to be used is better off stored directly there. Other users of mlx5_core_cq also store the UAR to be used separately and therefore the pointer being removed is dead weight for them. As evidence, in this patch there are two users which set the mcq.uar pointer but didn't use it, Software Steering and old Innova CQ creation code. Instead, they rang the doorbell directly from another pointer. The 'uar' pointer added to struct mlx5e_cq remains in a hot cacheline (as before), because it may get accessed for each packet. Signed-off-by: Cosmin Ratiu Reviewed-by: Dragos Tatulea Signed-off-by: Tariq Toukan Reviewed-by: Simon Horman Signed-off-by: Jakub Kicinski Stable-dep-of: e5eba42f0134 ("mlx5: Fix default values in create CQ") Signed-off-by: Sasha Levin

net/mlx5: Store the global doorbell in mlx5_priv

2025-11-24T09:37:31Z

[ Upstream commit aa4595d0ada65d5d44fa924a42a87c175d9d88e3 ] The global doorbell is used for more than just Ethernet resources, so move it out of mlx5e_hw_objs into a common place (mlx5_priv), to avoid non-Ethernet modules (e.g. HWS, ASO) depending on Ethernet structs. Use this opportunity to consolidate it with the 'uar' pointer already there, which was used as an RX doorbell. Underneath the 'uar' pointer is identical to 'bfreg->up', so store a single resource and use that instead. For CQ doorbells, care is taken to always use bfreg->up->index instead of bfreg->index, which may refer to a subsequent UAR page from the same ALLOC_UAR batch on some NICs. This paves the way for cleanly supporting multiple doorbells in the Ethernet driver. Signed-off-by: Cosmin Ratiu Reviewed-by: Dragos Tatulea Signed-off-by: Tariq Toukan Reviewed-by: Simon Horman Signed-off-by: Jakub Kicinski Stable-dep-of: e5eba42f0134 ("mlx5: Fix default values in create CQ") Signed-off-by: Sasha Levin

net/mlx5: fs, fix UAF in flow counter release

2025-09-24T00:17:30Z

Fix a kernel trace [1] caused by releasing an HWS action of a local flow counter in mlx5_cmd_hws_delete_fte(), where the HWS action refcount and mutex were not initialized and the counter struct could already be freed when deleting the rule. Fix it by adding the missing initializations and adding refcount for the local flow counter struct. [1] Kernel log: Call Trace: dump_stack_lvl+0x34/0x48 mlx5_fs_put_hws_action.part.0.cold+0x21/0x94 [mlx5_core] mlx5_fc_put_hws_action+0x96/0xad [mlx5_core] mlx5_fs_destroy_fs_actions+0x8b/0x152 [mlx5_core] mlx5_cmd_hws_delete_fte+0x5a/0xa0 [mlx5_core] del_hw_fte+0x1ce/0x260 [mlx5_core] mlx5_del_flow_rules+0x12d/0x240 [mlx5_core] ? ttwu_queue_wakelist+0xf4/0x110 mlx5_ib_destroy_flow+0x103/0x1b0 [mlx5_ib] uverbs_free_flow+0x20/0x50 [ib_uverbs] destroy_hw_idr_uobject+0x1b/0x50 [ib_uverbs] uverbs_destroy_uobject+0x34/0x1a0 [ib_uverbs] uobj_destroy+0x3c/0x80 [ib_uverbs] ib_uverbs_run_method+0x23e/0x360 [ib_uverbs] ? uverbs_finalize_object+0x60/0x60 [ib_uverbs] ib_uverbs_cmd_verbs+0x14f/0x2c0 [ib_uverbs] ? do_tty_write+0x1a9/0x270 ? file_tty_write.constprop.0+0x98/0xc0 ? new_sync_write+0xfc/0x190 ib_uverbs_ioctl+0xd7/0x160 [ib_uverbs] __x64_sys_ioctl+0x87/0xc0 do_syscall_64+0x59/0x90 Fixes: b581f4266928 ("net/mlx5: fs, manage flow counters HWS action sharing by refcount") Signed-off-by: Moshe Shemesh Reviewed-by: Yevgeny Kliteynik Reviewed-by: Mark Bloch Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/1758525094-816583-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski

net/mlx5e: Harden uplink netdev access against device unbind

2025-09-17T00:19:11Z

The function mlx5_uplink_netdev_get() gets the uplink netdevice pointer from mdev->mlx5e_res.uplink_netdev. However, the netdevice can be removed and its pointer cleared when unbound from the mlx5_core.eth driver. This results in a NULL pointer, causing a kernel panic. BUG: unable to handle page fault for address: 0000000000001300 at RIP: 0010:mlx5e_vport_rep_load+0x22a/0x270 [mlx5_core] Call Trace: mlx5_esw_offloads_rep_load+0x68/0xe0 [mlx5_core] esw_offloads_enable+0x593/0x910 [mlx5_core] mlx5_eswitch_enable_locked+0x341/0x420 [mlx5_core] mlx5_devlink_eswitch_mode_set+0x17e/0x3a0 [mlx5_core] devlink_nl_eswitch_set_doit+0x60/0xd0 genl_family_rcv_msg_doit+0xe0/0x130 genl_rcv_msg+0x183/0x290 netlink_rcv_skb+0x4b/0xf0 genl_rcv+0x24/0x40 netlink_unicast+0x255/0x380 netlink_sendmsg+0x1f3/0x420 __sock_sendmsg+0x38/0x60 __sys_sendto+0x119/0x180 do_syscall_64+0x53/0x1d0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Ensure the pointer is valid before use by checking it for NULL. If it is valid, immediately call netdev_hold() to take a reference, and preventing the netdevice from being freed while it is in use. Fixes: 7a9fb35e8c3a ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode") Signed-off-by: Jianbo Liu Reviewed-by: Cosmin Ratiu Reviewed-by: Jiri Pirko Reviewed-by: Dragos Tatulea Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/1757939074-617281-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski

RDMA support for DMA handle

2025-07-23T05:38:56Z

From Yishai: This patch series introduces a new DMA Handle (DMAH) object, along with corresponding APIs for its allocation and deallocation. The DMAH object encapsulates attributes relevant for DMA transactions. While initially intended to support TLP Processing Hints (TPH) [1], the design is extensible to accommodate future features such as PCI multipath for DMA, PCI UIO configurations, traffic class selection, and more. Additionally, we introduce a new ioctl method on the MR object: UVERBS_METHOD_REG_MR. This method consolidates multiple reg_mr variants under a single user-space ioctl interface, supporting: ibv_reg_mr(), ibv_reg_mr_iova(), ibv_reg_mr_iova2() and ibv_reg_dmabuf_mr(). It also enables passing a DMA handle as part of the registration process. Throughout the patch series, the following DMAH-related stuff can also be observed in the IB layer: - Association with a CPU ID and its memory type, for use with Steering Tags [2]. - Inclusion of Processing Hints (PH) data for TPH functionality [3]. - Enforces security by ensuring that only tasks allowed to run on a given CPU may request a DMA handle for it. - Reference counting for DMAH life cycle management and safe usage across memory regions. mlx5 driver implementation: -------------------------- The series includes implementation of the above functionality in the mlx5 driver. In mlx5_core: - Enables TPH over PCIe when both firmware and OS support it. - Manages Steering Tags and corresponding indices by writing tag values to the PCI configuration space. - Exposes APIs to upper layers (e.g., mlx5_ib) to enable the PCIe TPH functionality. In mlx5_ib: - Adds full support for DMAH operations. - Utilizes mlx5_core's Steering Tag APIs to derive tag indices from input. - Stores the resulting index in a mlx5_dmah structure for use during MKEY creation with a DMA handle. - Adds support for allowing MKEYs to be created in conjunction with DMA handles. Additional details are provided in the commit messages. [1] Background, from PCIe specification 6.2. TLP Processing Hints (TPH) -------------------------- TLP Processing Hints is an optional feature that provides hints in Request TLP headers to facilitate optimized processing of Requests that target Memory Space. These Processing Hints enable the system hardware (e.g., the Root Complex and/ or Endpoints) to optimize platform resources such as system and memory interconnect on a per TLP basis. Steering Tags are system-specific values used to identify a processing resource that a Requester explicitly targets. System software discovers and identifies TPH capabilities to determine the Steering Tag allocation for each Function that supports TPH [2] Steering Tags Functions that intend to target a TLP towards a specific processing resource such as a host processor or system cache hierarchy require topological information of the target cache (e.g., which host cache). Steering Tags are system-specific values that provide information about the host or cache structure in the system cache hierarchy. These values are used to associate processing elements within the platform with the processing of Requests. [3] Processing Hints The Requester provides hints to the Root Complex or other targets about the intended use of data and data structures by the host and/or device. The hints are provided by the Requester, which has knowledge of upcoming Request patterns, and which the Completer would not be able to deduce autonomously (with good accuracy) Yishai Signed-off-by: Leon Romanovsky * mlx5-next: net/mlx5: Add support for device steering tag net/mlx5: Expose IFC bits for TPH PCI/TPH: Expose pcie_tph_get_st_table_size() net/mlx5: Expose cable_length field in PFCC register net/mlx5: Add IFC bits and enums for buf_ownership net/mlx5: Add IFC bits to support RSS for IPSec offload net/mlx5: IFC updates for disabled host PF net/mlx5: Expose disciplined_fr_counter through HCA capabilities in mlx5_ifc

net/mlx5: Add support for device steering tag

2025-07-23T05:27:32Z

Background, from PCIe specification 6.2. TLP Processing Hints (TPH) -------------------------- TLP Processing Hints is an optional feature that provides hints in Request TLP headers to facilitate optimized processing of Requests that target Memory Space. These Processing Hints enable the system hardware (e.g., the Root Complex and/or Endpoints) to optimize platform resources such as system and memory interconnect on a per TLP basis. Steering Tags are system-specific values used to identify a processing resource that a Requester explicitly targets. System software discovers and identifies TPH capabilities to determine the Steering Tag allocation for each Function that supports TPH. This patch adds steering tag support for mlx5 based NICs by: - Enabling the TPH functionality over PCI if both FW and OS support it. - Managing steering tags and their matching steering indexes by writing a ST to an ST index over the PCI configuration space. - Exposing APIs to upper layers (e.g.,mlx5_ib) to allow usage of the PCI TPH infrastructure. Further details: - Upon probing of a device, the feature will be enabled based on both capability detection and OS support. - It will retrieve the appropriate ST for a given CPU ID and memory type using the pcie_tph_get_cpu_st() API. - It will track available ST indices according to the configuration space table size (expected to be 63 entries), reserving index 0 to indicate non-TPH use. - It will assign a free ST index with a ST using the pcie_tph_set_st_entry() API. - It will reuse the same index for identical (CPU ID + memory type) combinations by maintaining a reference count per entry. - It will expose APIs to upper layers (e.g., mlx5_ib) to allow usage of the PCI TPH infrastructure. - SF will use its parent PF stuff. Signed-off-by: Yishai Hadas Link: https://patch.msgid.link/de1ae7398e9e34eacd8c10845683df44fc9e32f8.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky

net/mlx5: Expose IFC bits for TPH

2025-07-23T05:27:32Z

Expose IFC bits for the TPH functionality. Signed-off-by: Yishai Hadas Reviewed-by: Edward Srouji Reviewed-by: Moshe Shemesh Link: https://patch.msgid.link/38ea3a0d56551364214e8edf359c9c77c9a3b71b.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky

net/mlx5: Expose cable_length field in PFCC register

2025-07-20T07:02:14Z

Introduce new "cable_length" field in PFCC register and related fields to enhance rx buffer configuration management: 1. cable_length: Shifts cable length handling to fw by storing a manually entered length from user in PFCC.cable_length 2. lane_rate_oper: In a case where PFCC.cable_length is not supported, helps compute a default cable length Signed-off-by: Oren Sidi Reviewed-by: Alex Lazar Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/1752734895-257735-4-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky

net/mlx5: Add IFC bits and enums for buf_ownership

2025-07-20T07:02:14Z

Extend structure layouts and defines buf_ownership. buf_ownership indicates whether the buffer is managed by SW or FW. Signed-off-by: Oren Sidi Reviewed-by: Alex Lazar Signed-off-by: Tariq Toukan Link: https://patch.msgid.link/1752734895-257735-3-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky