kernel/drivers/nvme, branch linux-6.9.y

nvme: fix NVME_NS_DEAC may incorrectly identifying the disk as EXT_LBA.

2024-07-25T07:53:33Z

[ Upstream commit 9570a48847e3acfa1a741cef431c923325ddc637 ] The value of NVME_NS_DEAC is 3, which means NVME_NS_METADATA_SUPPORTED | NVME_NS_EXT_LBAS. Provide a unique value for this feature flag. Fixes 1b96f862eccc ("nvme: implement the DEAC bit for the Write Zeroes command") Signed-off-by: Boyang Yu Reviewed-by: Kanchan Joshi Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvmet: always initialize cqe.result

2024-07-25T07:53:27Z

[ Upstream commit cd0c1b8e045a8d2785342b385cb2684d9b48e426 ] The spec doesn't mandate that the first two double words (aka results) for the command queue entry need to be set to 0 when they are not used (not specified). Though, the target implemention returns 0 for TCP and FC but not for RDMA. Let's make RDMA behave the same and thus explicitly initializing the result field. This prevents leaking any data from the stack. Signed-off-by: Daniel Wagner Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvme: avoid double free special payload

2024-07-25T07:53:27Z

[ Upstream commit e5d574ab37f5f2e7937405613d9b1a724811e5ad ] If a discard request needs to be retried, and that retry may fail before a new special payload is added, a double free will result. Clear the RQF_SPECIAL_LOAD when the request is cleaned. Signed-off-by: Chunguang Xu Reviewed-by: Sagi Grimberg Reviewed-by: Max Gurtovoy Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvme-fabrics: use reserved tag for reg read/write command

2024-07-25T07:53:23Z

[ Upstream commit 7dc3bfcb4c9cc58970fff6aaa48172cb224d85aa ] In some scenarios, if too many commands are issued by nvme command in the same time by user tasks, this may exhaust all tags of admin_q. If a reset (nvme reset or IO timeout) occurs before these commands finish, reconnect routine may fail to update nvme regs due to insufficient tags, which will cause kernel hang forever. In order to workaround this issue, maybe we can let reg_read32()/reg_read64()/reg_write32() use reserved tags. This maybe safe for nvmf: 1. For the disable ctrl path, we will not issue connect command 2. For the enable ctrl / fw activate path, since connect and reg_xx() are called serially. So the reserved tags may still be enough while reg_xx() use reserved tags. Signed-off-by: Chunguang Xu Reviewed-by: Sagi Grimberg Reviewed-by: Chaitanya Kulkarni Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvmet: fix a possible leak when destroy a ctrl during qp establishment

2024-07-11T10:51:23Z

[ Upstream commit c758b77d4a0a0ed3a1292b3fd7a2aeccd1a169a4 ] In nvmet_sq_destroy we capture sq->ctrl early and if it is non-NULL we know that a ctrl was allocated (in the admin connect request handler) and we need to release pending AERs, clear ctrl->sqs and sq->ctrl (for nvme-loop primarily), and drop the final reference on the ctrl. However, a small window is possible where nvmet_sq_destroy starts (as a result of the client giving up and disconnecting) concurrently with the nvme admin connect cmd (which may be in an early stage). But *before* kill_and_confirm of sq->ref (i.e. the admin connect managed to get an sq live reference). In this case, sq->ctrl was allocated however after it was captured in a local variable in nvmet_sq_destroy. This prevented the final reference drop on the ctrl. Solve this by re-capturing the sq->ctrl after all inflight request has completed, where for sure sq->ctrl reference is final, and move forward based on that. This issue was observed in an environment with many hosts connecting multiple ctrls simoutanuosly, creating a delay in allocating a ctrl leading up to this race window. Reported-by: Alex Turin Signed-off-by: Sagi Grimberg Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset

2024-07-11T10:51:22Z

[ Upstream commit 1bd293fcf3af84674e82ed022c049491f3768840 ] bio_vec start offset may be relatively large particularly when large folio gets added to the bio. A bigger offset will result in avoiding the single-segment mapping optimization and end up using expensive mempool_alloc further. Rather than using absolute value, adjust bv_offset by NVME_CTRL_PAGE_SIZE while checking if segment can be fitted into one/two PRP entries. Suggested-by: Christoph Hellwig Signed-off-by: Kundan Kumar Reviewed-by: Sagi Grimberg Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvme-multipath: find NUMA path only for online numa-node

2024-07-11T10:51:21Z

[ Upstream commit d3a043733f25d743f3aa617c7f82dbcb5ee2211a ] In current native multipath design when a shared namespace is created, we loop through each possible numa-node, calculate the NUMA distance of that node from each nvme controller and then cache the optimal IO path for future reference while sending IO. The issue with this design is that we may refer to the NUMA distance table for an offline node which may not be populated at the time and so we may inadvertently end up finding and caching a non-optimal path for IO. Then latter when the corresponding numa-node becomes online and hence the NUMA distance table entry for that node is created, ideally we should re-calculate the multipath node distance for the newly added node however that doesn't happen unless we rescan/reset the controller. So essentially, we may keep using non-optimal IO path for a node which is made online after namespace is created. This patch helps fix this issue ensuring that when a shared namespace is created, we calculate the multipath node distance for each online numa-node instead of each possible numa-node. Then latter when a node becomes online and we receive any IO on that newly added node, we would calculate the multipath node distance for newly added node but this time NUMA distance table would have been already populated for newly added node. Hence we would be able to correctly calculate the multipath node distance and choose the optimal path for the IO. Signed-off-by: Nilay Shroff Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvmet-fc: Remove __counted_by from nvmet_fc_tgt_queue.fod[]

2024-07-05T07:38:12Z

commit 440e2051c577896275c0e0513ec26964e04c7810 upstream. Work for __counted_by on generic pointers in structures (not just flexible array members) has started landing in Clang 19 (current tip of tree). During the development of this feature, a restriction was added to __counted_by to prevent the flexible array member's element type from including a flexible array member itself such as: struct foo { int count; char buf[]; }; struct bar { int count; struct foo data[] __counted_by(count); }; because the size of data cannot be calculated with the standard array size formula: sizeof(struct foo) * count This restriction was downgraded to a warning but due to CONFIG_WERROR, it can still break the build. The application of __counted_by on the fod member of 'struct nvmet_fc_tgt_queue' triggers this restriction, resulting in: drivers/nvme/target/fc.c:151:2: error: 'counted_by' should not be applied to an array with element of unknown size because 'struct nvmet_fc_fcp_iod' is a struct type with a flexible array member. This will be an error in a future compiler version [-Werror,-Wbounds-safety-counted-by-elt-type-unknown-size] 151 | struct nvmet_fc_fcp_iod fod[] __counted_by(sqsize); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. Remove this use of __counted_by to fix the warning/error. However, rather than remove it altogether, leave it commented, as it may be possible to support this in future compiler releases. Cc: stable@vger.kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issues/2027 Fixes: ccd3129aca28 ("nvmet-fc: Annotate struct nvmet_fc_tgt_queue with __counted_by") Signed-off-by: Nathan Chancellor Signed-off-by: Keith Busch Signed-off-by: Greg Kroah-Hartman

nvmet: make 'tsas' attribute idempotent for RDMA

2024-07-05T07:38:01Z

[ Upstream commit 0f1f5803920d2a6b88bee950914fd37421e17170 ] The RDMA transport defines values for TSAS, but it cannot be changed as we only support the 'connected' mode. So to avoid errors during reconfiguration we should allow to write the current value. Fixes: 3f123494db72 ("nvmet: make TCP sectype settable via configfs") Signed-off-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Sasha Levin

nvmet: do not return 'reserved' for empty TSAS values

2024-07-05T07:38:01Z

[ Upstream commit f31e85a4d7c6ac4a3e014129c9cdc31592ea29f3 ] The 'TSAS' value is only defined for TCP and RDMA, but returning 'reserved' for undefined values tricked nvmetcli to try to write 'reserved' when restoring from a config file. This caused an error and the configuration would not be applied. Fixes: 3f123494db72 ("nvmet: make TCP sectype settable via configfs") Signed-off-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Reviewed-by: Chaitanya Kulkarni Signed-off-by: Keith Busch Signed-off-by: Sasha Levin