kernel - Hosts the 0x221E linux distro kernel.

Age	Commit message (Collapse)	Author
2025-11-25	block: Remove references to __device_add_disk()	John Garry
	Since commit d1254a874971 ("block: remove support for delayed queue registrations"), function __device_add_disk() has been replaced with device_add_disk(), so fix up comments. Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-20	PCI/P2PDMA: Simplify bus address mapping API	Leon Romanovsky
	Update the pci_p2pdma_bus_addr_map() function to take a direct pointer to the p2pdma_provider structure instead of the pci_p2pdma_map_state. This simplifies the API by removing the need for callers to extract the provider from the state structure. The change updates all callers across the kernel (block layer, IOMMU, DMA direct, and HMM) to pass the provider pointer directly, making the code more explicit and reducing unnecessary indirection. This also removes the runtime warning check since callers now have direct control over which provider they use. Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-2-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20	block: use min() instead of min_t()	David Laight
	min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'. Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long' and so cannot discard significant bits. In this case the 'unsigned long' value is small enough that the result is ok. (Similarly for max_t() and clamp_t().) Detected by an extra check added to min_t(). Signed-off-by: David Laight <david.laight.linux@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18	block: remove the declaration of elevator_init_mq function	Chengkaitao
	In commit 1e44bedbc921 ("block: unifying elevator change"), the elevator_init_mq function was deleted, but its declaration in elevator.h was overlooked. This patch fixes it. Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18	Revert "block: consider discard merge last"	Jens Axboe
	This reverts commit 2516c246d01c23a5f5310e9ac78d9f8aad9b1d0e. Suspected issues with discard merging post this patch, hence revert it for now. Link: https://lore.kernel.org/linux-block/26acdfdf-de13-430b-8c73-f890c7689a84@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18	block: Remove queue freezing from several sysfs store callbacks	Bart Van Assche
	Freezing the request queue from inside sysfs store callbacks may cause a deadlock in combination with the dm-multipath driver and the queue_if_no_path option. Additionally, freezing the request queue slows down system boot on systems where sysfs attributes are set synchronously. Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls from the store callbacks that do not strictly need these callbacks. Add the __data_racy annotation to request_queue.rq_timeout to suppress KCSAN data race reports about the rq_timeout reads. This patch may cause a small delay in applying the new settings. For all the attributes affected by this patch, I/O will complete correctly whether the old or the new value of the attribute is used. This patch affects the following sysfs attributes: * io_poll_delay * io_timeout * nomerges * read_ahead_kb * rq_affinity Here is an example of a deadlock triggered by running test srp/002 if this patch is not applied: task:multipathd Call Trace: <TASK> __schedule+0x8c1/0x1bf0 schedule+0xdd/0x270 schedule_preempt_disabled+0x1c/0x30 __mutex_lock+0xb89/0x1650 mutex_lock_nested+0x1f/0x30 dm_table_set_restrictions+0x823/0xdf0 __bind+0x166/0x590 dm_swap_table+0x2a7/0x490 do_resume+0x1b1/0x610 dev_suspend+0x55/0x1a0 ctl_ioctl+0x3a5/0x7e0 dm_ctl_ioctl+0x12/0x20 __x64_sys_ioctl+0x127/0x1a0 x64_sys_call+0xe2b/0x17d0 do_syscall_64+0x96/0x3a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> task:(udev-worker) Call Trace: <TASK> __schedule+0x8c1/0x1bf0 schedule+0xdd/0x270 blk_mq_freeze_queue_wait+0xf2/0x140 blk_mq_freeze_queue_nomemsave+0x23/0x30 queue_ra_store+0x14e/0x290 queue_attr_store+0x23e/0x2c0 sysfs_kf_write+0xde/0x140 kernfs_fop_write_iter+0x3b2/0x630 vfs_write+0x4fd/0x1390 ksys_write+0xfd/0x230 __x64_sys_write+0x76/0xc0 x64_sys_call+0x276/0x17d0 do_syscall_64+0x96/0x3a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Nilay Shroff <nilay@linux.ibm.com> Cc: Martin Wilck <mwilck@suse.com> Cc: Benjamin Marzinski <bmarzins@redhat.com> Cc: stable@vger.kernel.org Fixes: af2814149883 ("block: freeze the queue in queue_attr_store") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18	block: plug attempts to batch allocate tags multiple times	Xue He
	This patch aims to enable batch allocation of sufficient tags after batch IO submission with plug mechanism, thereby avoiding the need for frequent individual requests when the initial allocation is insufficient. ----------------------------------------------------------- HW: 16 CPUs/16 poll queues Disk: Samsung PM9A3 Gen4 3.84T CMD: [global] ioengine=io_uring group_reporting=1 time_based=1 runtime=1m refill_buffers=1 norandommap=1 randrepeat=0 fixedbufs=1 registerfiles=1 rw=randread iodepth=128 iodepth_batch_submit=32 iodepth_batch_complete_min=32 iodepth_batch_complete_max=128 iodepth_low=32 bs=4k numjobs=1 direct=1 hipri=1 [job1] filename=/dev/nvme0n1 name=batch_test ------------------------------------------------------------ Perf: base code: __blk_mq_alloc_requests() 1.47% patch: __blk_mq_alloc_requests() 0.75% ------------------------------------------------------------ Signed-off-by: hexue <xue01.he@samsung.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17	block: rate-limit capacity change info log	Li Chen
	loop devices under heavy stress-ng loop streessor can trigger many capacity change events in a short time. Each event prints an info message from set_capacity_and_notify(), flooding the console and contributing to soft lockups on slow consoles. Switch the printk in set_capacity_and_notify() to pr_info_ratelimited() so frequent capacity changes do not spam the log while still reporting occasional changes. Cc: stable@vger.kernel.org Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17	Merge tag 'vfs-6.18-rc7.fixes' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix unitialized variable in statmount_string() - Fix hostfs mounting when passing host root during boot - Fix dynamic lookup to fail on cell lookup failure - Fix missing file type when reading bfs inodes from disk - Enforce checking of sb_min_blocksize() calls and update all callers accordingly - Restore write access before closing files opened by open_exec() in binfmt_misc - Always freeze efivarfs during suspend/hibernate cycles - Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to actually allow mount namespace file descriptor in addition to mount namespace ids - Fix tmpfs remount when noswap is specified - Switch Landlock to iput_not_last() to remove false-positives from might_sleep() annotations in iput() - Remove dead node_to_mnt_ns() code - Ensure that per-queue kobjects are successfully created * tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: landlock: fix splats from iput() after it started calling might_sleep() fs: add iput_not_last() shmem: fix tmpfs reconfiguration (remount) when noswap is set fs/namespace: correctly handle errors returned by grab_requested_mnt_ns power: always freeze efivarfs binfmt_misc: restore write access before closing files opened by open_exec() block: add __must_check attribute to sb_min_blocksize() virtio-fs: fix incorrect check for fsvq->kobj xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super isofs: check the return value of sb_min_blocksize() in isofs_fill_super exfat: check return value of sb_min_blocksize in exfat_read_boot_sector vfat: fix missing sb_min_blocksize() return value checks mnt: Remove dead code which might prevent from building bfs: Reconstruct file type when loading from disk afs: Fix dynamic lookup to fail on cell lookup failure hostfs: Fix only passing host root in boot stage with new mount fs: Fix uninitialized 'offp' in statmount_string()
2025-11-17	block/blk-throttle: Remove throtl_slice from struct throtl_data	Guenter Roeck
	throtl_slice is now a constant. Remove the variable and use the constant directly where needed. Cc: Yu Kuai <yukuai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17	block/blk-throttle: drop unneeded blk_stat_enable_accounting	Guenter Roeck
	After the removal of CONFIG_BLK_DEV_THROTTLING_LOW, it is no longer necessary to enable block accounting, so remove the call to blk_stat_enable_accounting(). With that, the track_bio_latency variable is no longer used and can be deleted from struct throtl_data. Also, including blk-stat.h is no longer necessary. Fixes: bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Cc: Yu Kuai <yukuai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17	block/blk-throttle: Fix throttle slice time for SSDs	Guenter Roeck
	Commit d61fcfa4bb18 ("blk-throttle: choose a small throtl_slice for SSD") introduced device type specific throttle slices if BLK_DEV_THROTTLING_LOW was enabled. Commit bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") removed support for BLK_DEV_THROTTLING_LOW, but left the device type specific throttle slices in place. This effectively changed throttling behavior on systems with SSD which now use a different and non-configurable slice time compared to non-SSD devices. Practical impact is that throughput tests with low configured throttle values (65536 bps) experience less than expected throughput on SSDs, presumably due to rounding errors associated with the small throttle slice time used for those devices. The same tests pass when setting the throttle values to 65536 * 4 = 262144 bps. The original code sets the throttle slice time to DFL_THROTL_SLICE_HD if CONFIG_BLK_DEV_THROTTLING_LOW is disabled. Restore that code to fix the problem. With that, DFL_THROTL_SLICE_SSD is no longer necessary. Revert to the original code and re-introduce DFL_THROTL_SLICE to replace both DFL_THROTL_SLICE_HD and DFL_THROTL_SLICE_SSD. This effectively reverts commit d61fcfa4bb18 ("blk-throttle: choose a small throtl_slice for SSD"). While at it, also remove MAX_THROTL_SLICE since it is not used anymore. Fixes: bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Cc: Yu Kuai <yukuai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-17	block: consider discard merge last	Keith Busch
	If the next discard range is contiguous with the current range being considered, it's cheaper to expand the current range than to append an additional bio. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-14	block-dma: properly take MMIO path	Leon Romanovsky
	In commit eadaa8b255f3 ("dma-mapping: introduce new DMA attribute to indicate MMIO memory"), DMA_ATTR_MMIO attribute was added to describe MMIO addresses, which require to avoid any memory cache flushing, as an outcome of the discussion pointed in Link tag below. In case of PCI_P2PDMA_MAP_THRU_HOST_BRIDGE transfer, blk-mq-dm logic treated this as regular page and relied on "struct page" DMA flow. That flow performs CPU cache flushing, which shouldn't be done here, and doesn't set IOMMU_MMIO flag in DMA-IOMMU case. As a solution, let's encode peer-to-peer transaction type in NVMe IOD flags variable and provide it to blk-mq-dma API. Link: https://lore.kernel.org/all/f912c446-1ae9-4390-9c11-00dce7bf0fd3@arm.com/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-14	nvme-pci: migrate to dma_map_phys instead of map_page	Leon Romanovsky
	After introduction of dma_map_phys(), there is no need to convert from physical address to struct page in order to map page. So let's use it directly. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13	block: define alloc_sched_data and free_sched_data methods for kyber	Nilay Shroff
	Currently, the Kyber elevator allocates its private data dynamically in ->init_sched and frees it in ->exit_sched. However, since ->init_sched is invoked during elevator switch after acquiring both ->freeze_lock and ->elevator_lock, it may trigger the lockdep splat [1] due to dependency on pcpu_alloc_mutex. To resolve this, move the elevator data allocation and deallocation logic from ->init_sched and ->exit_sched into the newly introduced ->alloc_sched_data and ->free_sched_data methods. These callbacks are invoked before acquiring ->freeze_lock and ->elevator_lock, ensuring that memory allocation happens safely without introducing additional locking dependencies. This change breaks the dependency chain involving pcpu_alloc_mutex and prevents the reported lockdep warning. [1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/ Reported-by: Changhui Zhong <czhong@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/ Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13	block: use {alloc\|free}_sched data methods	Nilay Shroff
	The previous patch introduced ->alloc_sched_data and ->free_sched_data methods. This patch builds upon that by now using these methods during elevator switch and nr_hw_queue update. It's also ensured that scheduler-specific data is allocated and freed through the new callbacks outside of the ->freeze_lock and ->elevator_lock locking contexts, thereby preventing any dependency on pcpu_alloc_mutex. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13	block: introduce alloc_sched_data and free_sched_data elevator methods	Nilay Shroff
	The recent lockdep splat [1] highlights a potential deadlock risk involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_ mutex. The trace shows that the issue occurs when the Kyber scheduler allocates dynamic memory for its elevator data during initialization. To address this, introduce two new elevator operation callbacks: ->alloc_sched_data and ->free_sched_data. The subsequent patch would build upon these newly introduced methods to suppress lockdep splat[1]. [1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13	block: move elevator tags into struct elevator_resources	Nilay Shroff
	This patch introduces a new structure, struct elevator_resources, to group together all elevator-related resources that share the same lifetime. As a first step, this change moves the elevator tag pointer from struct elv_change_ctx into the new struct elevator_resources. Additionally, rename blk_mq_alloc_sched_tags_batch() and blk_mq_free_sched_tags_batch() to blk_mq_alloc_sched_res_batch() and blk_mq_free_sched_res_batch(), respectively. Introduce two new wrapper helpers, blk_mq_alloc_sched_res() and blk_mq_free_sched_res(), around blk_mq_alloc_sched_tags() and blk_mq_free_sched_tags(). These changes pave the way for consolidating the allocation and freeing of elevator-specific resources into common helper functions. This refactoring improves encapsulation and prepares the code for future extensions, allowing additional elevator-specific data to be added to struct elevator_resources without cluttering struct elv_change_ctx. Subsequent patches will extend struct elevator_resources to include other elevator-related data. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13	block: unify elevator tags and type xarrays into struct elv_change_ctx	Nilay Shroff
	Currently, the nr_hw_queues update path manages two disjoint xarrays — one for elevator tags and another for elevator type — both used during elevator switching. Maintaining these two parallel structures for the same purpose adds unnecessary complexity and potential for mismatched state. This patch unifies both xarrays into a single structure, struct elv_change_ctx, which holds all per-queue elevator change context. A single xarray, named elv_tbl, now maps each queue (q->id) in a tagset to its corresponding elv_change_ctx entry, encapsulating the elevator tags, type and name references. This unification simplifies the code, improves maintainability, and clarifies ownership of per-queue elevator state. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13	block: fix NULL pointer dereference in disk_report_zones()	Damien Le Moal
	Commit 2284eec5053d ("block: introduce blkdev_get_zone_info()") introduced the report_active field in struct blk_report_zones_args so that open and closed zones can be reported with the condition BLK_ZONE_COND_ACTIVE in the case of a cached report zone. However, the args pointer to a struct blk_report_zones_args that is passed to disk_report_zones() can be NULL, e.g. in the case of internal report zones operations for device mapper zoned targets. Fix disk_report_zones() to make sure to check that the args is not null before updating a zone condition for cached zone reports. Fixes: 2284eec5053d ("block: introduce blkdev_get_zone_info()") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13	block: fix NULL pointer dereference in blk_zone_reset_all_bio_endio()	Damien Le Moal
	For zoned block devices that do not need zone write plugs (e.g. most device mapper devices that support zones), the disk hash table of zone write plugs is NULL. For such devices, blk_zone_reset_all_bio_endio() should not attempt to scan this has table as that causes a NULL pointer dereference. Fix this by checking that the disk does have zone write plugs using the atomic counter. This is equivalent to checking for a non-NULL hash table but has the advantage to also speed up the execution of blk_zone_reset_all_bio_endio() for devices that do use zone write plugs but do not have any plug in the hash table (e.g. a disk with only full zones). Fixes: efae226c2ef1 ("block: handle zone management operations completions") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-12	blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller	Bart Van Assche
	Move the following code into the only caller of disk_zone_wplug_add_bio(): - The code for clearing the REQ_NOWAIT flag. - The code that sets the BLK_ZONE_WPLUG_PLUGGED flag. - The disk_zone_wplug_schedule_bio_work() call. This patch moves all code that is related to REQ_NOWAIT or to bio scheduling into a single function. Additionally, the 'schedule_bio_work' variable is removed. No functionality has been changed. Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlmoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-12	blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking	Bart Van Assche
	Document that all callers hold this lock because the code in disk_zone_wplug_schedule_bio_work() depends on this. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-12	blk-zoned: Fix a typo in a source code comment	Bart Van Assche
	Remove a superfluous parenthesis that was introduced by commit fa8555630b32 ("blk-zoned: Improve the queue reference count strategy documentation"). Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-12	bdev: add hint prints in sb_set_blocksize() for LBS dependency on THP	Baokun Li
	Support for block sizes greater than the page size depends on large folios, which in turn require CONFIG_TRANSPARENT_HUGEPAGE to be enabled. Because the code is wrapped in multiple layers of abstraction, this dependency is rather obscure, so users may not realize it and may be unsure how to enable LBS. As suggested by Theodore, I have added hint messages in sb_set_blocksize so that users can distinguish whether a mount failure with block size larger than page size is due to lack of filesystem support or the absence of CONFIG_TRANSPARENT_HUGEPAGE. Suggested-by: Theodore Ts'o <tytso@mit.edu> Link: https://patch.msgid.link/20251110043226.GD2988753@mit.edu Signed-off-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20251110124714.1329978-1-libaokun@huaweicloud.com Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11	blk-mq-dma: fix kernel-doc function name for integrity DMA iterator	Kriish Sharma
	Documentation build reported: Warning: block/blk-mq-dma.c:373 expecting prototype for blk_rq_integrity_dma_map_iter_start(). Prototype was for blk_rq_integrity_dma_map_iter_next() instead The kernel-doc comment above `blk_rq_integrity_dma_map_iter_next()` used the wrong function name (`blk_rq_integrity_dma_map_iter_start`) in its header. This patch corrects the function name in the kernel-doc block to match the actual implementation, ensuring clean documentation builds. Fixes: fec9b16dc555 ("blk-mq-dma: add scatter-less integrity data DMA mapping") Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11	block: fix merging data-less bios	Keith Busch
	The data segment gaps the block layer tracks doesn't apply to bio's that don't have data. Skip calculating this to fix a NULL pointer access. Fixes: 2f6b2565d43cdb5 ("block: accumulate memory segment gaps per bio") Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11	block: add lockdep to queue_limits_commit_update()	Chaitanya Kulkarni
	queue_limits_commit_update() expects q->limits_lock to be held by the caller (via queue_limits_start_update()). The API pattern is: lim = queue_limits_start_update(q); /* acquires lock / / modify lim / queue_limits_commit_update(q, &lim); / releases lock / OR queue_limits_commit_update_frozen(q, &lim); lim = queue_limits_start_update(q); / acquires lock / queue_limits_commit_update(q, &lim); / releases lock */ Add lockdep_assert_held() to report incorrect API usage. Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-08	block: clean up indentation in blk_rq_map_iter_init()	Caleb Sander Mateos
	blk_rq_map_iter_init() has one line with 7 spaces of indentation and another that mixes 1 tab and 8 spaces. Convert both to tabs. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-07	block: introduce bdev_zone_start()	Damien Le Moal
	Introduce the function bdev_zone_start() as a more explicit (and clear) replacement for ALIGN_DOWN() to get the start sector of a zone containing a particular sector of a zoned block device. Use this new helper in blkdev_get_zone_info() and blkdev_report_zones_cached(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-07	block: refactor disk_zone_wplug_sync_wp_offset()	Damien Le Moal
	The helper function blk_zone_wp_offset() is called from disk_zone_wplug_sync_wp_offset(), and again called from blk_revalidate_seq_zone() right after the call to disk_zone_wplug_sync_wp_offset(). Change disk_zone_wplug_sync_wp_offset() to return the value of obtained with blk_zone_wp_offset() to avoid this double call, which simplifies a little blk_revalidate_seq_zone(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-07	block: improve blk_zone_wp_offset()	Damien Le Moal
	blk_zone_wp_offset() is always called with a struct blk_zone obtained from the device, that is, it will never see the BLK_ZONE_COND_ACTIVE condition. However, handling this condition makes this function more solid and will also avoid issues when propagating cached report requests to underlying stacked devices is implemented. Add BLK_ZONE_COND_ACTIVE as a new case in blk_zone_wp_offset() switch. Also while at it, change the handling of the full condition to return UINT_MAX for the zone write pointer to reflect the fact that the write pointer of a full zone is invalid. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-07	block: don't return 1 for the fallback case in blkdev_get_zone_info	Christoph Hellwig
	blkdev_do_report_zones returns the number of reported zones, but blkdev_get_zone_info returns 0 or an errno. Translate to the expected return value in blkdev_report_zone_fallback. Fixes: b037d41762fd ("block: introduce blkdev_get_zone_info()") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06	block: accumulate memory segment gaps per bio	Keith Busch
	The blk-mq dma iterator has an optimization for requests that align to the device's iommu merge boundary. This boundary may be larger than the device's virtual boundary, but the code had been depending on that queue limit to know ahead of time if the request is guaranteed to align to that optimization. Rather than rely on that queue limit, which many devices may not report, save the lowest set bit of any boundary gap between each segment in the bio while checking the segments. The request stores the value for merging and quickly checking per io if the request can use iova optimizations. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06	block: fix cached zone reporting after zone append was used	Christoph Hellwig
	No zone plugs are allocated when a zone is opened by calling Zone Append on it. This makes the cached zone reporting report incorrectly empty zones if the file system is unmounted and report zones is called after that, e.g. by xfstests test cases using the scratch device. Fix this by recording if zone append was used on a device, and disable cached reporting for the device until a ZONE_RESET_ALL happens that guarantees all zones are empty. We could probably do even better using a per-zone flag, but the practical use cache for zone reporting after the initial mount are rather limited, so let's keep things simple for now. Fixes: 31f0656a4ab7 ("block: introduce blkdev_report_zones_cached()") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06	block: don't leak disk->zones_cond for !disk_need_zone_resources	Christoph Hellwig
	disk->zones_cond is allocated for all zoned devices, but disk_free_zone_resources skips it when the zone write plug hash is not allocated, leaking the allocation for non-mq devices that don't emulate zone append. This is reported by kmemleak-enabled xfstests for various tests that use simple device mapper targets. Fix this by moving all code that requires writes plugs from disk_free_zone_resources into disk_destroy_zone_wplugs_hash_table and executing the rest of the code, including the disk->zones_cond freeing unconditionally. Fixes: 6e945ffb6555 ("block: use zone condition to determine conventional zones") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: add zone write plug condition to debugfs zone_wplugs	Damien Le Moal
	Modify queue_zone_wplug_show() to include the condition of a zone write plug to the zone_wplugs debugfs attribute of a zoned block device. To improve readability and ease of use, rather than the zone condition raw value, the zone condition name is given using blk_zone_cond_str(). Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: improve zone_wplugs debugfs attribute output	Damien Le Moal
	Make the output of the zone_wplugs debugfs attribute file more easily readable by adding the name of the zone write plugs fields in the output. No functional changes. Suggested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: introduce BLKREPORTZONESV2 ioctl	Damien Le Moal
	Introduce the new BLKREPORTZONESV2 ioctl command to allow user applications access to the fast zone report implemented by blkdev_report_zones_cached(). This new ioctl is defined as number 142 and is documented in include/uapi/linux/fs.h. Unlike the existing BLKREPORTZONES ioctl, this new ioctl uses the flags field of struct blk_zone_report also as an input. If the user sets the BLK_ZONE_REP_CACHED flag as an input, then blkdev_report_zones_cached() is used to generate the zone report using cached zone information. If this flag is not set, then BLKREPORTZONESV2 behaves in the same manner as BLKREPORTZONES and the zone report is generated by accessing the zoned device. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: introduce blkdev_report_zones_cached()	Damien Le Moal
	Introduce the function blkdev_report_zones_cached() to provide a fast report zone built using the blkdev_get_zone_info() function, which gets zone information from a disk zones_cond array or zone write plugs. For a large capacity SMR drive, such fast report zone can be completed in a few milliseconds compared to several seconds completion times when the report zone is obtained from the device. The zone report is built in the same manner as with the regular blkdev_report_zones() function, that is, the first zone reported is the one containing the specified start sector and the report is limited to the specified number of zones (nr_zones argument). The information for each zone in the report is obtained using blkdev_get_zone_info(). For zoned devices that do not use zone write plug resources, using blkdev_get_zone_info() is inefficient as the zone report would be very slow, generated one zone at a time. To avoid this, blkdev_report_zones_cached() falls back to calling blkdev_do_report_zones() to execute a regular zone report. In this case, the .report_active field of struct blk_report_zones_args is set to true to report zone conditions using the BLK_ZONE_COND_ACTIVE condition in place of the implicit open, explicit open and closed conditions. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: introduce blkdev_get_zone_info()	Damien Le Moal
	Introduce the function blkdev_get_zone_info() to obtain a single zone information from cached zone data, that is, either from the zone write plug for the target zone if it exists and from the disk zones_cond array otherwise. Since sequential zones that do not have a zone write plug are either full, empty or in a bad state (read-only or offline), the zone write pointer can be inferred from the zone condition cached in the disk zones_cond array. For sequential zones that have a zone write plug, the zone condition and zone write pointer are obtained from the condition and write pointer offset managed with the zone write plug. This allows obtaining the information for a zone much more quickly than having to execute a report zones command on the device. blkdev_get_zone_info() falls back to using a regular zone report if the target zone is flagged as needing an update with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag, or if the target device does not use zone write plugs (i.e. a device mapper device). In this case, the new function blkdev_report_zone_fallback() is used and the zone condition is reported consistantly with the cahced report, that is, the BLK_ZONE_COND_ACTIVE condition is used in place of the implicit open, explicit open and closed conditions. This is achieved by adding the .report_active field to struct blk_report_zones_args and by having disk_report_zone() sets the correct zone condition if .report_active is true. In preparation for using blkdev_get_zone_info() in upcoming file systems changes, also export this function as a GPL symbol. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: refactor blkdev_report_zones() code	Damien Le Moal
	In preparation for implementing cached report zone, split the main part of the code of blkdev_report_zones() into the helper function blkdev_do_report_zones(), with this new helper taking as argument a struct blk_report_zones_args pointer instead of a report callback function and its private argument. No functional changes. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: track zone conditions	Damien Le Moal
	The function blk_revalidate_zone_cond() already caches the condition of all zones of a zoned block device in the zones_cond array of a gendisk. However, the zone conditions are updated only when the device is scanned or revalidated. Implement tracking of the runtime changes to zone conditions using the new cond field in struct blk_zone_wplug. The size of this structure remains 112 Bytes as the new field replaces the 4 Bytes padding at the end of the structure. Beause zones that do not have a zone write plug can be in the empty, implicit open, explicit open or full condition, the zones_cond array of a disk is used to track the conditions, of zones that do not have a zone write plug. The condition of such zone is updated in the disk zones_cond array when a zone reset, reset all or finish operation is executed, and also when a zone write plug is removed from the disk hash table when the zone becomes full. Since a device may automatically close an implicitly open zone when writing to an empty or closed zone, if the total number of open zones has reached the device limit, the BLK_ZONE_COND_IMP_OPEN and BLK_ZONE_COND_CLOSED zone conditions cannot be precisely tracked. To overcome this, the zone condition BLK_ZONE_COND_ACTIVE is introduced to represent a zone that has the condition BLK_ZONE_COND_IMP_OPEN, BLK_ZONE_COND_EXP_OPEN or BLK_ZONE_COND_CLOSED. This follows the definition of an active zone as defined in the NVMe Zoned Namespace specifications. As such, for a zoned device that has a limit on the maximum number of open zones, we will never have more zones in the BLK_ZONE_COND_ACTIVE condition than the device limit. This is compatible with the SCSI ZBC and ATA ZAC specifications for SMR HDDs as these devices do not have a limit on the number of active zones. The function disk_zone_wplug_set_wp_offset() is modified to use the new helper disk_zone_wplug_update_cond() to update a zone write plug condition whenever a zone write plug write offset is updated on submission or merging of write BIOs to a zone. The functions blk_zone_reset_bio_endio(), blk_zone_reset_all_bio_endio() and blk_zone_finish_bio_endio() are modified to update the condition of the zones targeted by reset, reset_all and finish operations, either using though disk_zone_wplug_set_wp_offset() for zones that have a zone write plug, or using the disk_zone_set_cond() helper to update the zones_cond array of the disk for zones that do not have a zone write plug. When a zone write plug is removed from the disk hash table (when the zone becomes empty or full), the condition of struct blk_zone_wplug is used to update the disk zones_cond array. Conversely, when a zone write plug is added to the disk hash table, the zones_cond array is used to initialize the zone write plug condition. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: use zone condition to determine conventional zones	Damien Le Moal
	The conv_zones_bitmap field of struct gendisk is used to define a bitmap to identify the conventional zones of a zoned block device. The bit for a zone is set in this bitmap if the zone is a conventional one, that is, if the zone type is BLK_ZONE_TYPE_CONVENTIONAL. For such zone, this always corresponds to the zone condition BLK_ZONE_COND_NOT_WP. In other words, conv_zones_bitmap tracks a single condition of the zones of a zoned block device. In preparation for tracking more zone conditions, change conv_zones_bitmap into an array of zone conditions, using 1 byte per zone. This increases the memory usage from 1 bit per zone to 1 byte per zone, that is, from 16 KiB to about 100 KiB for a 30 TB SMR HDD with 256 MiB zones. This is a trade-off to allow fast cached report zones later on top of this change. Rename the conv_zones_bitmap field of struct gendisk to zones_cond. Add a blk_revalidate_zone_cond() function to initialize the zones_cond array of a disk during device scan and to update it on device revalidation. Move the allocation of the zones_cond array to disk_revalidate_zone_resources(), making sure that this array is always allocated, even for devices that do not need zone write plugs (zone resources), to ensure that bdev_zone_is_seq() can be re-implemented to use the zone condition array in place of the conv zones bitmap. Finally, the function bdev_zone_is_seq() is rewritten to use a test on the condition of the target zone. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: reorganize struct blk_zone_wplug	Damien Le Moal
	Reorganize the fields of struct blk_zone_wplug to remove a hole after the wp_offset field and avoid having the bio_work structure split between 2 cache lines. No functional changes. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: introduce disk_report_zone()	Damien Le Moal
	Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer alignment") introduced an indirect call for the callback function of a report zones executed with blkdev_report_zones(). This is necessary so that the function disk_zone_wplug_sync_wp_offset() can be called to refresh a zone write plug zone write pointer offset after a write error. However, this solution makes following the path of a zone information harder to understand. Clean this up by introducing the new blk_report_zones_args structure to define a zone report callback and its private data and introduce the helper function disk_report_zone() which calls both disk_zone_wplug_sync_wp_offset() and the zone report user callback function for all zones of a zone report. This helper function must be called by all block device drivers that implement the report zones block operation in order to correctly report a zone information. All block device drivers supporting the report_zones block operation are updated to use this new scheme. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: cleanup blkdev_report_zones()	Damien Le Moal
	The variable capacity is used only in one place and so can be removed and get_capacity(disk) used directly instead. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: freeze queue when updating zone resources	Damien Le Moal
	Modify disk_update_zone_resources() to freeze the device queue before updating the number of zones, zone capacity and other zone related resources. The locking order resulting from the call to queue_limits_commit_update_frozen() is preserved, that is, the queue limits lock is first taken by calling queue_limits_start_update() before freezing the queue, and the queue is unfrozen after executing queue_limits_commit_update(), which replaces the call to queue_limits_commit_update_frozen(). This change ensures that there are no in-flights I/Os when the zone resources are updated due to a zone revalidation. In case of error when the limits are applied, directly call disk_free_zone_resources() from disk_update_zone_resources() while the disk queue is still frozen to avoid needing to freeze & unfreeze the queue again in blk_revalidate_disk_zones(), thus simplifying that function code a little. Fixes: 0b83c86b444a ("block: Prevent potential deadlock in blk_revalidate_disk_zones()") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05	block: handle zone management operations completions	Damien Le Moal
	The functions blk_zone_wplug_handle_reset_or_finish() and blk_zone_wplug_handle_reset_all() both modify the zone write pointer offset of zone write plugs that are the target of a reset, reset all or finish zone management operation. However, these functions do this modification before the BIO is executed. So if the zone operation fails, the modified zone write pointer offsets become invalid. Avoid this by modifying the zone write pointer offset of a zone write plug that is the target of a zone management operation when the operation completes. To do so, modify blk_zone_bio_endio() to call the new function blk_zone_mgmt_bio_endio() which in turn calls the functions blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or blk_zone_finish_bio_endio() depending on the operation of the completed BIO, to modify a zone write plug write pointer offset accordingly. These functions are called only if the BIO execution was successful. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>