kernel/block/blk-mq-sched.h, branch linux-rolling-lts

blk-mq-sched: unify elevators checking for async requests

2026-03-04T12:19:38Z

[ Upstream commit 1db61b0afdd7e8aa9289c423fdff002603b520b5 ] bfq and mq-deadline consider sync writes as async requests and only reserve tags for sync reads by async_depth, however, kyber doesn't consider sync writes as async requests for now. Consider the case there are lots of dirty pages, and user use fsync to flush dirty pages. In this case sched_tags can be exhausted by sync writes and sync reads can stuck waiting for tag. Hence let kyber follow what mq-deadline and bfq did, and unify async requests checking for all elevators. Signed-off-by: Yu Kuai Reviewed-by: Nilay Shroff Reviewed-by: Hannes Reinecke Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

block: use {alloc|free}_sched data methods

2026-01-02T11:56:51Z

[ Upstream commit 0315476e78c050048e80f66334a310e5581b46bb ] The previous patch introduced ->alloc_sched_data and ->free_sched_data methods. This patch builds upon that by now using these methods during elevator switch and nr_hw_queue update. It's also ensured that scheduler-specific data is allocated and freed through the new callbacks outside of the ->freeze_lock and ->elevator_lock locking contexts, thereby preventing any dependency on pcpu_alloc_mutex. Reviewed-by: Ming Lei Reviewed-by: Yu Kuai Signed-off-by: Nilay Shroff Signed-off-by: Jens Axboe Stable-dep-of: 9869d3a6fed3 ("block: fix race between wbt_enable_default and IO submission") Signed-off-by: Sasha Levin

block: introduce alloc_sched_data and free_sched_data elevator methods

2026-01-02T11:56:51Z

[ Upstream commit 61019afdf6ac17c8e8f9c42665aa1fa82f04a3e2 ] The recent lockdep splat [1] highlights a potential deadlock risk involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_ mutex. The trace shows that the issue occurs when the Kyber scheduler allocates dynamic memory for its elevator data during initialization. To address this, introduce two new elevator operation callbacks: ->alloc_sched_data and ->free_sched_data. The subsequent patch would build upon these newly introduced methods to suppress lockdep splat[1]. [1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/ Signed-off-by: Nilay Shroff Reviewed-by: Ming Lei Signed-off-by: Jens Axboe Stable-dep-of: 9869d3a6fed3 ("block: fix race between wbt_enable_default and IO submission") Signed-off-by: Sasha Levin

block: move elevator tags into struct elevator_resources

2026-01-02T11:56:50Z

[ Upstream commit 04728ce90966c54417fd8120a3820104d18ba68d ] This patch introduces a new structure, struct elevator_resources, to group together all elevator-related resources that share the same lifetime. As a first step, this change moves the elevator tag pointer from struct elv_change_ctx into the new struct elevator_resources. Additionally, rename blk_mq_alloc_sched_tags_batch() and blk_mq_free_sched_tags_batch() to blk_mq_alloc_sched_res_batch() and blk_mq_free_sched_res_batch(), respectively. Introduce two new wrapper helpers, blk_mq_alloc_sched_res() and blk_mq_free_sched_res(), around blk_mq_alloc_sched_tags() and blk_mq_free_sched_tags(). These changes pave the way for consolidating the allocation and freeing of elevator-specific resources into common helper functions. This refactoring improves encapsulation and prepares the code for future extensions, allowing additional elevator-specific data to be added to struct elevator_resources without cluttering struct elv_change_ctx. Subsequent patches will extend struct elevator_resources to include other elevator-related data. Reviewed-by: Ming Lei Reviewed-by: Yu Kuai Signed-off-by: Nilay Shroff Signed-off-by: Jens Axboe Stable-dep-of: 9869d3a6fed3 ("block: fix race between wbt_enable_default and IO submission") Signed-off-by: Sasha Levin

block: unify elevator tags and type xarrays into struct elv_change_ctx

2026-01-02T11:56:50Z

[ Upstream commit 232143b605387b372dee0ec7830f93b93df5f67d ] Currently, the nr_hw_queues update path manages two disjoint xarrays — one for elevator tags and another for elevator type — both used during elevator switching. Maintaining these two parallel structures for the same purpose adds unnecessary complexity and potential for mismatched state. This patch unifies both xarrays into a single structure, struct elv_change_ctx, which holds all per-queue elevator change context. A single xarray, named elv_tbl, now maps each queue (q->id) in a tagset to its corresponding elv_change_ctx entry, encapsulating the elevator tags, type and name references. This unification simplifies the code, improves maintainability, and clarifies ownership of per-queue elevator state. Reviewed-by: Ming Lei Reviewed-by: Yu Kuai Signed-off-by: Nilay Shroff Signed-off-by: Jens Axboe Stable-dep-of: 9869d3a6fed3 ("block: fix race between wbt_enable_default and IO submission") Signed-off-by: Sasha Levin

blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()

2025-09-10T11:25:56Z

This helper only support to allocate the default number of requests, add a new parameter to support specific number of requests. Prepare to fix potential deadlock in the case nr_requests grow. Signed-off-by: Yu Kuai Reviewed-by: Nilay Shroff Signed-off-by: Jens Axboe

blk-mq: fix elevator depth_updated method

2025-09-05T19:52:52Z

Current depth_updated has some problems: 1) depth_updated() will be called for each hctx, while all elevators will update async_depth for the disk level, this is not related to hctx; 2) In blk_mq_update_nr_requests(), if previous hctx update succeed and this hctx update failed, q->nr_requests will not be updated, while async_depth is already updated with new nr_reqeuests in previous depth_updated(); 3) All elevators are using q->nr_requests to calculate async_depth now, however, q->nr_requests is still the old value when depth_updated() is called from blk_mq_update_nr_requests(); Those problems are first from error path, then mq-deadline, and recently for bfq and kyber, fix those problems by: - pass in request_queue instead of hctx; - move depth_updated() after q->nr_requests is updated in blk_mq_update_nr_requests(); - add depth_updated() call inside init_sched() method to initialize async_depth; - remove init_hctx() method for mq-deadline and bfq that is useless now; Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes") Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code") Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap") Signed-off-by: Yu Kuai Reviewed-by: Hannes Reinecke Reviewed-by: Li Nan Reviewed-by: Nilay Shroff Link: https://lore.kernel.org/r/20250821060612.1729939-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe

block: fix potential deadlock while running nr_hw_queue update

2025-07-30T12:20:51Z

Move scheduler tags (sched_tags) allocation and deallocation outside both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues. This change breaks the dependency chain from the percpu allocator lock to the elevator lock, helping to prevent potential deadlocks, as observed in the reported lockdep splat[1]. This commit introduces batch allocation and deallocation helpers for sched_tags, which are now used from within __blk_mq_update_nr_hw_queues routine while iterating through the tagset. With this change, all sched_tags memory management is handled entirely outside the ->elevator_lock and the ->freeze_lock context, thereby eliminating the lock dependency that could otherwise manifest during nr_hw_queues updates. [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reported-by: Stefan Haberland Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reviewed-by: Ming Lei Reviewed-by: Christoph Hellwig Reviewed-by: Hannes Reinecke Signed-off-by: Nilay Shroff Link: https://lore.kernel.org/r/20250730074614.2537382-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe

block: fix lockdep warning caused by lock dependency in elv_iosched_store

2025-07-30T12:20:51Z

Recent lockdep reports [1] have revealed a potential deadlock caused by a lock dependency between the percpu allocator lock and the elevator lock. This issue can be avoided by ensuring that the allocation and release of scheduler tags (sched_tags) are performed outside the elevator lock. Furthermore, the queue does not need to be remain frozen during these operations. To address this, move all sched_tags allocations and deallocations outside of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of the elevator queue and its associated sched_tags is closely tied, the allocated sched_tags are now stored in the elevator queue structure. Then, during the actual elevator switch (which runs under ->freeze_lock and ->elevator_lock), the pre-allocated sched_tags are assigned to the appropriate q->hctx. Once the elevator switch is complete and the locks are released, the old elevator queue and its associated sched_tags are freed. This commit specifically addresses the allocation/deallocation of sched_ tags during elevator switching. Note that sched_tags may also be allocated in other contexts, such as during nr_hw_queues updates. Supporting that use case will require batch allocation/deallocation, which will be handled in a follow-up patch. This restructuring ensures that sched_tags memory management occurs entirely outside of the ->elevator_lock and ->freeze_lock context, eliminating the lock dependency problem seen during scheduler updates. [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reported-by: Stefan Haberland Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reviewed-by: Ming Lei Reviewed-by: Christoph Hellwig Reviewed-by: Hannes Reinecke Signed-off-by: Nilay Shroff Link: https://lore.kernel.org/r/20250730074614.2537382-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe

blk-mq: make sure elevator callbacks aren't called for passthrough request

2023-05-19T01:42:54Z

In case of q->elevator, passthrough request can still be marked as RQF_ELV, so some elevator callbacks will be called for them. Fix this by splitting RQF_SCHED_TAGS, which is set for all requests that are issued on a queue that uses an I/O scheduler, and RQF_USE_SCHED for non-flush, non-passthrough requests on such a queue. Roughly based on two different patches from Ming Lei . Signed-off-by: Christoph Hellwig Reviewed-by: Ming Lei Link: https://lore.kernel.org/r/20230518053101.760632-4-hch@lst.de Signed-off-by: Jens Axboe