kernel/block, branch linux-5.19.y

blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init()

2022-10-24T07:58:30Z

commit 285febabac4a16655372d23ff43e89ff6f216691 upstream. commit 8c5035dfbb94 ("blk-wbt: call rq_qos_add() after wb_normal is initialized") moves wbt_set_write_cache() before rq_qos_add(), which is wrong because wbt_rq_qos() is still NULL. Fix the problem by removing wbt_set_write_cache() and setting 'rwb->wc' directly. Noted that this patch also remove the redundant setting of 'rab->wc'. Fixes: 8c5035dfbb94 ("blk-wbt: call rq_qos_add() after wb_normal is initialized") Reported-by: kernel test robot Link: https://lore.kernel.org/r/202210081045.77ddf59b-yujie.liu@intel.com Signed-off-by: Yu Kuai Reviewed-by: Ming Lei Link: https://lore.kernel.org/r/20221009101038.1692875-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

blk-mq: use quiesced elevator switch when reinitializing queues

2022-10-24T07:58:28Z

[ Upstream commit 8237c01f1696bc53c470493bf1fe092a107648a6 ] The hctx's run_work may be racing with the elevator switch when reinitializing hardware queues. The queue is merely frozen in this context, but that only prevents requests from allocating and doesn't stop the hctx work from running. The work may get an elevator pointer that's being torn down, and can result in use-after-free errors and kernel panics (example below). Use the quiesced elevator switch instead, and make the previous one static since it is now only used locally. nvme nvme0: resetting controller nvme nvme0: 32/0/0 default/read/poll queues BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0 Oops: 0000 [#1] SMP PTI Workqueue: kblockd blk_mq_run_work_fn RIP: 0010:kyber_has_work+0x29/0x70 ... Call Trace: __blk_mq_do_dispatch_sched+0x83/0x2b0 __blk_mq_sched_dispatch_requests+0x12e/0x170 blk_mq_sched_dispatch_requests+0x30/0x60 __blk_mq_run_hw_queue+0x2b/0x50 process_one_work+0x1ef/0x380 worker_thread+0x2d/0x3e0 Signed-off-by: Keith Busch Reviewed-by: Ming Lei Reviewed-by: Christoph Hellwig Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-throttle: prevent overflow while calculating wait time

2022-10-24T07:58:25Z

[ Upstream commit 8d6bbaada2e0a65f9012ac4c2506460160e7237a ] There is a problem found by code review in tg_with_in_bps_limit() that 'bps_limit * jiffy_elapsed_rnd' might overflow. Fix the problem by calling mul_u64_u64_div_u64() instead. Signed-off-by: Yu Kuai Acked-by: Tejun Heo Link: https://lore.kernel.org/r/20220829022240.3348319-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin

blk-wbt: call rq_qos_add() after wb_normal is initialized

2022-10-24T07:56:59Z

commit 8c5035dfbb9475b67c82b3fdb7351236525bf52b upstream. Our test found a problem that wbt inflight counter is negative, which will cause io hang(noted that this problem doesn't exist in mainline): t1: device create t2: issue io add_disk blk_register_queue wbt_enable_default wbt_init rq_qos_add // wb_normal is still 0 /* * in mainline, disk can't be opened before * bdev_add(), however, in old kernels, disk * can be opened before blk_register_queue(). */ blkdev_issue_flush // disk size is 0, however, it's not checked submit_bio_wait submit_bio blk_mq_submit_bio rq_qos_throttle wbt_wait bio_to_wbt_flags rwb_enabled // wb_normal is 0, inflight is not increased wbt_queue_depth_changed(&rwb->rqos); wbt_update_limits // wb_normal is initialized rq_qos_track wbt_track rq->wbt_flags |= bio_to_wbt_flags(rwb, bio); // wb_normal is not 0，wbt_flags will be set t3: io completion blk_mq_free_request rq_qos_done wbt_done wbt_is_tracked // return true __wbt_done wbt_rqw_done atomic_dec_return(&rqw->inflight); // inflight is decreased commit 8235b5c1e8c1 ("block: call bdev_add later in device_add_disk") can avoid this problem, however it's better to fix this problem in wbt: 1) Lower kernel can't backport this patch due to lots of refactor. 2) Root cause is that wbt call rq_qos_add() before wb_normal is initialized. Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism") Cc: Signed-off-by: Yu Kuai Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

blk-throttle: fix that io throttle can only work for single bio

2022-10-24T07:56:59Z

commit 320fb0f91e55ba248d4bad106b408e59099cfa89 upstream. Test scripts: cd /sys/fs/cgroup/blkio/ echo "8:0 1024" > blkio.throttle.write_bps_device echo $$ > cgroup.procs dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct & dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct & Test result: 10240 bytes (10 kB, 10 KiB) copied, 10.0134 s, 1.0 kB/s 10240 bytes (10 kB, 10 KiB) copied, 10.0135 s, 1.0 kB/s The problem is that the second bio is finished after 10s instead of 20s. Root cause: 1) second bio will be flagged: __blk_throtl_bio while (true) { ... if (sq->nr_queued[rw]) -> some bio is throttled already break }; bio_set_flag(bio, BIO_THROTTLED); -> flag the bio 2) flagged bio will be dispatched without waiting: throtl_dispatch_tg tg_may_dispatch tg_with_in_bps_limit if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED)) *wait = 0; -> wait time is zero return true; commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit") support to count split bios for iops limit, thus it adds flagged bio checking in tg_with_in_bps_limit() so that split bios will only count once for bps limit, however, it introduce a new problem that io throttle won't work if multiple bios are throttled. In order to fix the problem, handle iops/bps limit in different ways: 1) for iops limit, there is no flag to record if the bio is throttled, and iops is always applied. 2) for bps limit, original bio will be flagged with BIO_BPS_THROTTLED, and io throttle will ignore bio with the flag. Noted this patch also remove the code to set flag in __bio_clone(), it's introduced in commit 111be8839817 ("block-throttle: avoid double charge"), and author thinks split bio can be resubmited and throttled again, which is wrong because split bio will continue to dispatch from caller. Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit") Cc: Signed-off-by: Yu Kuai Acked-by: Tejun Heo Link: https://lore.kernel.org/r/20220829022240.3348319-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

Revert "block: freeze the queue earlier in del_gendisk"

2022-09-28T09:32:28Z

commit 4c66a326b5ab784cddd72de07ac5b6210e9e1b06 upstream. This reverts commit a09b314005f3a0956ebf56e01b3b80339df577cc. Dusty Mabe reported consistent hang during CoreOS shutdown with a MD RAID1 setup. Although apparently similar hangs happened before, and this patch most likely is not the root cause it made it much more severe. Revert it until we can figure out what is going on with the md driver. Signed-off-by: Christoph Hellwig Link: https://lore.kernel.org/r/20220919144049.978907-1-hch@lst.de Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: Do not call blk_put_queue() if gendisk allocation fails

2022-09-28T09:32:23Z

commit aa0c680c3aa96a5f9f160d90dd95402ad578e2b0 upstream. Commit 6f8191fdf41d ("block: simplify disk shutdown") removed the call to blk_get_queue() during gendisk allocation but missed to remove the corresponding cleanup code blk_put_queue() for it. Thus, if the gendisk allocation fails, the request_queue refcount gets decremented and reaches 0, causing blk_mq_release() to be called with a hctx still alive. That triggers a WARNING report, as found by syzkaller: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 23016 at block/blk-mq.c:3881 blk_mq_release+0xf8/0x3e0 block/blk-mq.c:3881 [...] stripped RIP: 0010:blk_mq_release+0xf8/0x3e0 block/blk-mq.c:3881 [...] stripped Call Trace: blk_release_queue+0x153/0x270 block/blk-sysfs.c:780 kobject_cleanup lib/kobject.c:673 [inline] kobject_release lib/kobject.c:704 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x1c8/0x540 lib/kobject.c:721 __alloc_disk_node+0x4f7/0x610 block/genhd.c:1388 __blk_mq_alloc_disk+0x13b/0x1f0 block/blk-mq.c:3961 loop_add+0x3e2/0xaf0 drivers/block/loop.c:1978 loop_control_ioctl+0x133/0x620 drivers/block/loop.c:2150 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] stripped Fixes: 6f8191fdf41d ("block: simplify disk shutdown") Reported-by: syzbot+31c9594f6e43b9289b25@syzkaller.appspotmail.com Suggested-by: Hillf Danton Signed-off-by: Rafael Mendonca Reviewed-by: Christoph Hellwig Link: https://lore.kernel.org/r/20220811232338.254673-1-rafaelmendsr@gmail.com Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: call blk_mq_exit_queue from disk_release for never added disks

2022-09-28T09:32:23Z

commit c5db2cfc6274692d821d33b59acb6ff615e350c1 upstream. To undo the all initialization from blk_mq_init_allocated_queue in case of a probe failure where add_disk is never called we have to call blk_mq_exit_queue from put_disk. This relies on the fact that drivers always call blk_mq_free_tag_set after calling put_disk in the probe error path if they have a gendisk at all. We should be doing this in general, but can't do it for the normal teardown case (yet) as the tagset can be gone by the time the disk is released once it was added. I hope to sort this out properly eventually but for now this isolated hack will do it. Fixes: 6f8191fdf41d ("block: simplify disk shutdown") Signed-off-by: Christoph Hellwig Reviewed-by: Ming Lei Link: https://lore.kernel.org/r/20220720130541.1323531-2-hch@lst.de Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

blk-mq: fix error handling in __blk_mq_alloc_disk

2022-09-28T09:32:23Z

commit 0a3e5cc7bbfcd571a2e53779ef7d7aa3c57d5432 upstream. To fully clean up the queue if the disk allocation fails we need to call blk_mq_destroy_queue and not just blk_put_queue. Fixes: 6f8191fdf41d ("block: simplify disk shutdown") Signed-off-by: Christoph Hellwig Reviewed-by: Ming Lei Link: https://lore.kernel.org/r/20220720130541.1323531-1-hch@lst.de Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman

block: simplify disk shutdown

2022-09-28T09:32:01Z

[ Upstream commit 6f8191fdf41d3a53cc1d63fe2234e812c55a0092 ] Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for all disks that do not have separately allocated queues, and thus remove the need to call blk_cleanup_queue for them. Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that this function is intended only for separately allocated blk-mq queues. This saves an extra queue freeze for devices without a separately allocated queue. Signed-off-by: Christoph Hellwig Reviewed-by: Hannes Reinecke Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de Signed-off-by: Jens Axboe Stable-dep-of: 8fe4ce5836e9 ("scsi: core: Fix a use-after-free") Signed-off-by: Sasha Levin