summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2022-06-14md/raid0: Ignore RAID0 layout if the second zone has only one devicePascal Hambourg
commit ea23994edc4169bd90d7a9b5908c6ccefd82fa40 upstream. The RAID0 layout is irrelevant if all members have the same size so the array has only one zone. It is *also* irrelevant if the array has two zones and the second zone has only one device, for example if the array has two members of different sizes. So in that case it makes sense to allow assembly even when the layout is undefined, like what is done when the array has only one zone. Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Pascal Hambourg <pascal@plouf.fr.eu.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14md: protect md_unregister_thread from reentrancyGuoqing Jiang
[ Upstream commit 1e267742283a4b5a8ca65755c44166be27e9aa0f ] Generally, the md_unregister_thread is called with reconfig_mutex, but raid_message in dm-raid doesn't hold reconfig_mutex to unregister thread, so md_unregister_thread can be called simulitaneously from two call sites in theory. Then after previous commit which remove the protection of reconfig_mutex for md_unregister_thread completely, the potential issue could be worse than before. Let's take pers_lock at the beginning of function to ensure reentrancy. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-09md: bcache: check the return value of kzalloc() in detached_dev_do_request()Jia-Ju Bai
commit 40f567bbb3b0639d2ec7d1c6ad4b1b018f80cf19 upstream. The function kzalloc() in detached_dev_do_request() can fail, so its return value should be checked. Fixes: bc082a55d25c ("bcache: fix inaccurate io state for detached bcache devices") Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09md: fix double free of io_acct_set biosetXiao Ni
commit 42b805af102471f53e3c7867b8c2b502ea4eef7e upstream. Now io_acct_set is alloc and free in personality. Remove the codes that free io_acct_set in md_free and md_stop. Fixes: 0c031fd37f69 (md: Move alloc/free acct bioset in to personality) Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09md: Don't set mddev private to NULL in raid0 pers->freeXiao Ni
commit 0f2571ad7a30ff6b33cde142439f9378669f8b4f upstream. In normal stop process, it does like this: do_md_stop | __md_stop (pers->free(); mddev->private=NULL) | md_free (free mddev) __md_stop sets mddev->private to NULL after pers->free. The raid device will be stopped and mddev memory is free. But in reshape, it doesn't free the mddev and mddev will still be used in new raid. In reshape, it first sets mddev->private to new_pers and then runs old_pers->free(). Now raid0 sets mddev->private to NULL in raid0_free. The new raid can't work anymore. It will panic when dereference mddev->private because of NULL pointer dereference. It can panic like this: [63010.814972] kernel BUG at drivers/md/raid10.c:928! [63010.819778] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [63010.825011] CPU: 3 PID: 44437 Comm: md0_resync Kdump: loaded Not tainted 5.14.0-86.el9.x86_64 #1 [63010.833789] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.15.0 09/11/2020 [63010.841440] RIP: 0010:raise_barrier+0x161/0x170 [raid10] [63010.865508] RSP: 0018:ffffc312408bbc10 EFLAGS: 00010246 [63010.870734] RAX: 0000000000000000 RBX: ffffa00bf7d39800 RCX: 0000000000000000 [63010.877866] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa00bf7d39800 [63010.884999] RBP: 0000000000000000 R08: fffffa4945e74400 R09: 0000000000000000 [63010.892132] R10: ffffa00eed02f798 R11: 0000000000000000 R12: ffffa00bbc435200 [63010.899266] R13: ffffa00bf7d39800 R14: 0000000000000400 R15: 0000000000000003 [63010.906399] FS: 0000000000000000(0000) GS:ffffa00eed000000(0000) knlGS:0000000000000000 [63010.914485] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63010.920229] CR2: 00007f5cfbe99828 CR3: 0000000105efe000 CR4: 00000000003506e0 [63010.927363] Call Trace: [63010.929822] ? bio_reset+0xe/0x40 [63010.933144] ? raid10_alloc_init_r10buf+0x60/0xa0 [raid10] [63010.938629] raid10_sync_request+0x756/0x1610 [raid10] [63010.943770] md_do_sync.cold+0x3e4/0x94c [63010.947698] md_thread+0xab/0x160 [63010.951024] ? md_write_inc+0x50/0x50 [63010.954688] kthread+0x149/0x170 [63010.957923] ? set_kthread_struct+0x40/0x40 [63010.962107] ret_from_fork+0x22/0x30 Removing the code that sets mddev->private to NULL in raid0 can fix problem. Fixes: 0c031fd37f69 (md: Move alloc/free acct bioset in to personality) Reported-by: Fine Fan <ffan@redhat.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09bcache: avoid journal no-space deadlock by reserving 1 journal bucketColy Li
commit 32feee36c30ea06e38ccb8ae6e5c44c6eec790a6 upstream. The journal no-space deadlock was reported time to time. Such deadlock can happen in the following situation. When all journal buckets are fully filled by active jset with heavy write I/O load, the cache set registration (after a reboot) will load all active jsets and inserting them into the btree again (which is called journal replay). If a journaled bkey is inserted into a btree node and results btree node split, new journal request might be triggered. For example, the btree grows one more level after the node split, then the root node record in cache device super block will be upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no space in journal buckets, the journal replay has to wait for new journal bucket to be reclaimed after at least one journal bucket replayed. This is one example that how the journal no-space deadlock happens. The solution to avoid the deadlock is to reserve 1 journal bucket in run time, and only permit the reserved journal bucket to be used during cache set registration procedure for things like journal replay. Then the journal space will never be fully filled, there is no chance for journal no-space deadlock to happen anymore. This patch adds a new member "bool do_reserve" in struct journal, it is inititalized to 0 (false) when struct journal is allocated, and set to 1 (true) by bch_journal_space_reserve() when all initialization done in run_cache_set(). In the run time when journal_reclaim() tries to allocate a new journal bucket, free_journal_buckets() is called to check whether there are enough free journal buckets to use. If there is only 1 free journal bucket and journal->do_reserve is 1 (true), the last bucket is reserved and free_journal_buckets() will return 0 to indicate no free journal bucket. Then journal_reclaim() will give up, and try next time to see whetheer there is free journal bucket to allocate. By this method, there is always 1 jouranl bucket reserved in run time. During the cache set registration, journal->do_reserve is 0 (false), so the reserved journal bucket can be used to avoid the no-space deadlock. Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()Coly Li
commit 80db4e4707e78cb22287da7d058d7274bd4cb370 upstream. After making bch_sectors_dirty_init() being multithreaded, the existing incremental dirty sector counting in bch_root_node_dirty_init() doesn't release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME) bkeys. Because a read lock is added on btree root node to prevent the btree to be split during the dirty sectors counting, other I/O requester has no chance to gain the write lock even restart bcache_btree(). That is to say, the incremental dirty sectors counting is incompatible to the multhreaded bch_sectors_dirty_init(). We have to choose one and drop another one. In my testing, with 512 bytes random writes, I generate 1.2T dirty data and a btree with 400K nodes. With single thread and incremental dirty sectors counting, it takes 30+ minites to register the backing device. And with multithreaded dirty sectors counting, the backing device registration can be accomplished within 2 minutes. The 30+ minutes V.S. 2- minutes difference makes me decide to keep multithreaded bch_sectors_dirty_init() and drop the incremental dirty sectors counting. This is what this patch does. But INIT_KEYS_EACH_TIME is kept, in sectors_dirty_init_fn() the CPU will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys iterated. This is to avoid the watchdog reports a bogus soft lockup warning. Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09bcache: improve multithreaded bch_sectors_dirty_init()Coly Li
commit 4dc34ae1b45fe26e772a44379f936c72623dd407 upstream. Commit b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") makes bch_sectors_dirty_init() to be much faster when counting dirty sectors by iterating all dirty keys in the btree. But it isn't in ideal shape yet, still can be improved. This patch does the following changes to improve current parallel dirty keys iteration on the btree, - Add read lock to root node when multiple threads iterating the btree, to prevent the root node gets split by I/Os from other registered bcache devices. - Remove local variable "char name[32]" and generate kernel thread name string directly when calling kthread_run(). - Allocate "struct bch_dirty_init_state state" directly on stack and avoid the unnecessary dynamic memory allocation for it. - Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12 which is enough indeed. - Increase &state->started to count created kernel thread after it succeeds to create. - When wait for all dirty key counting threads to finish, use wait_event() to replace wait_event_interruptible(). With the above changes, the code is more clear, and some potential error conditions are avoided. Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09bcache: improve multithreaded bch_btree_check()Coly Li
commit 622536443b6731ec82c563aae7807165adbe9178 upstream. Commit 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded") makes bch_btree_check() to be much faster when checking all btree nodes during cache device registration. But it isn't in ideal shap yet, still can be improved. This patch does the following thing to improve current parallel btree nodes check by multiple threads in bch_btree_check(), - Add read lock to root node while checking all the btree nodes with multiple threads. Although currently it is not mandatory but it is good to have a read lock in code logic. - Remove local variable 'char name[32]', and generate kernel thread name string directly when calling kthread_run(). - Allocate local variable "struct btree_check_state check_state" on the stack and avoid unnecessary dynamic memory allocation for it. - Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12 which is enough indeed. - Increase check_state->started to count created kernel thread after it succeeds to create. - When wait for all checking kernel threads to finish, use wait_event() to replace wait_event_interruptible(). With this change, the code is more clear, and some potential error conditions are avoided. Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09md: fix an incorrect NULL check in md_reload_sbXiaomeng Tong
commit 64c54d9244a4efe9bc6e9c98e13c4bbb8bb39083 upstream. The bug is here: if (!rdev || rdev->desc_nr != nr) { The list iterator value 'rdev' will *always* be set and non-NULL by rdev_for_each_rcu(), so it is incorrect to assume that the iterator value will be NULL if the list is empty or no element found (In fact, it will be a bogus pointer to an invalid struct object containing the HEAD). Otherwise it will bypass the check and lead to invalid memory access passing the check. To fix the bug, use a new variable 'iter' as the list iterator, while using the original variable 'pdev' as a dedicated pointer to point to the found element. Cc: stable@vger.kernel.org Fixes: 70bcecdb1534 ("md-cluster: Improve md_reload_sb to be less error prone") Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09md: fix an incorrect NULL check in does_sb_need_changingXiaomeng Tong
commit fc8738343eefc4ea8afb6122826dea48eacde514 upstream. The bug is here: if (!rdev) The list iterator value 'rdev' will *always* be set and non-NULL by rdev_for_each(), so it is incorrect to assume that the iterator value will be NULL if the list is empty or no element found. Otherwise it will bypass the NULL check and lead to invalid memory access passing the check. To fix the bug, use a new variable 'iter' as the list iterator, while using the original variable 'rdev' as a dedicated pointer to point to the found element. Cc: stable@vger.kernel.org Fixes: 2aa82191ac36 ("md-cluster: Perform a lazy update") Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09md/bitmap: don't set sb values if can't pass sanity checkHeming Zhao
[ Upstream commit e68cb83a57a458b01c9739e2ad9cb70b04d1e6d2 ] If bitmap area contains invalid data, kernel will crash then mdadm triggers "Segmentation fault". This is cluster-md speical bug. In non-clustered env, mdadm will handle broken metadata case. In clustered array, only kernel space handles bitmap slot info. But even this bug only happened in clustered env, current sanity check is wrong, the code should be changed. How to trigger: (faulty injection) dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sda dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sdb mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb mdadm -Ss echo aaa > magic.txt == below modifying slot 2 bitmap data == dd if=magic.txt of=/dev/sda seek=16384 bs=1 count=3 <== destroy magic dd if=/dev/zero of=/dev/sda seek=16436 bs=1 count=4 <== ZERO chunksize mdadm -A /dev/md0 /dev/sda /dev/sdb == kernel crashes. mdadm outputs "Segmentation fault" == Reason of kernel crash: In md_bitmap_read_sb (called by md_bitmap_create), bad bitmap magic didn't block chunksize assignment, and zero value made DIV_ROUND_UP_SECTOR_T() trigger "divide error". Crash log: kernel: md: md0 stopped. kernel: md/raid1:md0: not clean -- starting background reconstruction kernel: md/raid1:md0: active with 2 out of 2 mirrors kernel: dlm: ... ... kernel: md-cluster: Joined cluster 44810aba-38bb-e6b8-daca-bc97a0b254aa slot 1 kernel: md0: invalid bitmap file superblock: bad magic kernel: md_bitmap_copy_from_slot can't get bitmap from slot 2 kernel: md-cluster: Could not gather bitmaps from slot 2 kernel: divide error: 0000 [#1] SMP NOPTI kernel: CPU: 0 PID: 1603 Comm: mdadm Not tainted 5.14.6-1-default kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] kernel: RSP: 0018:ffffc22ac0843ba0 EFLAGS: 00010246 kernel: ... ... kernel: Call Trace: kernel: ? dlm_lock_sync+0xd0/0xd0 [md_cluster 77fe..7a0] kernel: md_bitmap_copy_from_slot+0x2c/0x290 [md_mod 24ea..d3a] kernel: load_bitmaps+0xec/0x210 [md_cluster 77fe..7a0] kernel: md_bitmap_load+0x81/0x1e0 [md_mod 24ea..d3a] kernel: do_md_run+0x30/0x100 [md_mod 24ea..d3a] kernel: md_ioctl+0x1290/0x15a0 [md_mod 24ea....d3a] kernel: ? mddev_unlock+0xaa/0x130 [md_mod 24ea..d3a] kernel: ? blkdev_ioctl+0xb1/0x2b0 kernel: block_ioctl+0x3b/0x40 kernel: __x64_sys_ioctl+0x7f/0xb0 kernel: do_syscall_64+0x59/0x80 kernel: ? exit_to_user_mode_prepare+0x1ab/0x230 kernel: ? syscall_exit_to_user_mode+0x18/0x40 kernel: ? do_syscall_64+0x69/0x80 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae kernel: RIP: 0033:0x7f4a15fa722b kernel: ... ... kernel: ---[ end trace 8afa7612f559c868 ]--- kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-06-06raid5: introduce MD_BROKENMariusz Tkaczyk
commit 57668f0a4cc4083a120cc8c517ca0055c4543b59 upstream. Raid456 module had allowed to achieve failed state. It was fixed by fb73b357fb9 ("raid5: block failing device if raid will be failed"). This fix introduces a bug, now if raid5 fails during IO, it may result with a hung task without completion. Faulty flag on the device is necessary to process all requests and is checked many times, mainly in analyze_stripe(). Allow to set faulty on drive again and set MD_BROKEN if raid is failed. As a result, this level is allowed to achieve failed state again, but communication with userspace (via -EBUSY status) will be preserved. This restores possibility to fail array via #mdadm --set-faulty command and will be fixed by additional verification on mdadm side. Reproduction steps: mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1 mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean mkfs.xfs /dev/md126 -f mount /dev/md126 /mnt/root/ fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 & echo 1 > /sys/block/nvme2n1/device/device/remove echo 1 > /sys/block/nvme1n1/device/device/remove [ 1475.787779] Call Trace: [ 1475.793111] __schedule+0x2a6/0x700 [ 1475.799460] schedule+0x38/0xa0 [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456] [ 1475.813856] ? finish_wait+0x80/0x80 [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456] [ 1475.828281] ? finish_wait+0x80/0x80 [ 1475.834727] ? finish_wait+0x80/0x80 [ 1475.841127] ? finish_wait+0x80/0x80 [ 1475.847480] md_handle_request+0x119/0x190 [ 1475.854390] md_make_request+0x8a/0x190 [ 1475.861041] generic_make_request+0xcf/0x310 [ 1475.868145] submit_bio+0x3c/0x160 [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60 [ 1475.882070] iomap_dio_bio_actor+0x175/0x390 [ 1475.889149] iomap_apply+0xff/0x310 [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.909974] iomap_dio_rw+0x2f2/0x490 [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.923680] ? atime_needs_update+0x77/0xe0 [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs] [ 1475.953403] aio_read+0xd5/0x180 [ 1475.959395] ? _cond_resched+0x15/0x30 [ 1475.965907] io_submit_one+0x20b/0x3c0 [ 1475.972398] __x64_sys_io_submit+0xa2/0x180 [ 1475.979335] ? do_io_getevents+0x7c/0xc0 [ 1475.986009] do_syscall_64+0x5b/0x1a0 [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 1476.000255] RIP: 0033:0x7f11fc27978d [ 1476.006631] Code: Bad RIP value. [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds. Cc: stable@vger.kernel.org Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed") Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-06dm verity: set DM_TARGET_IMMUTABLE feature flagSarthak Kukreti
commit 4caae58406f8ceb741603eee460d79bacca9b1b5 upstream. The device-mapper framework provides a mechanism to mark targets as immutable (and hence fail table reloads that try to change the target type). Add the DM_TARGET_IMMUTABLE flag to the dm-verity target's feature flags to prevent switching the verity target with a different target type. Fixes: a4ffc152198e ("dm: add verity target") Cc: stable@vger.kernel.org Signed-off-by: Sarthak Kukreti <sarthakkukreti@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-06dm stats: add cond_resched when looping over entriesMikulas Patocka
commit bfe2b0146c4d0230b68f5c71a64380ff8d361f8b upstream. dm-stats can be used with a very large number of entries (it is only limited by 1/4 of total system memory), so add rescheduling points to the loops that iterate over the entries. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-06dm crypt: make printing of the key constant-timeMikulas Patocka
commit 567dd8f34560fa221a6343729474536aa7ede4fd upstream. The device mapper dm-crypt target is using scnprintf("%02x", cc->key[i]) to report the current key to userspace. However, this is not a constant-time operation and it may leak information about the key via timing, via cache access patterns or via the branch predictor. Change dm-crypt's key printing to use "%c" instead of "%02x". Also introduce hex2asc() that carefully avoids any branching or memory accesses when converting a number in the range 0 ... 15 to an ascii character. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-06dm integrity: fix error code in dm_integrity_ctr()Dan Carpenter
commit d3f2a14b8906df913cb04a706367b012db94a6e8 upstream. The "r" variable shadows an earlier "r" that has function scope. It means that we accidentally return success instead of an error code. Smatch has a warning for this: drivers/md/dm-integrity.c:4503 dm_integrity_ctr() warn: missing error code 'r' Fixes: 7eada909bfd7 ("dm: add integrity target") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20dm integrity: fix memory corruption when tag_size is less than digest sizeMikulas Patocka
commit 08c1af8f1c13bbf210f1760132f4df24d0ed46d6 upstream. It is possible to set up dm-integrity in such a way that the "tag_size" parameter is less than the actual digest size. In this situation, a part of the digest beyond tag_size is ignored. In this case, dm-integrity would write beyond the end of the ic->recalc_tags array and corrupt memory. The corruption happened in integrity_recalc->integrity_sector_checksum->crypto_shash_final. Fix this corruption by increasing the tags array so that it has enough padding at the end to accomodate the loop in integrity_recalc() being able to write a full digest size for the last member of the tags array. Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20dm mpath: only use ktime_get_ns() in historical selectorKhazhismel Kumykov
[ Upstream commit ce40426fdc3c92acdba6b5ca74bc7277ffaa6a3d ] Mixing sched_clock() and ktime_get_ns() usage will give bad results. Switch hst_select_path() from using sched_clock() to ktime_get_ns(). Also rename path_service_time()'s 'sched_now' variable to 'now'. Fixes: 2613eab11996 ("dm mpath: add Historical Service Time Path Selector") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13dm: requeue IO if mapping table not yet availableMike Snitzer
[ Upstream commit fa247089de9936a46e290d4724cb5f0b845600f5 ] Update both bio-based and request-based DM to requeue IO if the mapping table not available. This race of IO being submitted before the DM device ready is so narrow, yet possible for initial table load given that the DM device's request_queue is created prior, that it best to requeue IO to handle this unlikely case. Reported-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13dm ioctl: prevent potential spectre v1 gadgetJordy Zomer
[ Upstream commit cd9c88da171a62c4b0f1c70e50c75845969fbc18 ] It appears like cmd could be a Spectre v1 gadget as it's supplied by a user and used as an array index. Prevent the contents of kernel memory from being leaked to userspace via speculative execution by using array_index_nospec. Signed-off-by: Jordy Zomer <jordy@pwning.systems> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08dm crypt: fix get_key_size compiler warning if !CONFIG_KEYSAashish Sharma
[ Upstream commit 6fc51504388c1a1a53db8faafe9fff78fccc7c87 ] Explicitly convert unsigned int in the right of the conditional expression to int to match the left side operand and the return type, fixing the following compiler warning: drivers/md/dm-crypt.c:2593:43: warning: signed and unsigned type in conditional expression [-Wsign-compare] Fixes: c538f6ec9f56 ("dm crypt: add ability to use keys from the kernel key retention service") Signed-off-by: Aashish Sharma <shraash@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-08bcache: fixup multiple threads crashMingzhe Zou
commit 887554ab96588de2917b6c8c73e552da082e5368 upstream. When multiple threads to check btree nodes in parallel, the main thread wait for all threads to stop or CACHE_SET_IO_DISABLE flag: wait_event_interruptible(check_state->wait, atomic_read(&check_state->started) == 0 || test_bit(CACHE_SET_IO_DISABLE, &c->flags)); However, the bch_btree_node_read and bch_btree_node_read_done maybe call bch_cache_set_error, then the CACHE_SET_IO_DISABLE will be set. If the flag already set, the main thread return error. At the same time, maybe some threads still running and read NULL pointer, the kernel will crash. This patch change the event wait condition, the main thread must wait for all threads to stop. Fixes: 8e7102273f597 ("bcache: make bch_btree_check() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08dm integrity: set journal entry unused when shrinking deviceMikulas Patocka
commit cc09e8a9dec4f0e8299e80a7a2a8e6f54164a10b upstream. Commit f6f72f32c22c ("dm integrity: don't replay journal data past the end of the device") skips journal replay if the target sector points beyond the end of the device. Unfortunatelly, it doesn't set the journal entry unused, which resulted in this BUG being triggered: BUG_ON(!journal_entry_is_unused(je)) Fix this by calling journal_entry_set_unused() for this case. Fixes: f6f72f32c22c ("dm integrity: don't replay journal data past the end of the device") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Milan Broz <gmazyland@gmail.com> [snitzer: revised header] Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08dm: fix double accounting of flush with dataMike Snitzer
commit 8d394bc4adf588ca4a0650745167cb83f86c18c9 upstream. DM handles a flush with data by first issuing an empty flush and then once it completes the REQ_PREFLUSH flag is removed and the payload is issued. The problem fixed by this commit is that both the empty flush bio and the data payload will account the full extent of the data payload. Fix this by factoring out dm_io_acct() and having it wrap all IO accounting to set the size of bio with REQ_PREFLUSH to 0, account the IO, and then restore the original size. Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08dm: interlock pending dm_io and dm_wait_for_bios_completionMike Snitzer
commit 9f6dc633761006f974701d4c88da71ab68670749 upstream. Commit d208b89401e0 ("dm: fix mempool NULL pointer race when completing IO") didn't go far enough. When bio_end_io_acct ends the count of in-flight I/Os may reach zero and the DM device may be suspended. There is a possibility that the suspend races with dm_stats_account_io. Fix this by adding percpu "pending_io" counters to track outstanding dm_io. Move kicking of suspend queue to dm_io_dec_pending(). Also, rename md_in_flight_bios() to dm_in_flight_bios() and update it to iterate all pending_io counters. Fixes: d208b89401e0 ("dm: fix mempool NULL pointer race when completing IO") Cc: stable@vger.kernel.org Co-developed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08dm: fix use-after-free in dm_cleanup_zoned_dev()Kirill Tkhai
commit 588b7f5df0cb64f281290c7672470c006abe7160 upstream. dm_cleanup_zoned_dev() uses queue, so it must be called before blk_cleanup_disk() starts its killing: blk_cleanup_disk->blk_cleanup_queue()->kobject_put()->blk_release_queue()-> ->...RCU...->blk_free_queue_rcu()->kmem_cache_free() Otherwise, RCU callback may be executed first and dm_cleanup_zoned_dev() will touch free'd memory: BUG: KASAN: use-after-free in dm_cleanup_zoned_dev+0x33/0xd0 Read of size 8 at addr ffff88805ac6e430 by task dmsetup/681 CPU: 4 PID: 681 Comm: dmsetup Not tainted 5.17.0-rc2+ #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x57/0x7d print_address_description.constprop.0+0x1f/0x150 ? dm_cleanup_zoned_dev+0x33/0xd0 kasan_report.cold+0x7f/0x11b ? dm_cleanup_zoned_dev+0x33/0xd0 dm_cleanup_zoned_dev+0x33/0xd0 __dm_destroy+0x26a/0x400 ? dm_blk_ioctl+0x230/0x230 ? up_write+0xd8/0x270 dev_remove+0x156/0x1d0 ctl_ioctl+0x269/0x530 ? table_clear+0x140/0x140 ? lock_release+0xb2/0x750 ? remove_all+0x40/0x40 ? rcu_read_lock_sched_held+0x12/0x70 ? lock_downgrade+0x3c0/0x3c0 ? rcu_read_lock_sched_held+0x12/0x70 dm_ctl_ioctl+0xa/0x10 __x64_sys_ioctl+0xb9/0xf0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fb6dfa95c27 Fixes: bb37d77239af ("dm: introduce zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08dm stats: fix too short end duration_ns when using precise_timestampsMike Snitzer
commit 0cdb90f0f306384ecbc60dfd6dc48cdbc1f2d0d8 upstream. dm_stats_account_io()'s STAT_PRECISE_TIMESTAMPS support doesn't handle the fact that with commit b879f915bc48 ("dm: properly fix redundant bio-based IO accounting") io->start_time _may_ be in the past (meaning the start_io_acct() was deferred until later). Add a new dm_stats_recalc_precise_timestamps() helper that will set/clear a new 'precise_timestamps' flag in the dm_stats struct based on whether any configured stats enable STAT_PRECISE_TIMESTAMPS. And update DM core's alloc_io() to use dm_stats_record_start() to set stats_aux.duration_ns if stats->precise_timestamps is true. Also, remove unused 'last_sector' and 'last_rw' members from the dm_stats struct. Fixes: b879f915bc48 ("dm: properly fix redundant bio-based IO accounting") Cc: stable@vger.kernel.org Co-developed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-17block: fix surprise removal for drivers calling blk_set_queue_dyingChristoph Hellwig
Various block drivers call blk_set_queue_dying to mark a disk as dead due to surprise removal events, but since commit 8e141f9eb803 that doesn't work given that the GD_DEAD flag needs to be set to stop I/O. Replace the driver calls to blk_set_queue_dying with a new (and properly documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the only remaining caller. Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk") Reported-by: Markus Blöchl <markus.bloechl@ipetronik.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02md: fix NULL pointer deref with nowait but no mddev->queueSong Liu
Leon reported NULL pointer deref with nowait support: [ 15.123761] device-mapper: raid: Loading target version 1.15.1 [ 15.124185] device-mapper: raid: Ignoring chunk size parameter for RAID 1 [ 15.124192] device-mapper: raid: Choosing default region size of 4MiB [ 15.129524] BUG: kernel NULL pointer dereference, address: 0000000000000060 [ 15.129530] #PF: supervisor write access in kernel mode [ 15.129533] #PF: error_code(0x0002) - not-present page [ 15.129535] PGD 0 P4D 0 [ 15.129538] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 15.129541] CPU: 5 PID: 494 Comm: ldmtool Not tainted 5.17.0-rc2-1-mainline #1 9fe89d43dfcb215d2731e6f8851740520778615e [ 15.129546] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F36e 10/14/2021 [ 15.129549] RIP: 0010:blk_queue_flag_set+0x7/0x20 [ 15.129555] Code: 00 00 00 0f 1f 44 00 00 48 8b 35 e4 e0 04 02 48 8d 57 28 bf 40 01 \ 00 00 e9 16 c1 be ff 66 0f 1f 44 00 00 0f 1f 44 00 00 89 ff <f0> 48 0f ab 7e 60 \ 31 f6 89 f7 c3 66 66 2e 0f 1f 84 00 00 00 00 00 [ 15.129559] RSP: 0018:ffff966b81987a88 EFLAGS: 00010202 [ 15.129562] RAX: ffff8b11c363a0d0 RBX: ffff8b11e294b070 RCX: 0000000000000000 [ 15.129564] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001d [ 15.129566] RBP: ffff8b11e294b058 R08: 0000000000000000 R09: 0000000000000000 [ 15.129568] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b11e294b070 [ 15.129570] R13: 0000000000000000 R14: ffff8b11e294b000 R15: 0000000000000001 [ 15.129572] FS: 00007fa96e826780(0000) GS:ffff8b18deb40000(0000) knlGS:0000000000000000 [ 15.129575] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 15.129577] CR2: 0000000000000060 CR3: 000000010b8ce000 CR4: 00000000003506e0 [ 15.129580] Call Trace: [ 15.129582] <TASK> [ 15.129584] md_run+0x67c/0xc70 [md_mod 1e470c1b6bcf1114198109f42682f5a2740e9531] [ 15.129597] raid_ctr+0x134a/0x28ea [dm_raid 6a645dd7519e72834bd7e98c23497eeade14cd63] [ 15.129604] ? dm_split_args+0x63/0x150 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129615] dm_table_add_target+0x188/0x380 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129625] table_load+0x13b/0x370 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129635] ? dev_suspend+0x2d0/0x2d0 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129644] ctl_ioctl+0x1bd/0x460 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129655] dm_ctl_ioctl+0xa/0x20 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129663] __x64_sys_ioctl+0x8e/0xd0 [ 15.129667] do_syscall_64+0x5c/0x90 [ 15.129672] ? syscall_exit_to_user_mode+0x23/0x50 [ 15.129675] ? do_syscall_64+0x69/0x90 [ 15.129677] ? do_syscall_64+0x69/0x90 [ 15.129679] ? syscall_exit_to_user_mode+0x23/0x50 [ 15.129682] ? do_syscall_64+0x69/0x90 [ 15.129684] ? do_syscall_64+0x69/0x90 [ 15.129686] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 15.129689] RIP: 0033:0x7fa96ecd559b [ 15.129692] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c \ c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff \ ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48 [ 15.129696] RSP: 002b:00007ffcaf85c258 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [ 15.129699] RAX: ffffffffffffffda RBX: 00007fa96f1b48f0 RCX: 00007fa96ecd559b [ 15.129701] RDX: 00007fa97017e610 RSI: 00000000c138fd09 RDI: 0000000000000003 [ 15.129702] RBP: 00007fa96ebab583 R08: 00007fa97017c9e0 R09: 00007ffcaf85bf27 [ 15.129704] R10: 0000000000000001 R11: 0000000000000206 R12: 00007fa97017e610 [ 15.129706] R13: 00007fa97017e640 R14: 00007fa97017e6c0 R15: 00007fa97017e530 [ 15.129709] </TASK> This is caused by missing mddev->queue check for setting QUEUE_FLAG_NOWAIT Fix this by moving the QUEUE_FLAG_NOWAIT logic to under mddev->queue check. Fixes: f51d46d0e7cb ("md: add support for REQ_NOWAIT") Reported-by: Leon Möller <jkhsjdhjs@totally.rip> Tested-by: Leon Möller <jkhsjdhjs@totally.rip> Cc: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>
2022-01-28dm: properly fix redundant bio-based IO accountingMike Snitzer
Record the start_time for a bio but defer the starting block core's IO accounting until after IO is submitted using bio_start_io_acct_time(). This approach avoids the need to mess around with any of the individual IO stats in response to a bio_split() that follows bio submission. Reported-by: Bud Brown <bubrown@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Depends-on: e45c47d1f94e ("block: add bio_start_io_acct_time() to control start_time") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220128155841.39644-4-snitzer@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-28dm: revert partial fix for redundant bio-based IO accountingMike Snitzer
Reverts a1e1cb72d9649 ("dm: fix redundant IO accounting for bios that need splitting") because it was too narrow in scope (only addressed redundant 'sectors[]' accounting and not ios, nsecs[], etc). Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220128155841.39644-3-snitzer@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-12Merge tag 'libnvdimm-for-5.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax and libnvdimm updates from Dan Williams: "The bulk of this is a rework of the dax_operations API after discovering the obstacles it posed to the work-in-progress DAX+reflink support for XFS and other copy-on-write filesystem mechanics. Primarily the need to plumb a block_device through the API to handle partition offsets was a sticking point and Christoph untangled that dependency in addition to other cleanups to make landing the DAX+reflink support easier. The DAX_PMEM_COMPAT option has been around for 4 years and not only are distributions shipping userspace that understand the current configuration API, but some are not even bothering to turn this option on anymore, so it seems a good time to remove it per the deprecation schedule. Recall that this was added after the device-dax subsystem moved from /sys/class/dax to /sys/bus/dax for its sysfs organization. All recent functionality depends on /sys/bus/dax. Some other miscellaneous cleanups and reflink prep patches are included as well. Summary: - Simplify the dax_operations API: - Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining and applying a partition offset to all its DAX iomap operations. - Remove wrappers and device-mapper stacked callbacks for ->copy_from_iter() and ->copy_to_iter() in favor of moving block_device relative offset responsibility to the dax_direct_access() caller. - Remove the need for an @bdev in filesystem-DAX infrastructure - Remove unused uio helpers copy_from_iter_flushcache() and copy_mc_to_iter() as only the non-check_copy_size() versions are used for DAX. - Prepare XFS for the pending (next merge window) DAX+reflink support - Remove deprecated DEV_DAX_PMEM_COMPAT support - Cleanup a straggling misuse of the GUID api" * tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits) iomap: Fix error handling in iomap_zero_iter() ACPI: NFIT: Import GUID before use dax: remove the copy_from_iter and copy_to_iter methods dax: remove the DAXDEV_F_SYNC flag dax: simplify dax_synchronous and set_dax_synchronous uio: remove copy_from_iter_flushcache() and copy_mc_to_iter() iomap: turn the byte variable in iomap_zero_iter into a ssize_t memremap: remove support for external pgmap refcounts fsdax: don't require CONFIG_BLOCK iomap: build the block based code conditionally dax: fix up some of the block device related ifdefs fsdax: shift partition offset handling into the file systems dax: return the partition offset from fs_dax_get_by_bdev iomap: add a IOMAP_DAX flag xfs: pass the mapping flags to xfs_bmbt_to_iomap xfs: use xfs_direct_write_iomap_ops for DAX zeroing xfs: move dax device handling into xfs_{alloc,free}_buftarg ext4: cleanup the dax handling in ext4_fill_super ext2: cleanup the dax handling in ext2_fill_super fsdax: decouple zeroing from the iomap buffered I/O code ...
2022-01-12Merge tag 'for-5.17/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Fixes and improvements to dm btree and dm space map code in persistent-data library used by thinp and cache. - Update DM integrity to use struct_group() to zero struct journal_sector. - Update DM sysfs to use default_groups in kobj_type. * tag 'for-5.17/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm sysfs: use default_groups in kobj_type dm integrity: Use struct_group() to zero struct journal_sector dm space map common: add bounds check to sm_ll_lookup_bitmap() dm btree: add a defensive bounds check to insert_at() dm btree remove: change a bunch of BUG_ON() calls to proper errors dm btree spine: eliminate duplicate le32_to_cpu() in node_check() dm btree spine: remove extra node_check function declaration
2022-01-12Merge tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: - mtip32xx pci cleanups (Bjorn) - mtip32xx conversion to generic power management (Vaibhav) - rsxx pci powermanagement cleanups (Bjorn) - Remove the rsxx driver. This hardware never saw much adoption, and it's been end of lifed for a while. (Christoph) - MD pull request from Song: - REQ_NOWAIT support (Vishal Verma) - raid6 benchmark optimization (Dirk Müller) - Fix for acct bioset (Xiao Ni) - Clean up max_queued_requests (Mariusz Tkaczyk) - PREEMPT_RT optimization (Davidlohr Bueso) - Use default_groups in kobj_type (Greg Kroah-Hartman) - Use attribute groups in pktcdvd and rnbd (Greg) - NVMe pull request from Christoph: - increment request genctr on completion (Keith Busch, Geliang Tang) - add a 'iopolicy' module parameter (Hannes Reinecke) - print out valid arguments when reading from /dev/nvme-fabrics (Hannes Reinecke) - Use struct_group() in drbd (Kees) - null_blk fixes (Ming) - Get rid of congestion logic in pktcdvd (Neil) - Floppy ejection hang fix (Tasos) - Floppy max user request size fix (Xiongwei) - Loop locking fix (Tetsuo) * tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-block: (32 commits) md: use default_groups in kobj_type md: Move alloc/free acct bioset in to personality lib/raid6: Use strict priority ranking for pq gen() benchmarking lib/raid6: skip benchmark of non-chosen xor_syndrome functions md: fix spelling of "its" md: raid456 add nowait support md: raid10 add nowait support md: raid1 add nowait support md: add support for REQ_NOWAIT md: drop queue limitation for RAID1 and RAID10 md/raid5: play nice with PREEMPT_RT block/rnbd-clt-sysfs: use default_groups in kobj_type pktcdvd: convert to use attribute groups block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0 nvme: add 'iopolicy' module parameter nvme: drop unused variable ctrl in nvme_setup_cmd nvme: increment request genctr on completion nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics block: remove the rsxx driver rsxx: Drop PCI legacy power management ...
2022-01-12Merge tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - Unify where the struct request handling code is located in the blk-mq code (Christoph) - Header cleanups (Christoph) - Clean up the io_context handling code (Christoph, me) - Get rid of ->rq_disk in struct request (Christoph) - Error handling fix for add_disk() (Christoph) - request allocation cleanusp (Christoph) - Documentation updates (Eric, Matthew) - Remove trivial crypto unregister helper (Eric) - Reduce shared tag overhead (John) - Reduce poll_stats memory overhead (me) - Known indirect function call for dio (me) - Use atomic references for struct request (me) - Support request list issue for block and NVMe (me) - Improve queue dispatch pinning (Ming) - Improve the direct list issue code (Keith) - BFQ improvements (Jan) - Direct completion helper and use it in mmc block (Sebastian) - Use raw spinlock for the blktrace code (Wander) - fsync error handling fix (Ye) - Various fixes and cleanups (Lukas, Randy, Yang, Tetsuo, Ming, me) * tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block: (132 commits) MAINTAINERS: add entries for block layer documentation docs: block: remove queue-sysfs.rst docs: sysfs-block: document virt_boundary_mask docs: sysfs-block: document stable_writes docs: sysfs-block: fill in missing documentation from queue-sysfs.rst docs: sysfs-block: add contact for nomerges docs: sysfs-block: sort alphabetically docs: sysfs-block: move to stable directory block: don't protect submit_bio_checks by q_usage_counter block: fix old-style declaration nvme-pci: fix queue_rqs list splitting block: introduce rq_list_move block: introduce rq_list_for_each_safe macro block: move rq_list macros to blk-mq.h block: drop needless assignment in set_task_ioprio() block: remove unnecessary trailing '\' bio.h: fix kernel-doc warnings block: check minor range in device_add_disk() block: use "unsigned long" for blk_validate_block_size(). block: fix error unwinding in device_add_disk ...
2022-01-07Merge tag 'block-5.16-2022-01-07' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fix from Jens Axboe: "Just the md bitmap regression this time" * tag 'block-5.16-2022-01-07' of git://git.kernel.dk/linux-block: md/raid1: fix missing bitmap update w/o WriteMostly devices
2022-01-06md: use default_groups in kobj_typeGreg Kroah-Hartman
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the md rdev sysfs code to use default_groups field which has been the preferred way since commit aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: Move alloc/free acct bioset in to personalityXiao Ni
bioset acct is only needed for raid0 and raid5. Therefore, md_run only allocates it for raid0 and raid5. However, this does not cover personality takeover, which may cause uninitialized bioset. For example, the following repro steps: mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1 mdadm --wait /dev/md0 mkfs.xfs /dev/md0 mdadm /dev/md0 --grow -l5 mount /dev/md0 /mnt causes panic like: [ 225.933939] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 225.934903] #PF: supervisor instruction fetch in kernel mode [ 225.935639] #PF: error_code(0x0010) - not-present page [ 225.936361] PGD 0 P4D 0 [ 225.936677] Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI [ 225.937525] CPU: 27 PID: 1133 Comm: mount Not tainted 5.16.0-rc3+ #706 [ 225.938416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.module_el8.4.0+547+a85d02ba 04/01/2014 [ 225.939922] RIP: 0010:0x0 [ 225.940289] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. [ 225.941196] RSP: 0018:ffff88815897eff0 EFLAGS: 00010246 [ 225.941897] RAX: 0000000000000000 RBX: 0000000000092800 RCX: ffffffff81370a39 [ 225.942813] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000092800 [ 225.943772] RBP: 1ffff1102b12fe04 R08: fffffbfff0b43c01 R09: fffffbfff0b43c01 [ 225.944807] R10: ffffffff85a1e007 R11: fffffbfff0b43c00 R12: ffff88810eaaaf58 [ 225.945757] R13: 0000000000000000 R14: ffff88810eaaafb8 R15: ffff88815897f040 [ 225.946709] FS: 00007ff3f2505080(0000) GS:ffff888fb5e00000(0000) knlGS:0000000000000000 [ 225.947814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 225.948556] CR2: ffffffffffffffd6 CR3: 000000015aa5a006 CR4: 0000000000370ee0 [ 225.949537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 225.950455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 225.951414] Call Trace: [ 225.951787] <TASK> [ 225.952120] mempool_alloc+0xe5/0x250 [ 225.952625] ? mempool_resize+0x370/0x370 [ 225.953187] ? rcu_read_lock_sched_held+0xa1/0xd0 [ 225.953862] ? rcu_read_lock_bh_held+0xb0/0xb0 [ 225.954464] ? sched_clock_cpu+0x15/0x120 [ 225.955019] ? find_held_lock+0xac/0xd0 [ 225.955564] bio_alloc_bioset+0x1ed/0x2a0 [ 225.956080] ? lock_downgrade+0x3a0/0x3a0 [ 225.956644] ? bvec_alloc+0xc0/0xc0 [ 225.957135] bio_clone_fast+0x19/0x80 [ 225.957651] raid5_make_request+0x1370/0x1b70 [ 225.958286] ? sched_clock_cpu+0x15/0x120 [ 225.958797] ? __lock_acquire+0x8b2/0x3510 [ 225.959339] ? raid5_get_active_stripe+0xce0/0xce0 [ 225.959986] ? lock_is_held_type+0xd8/0x130 [ 225.960528] ? rcu_read_lock_sched_held+0xa1/0xd0 [ 225.961135] ? rcu_read_lock_bh_held+0xb0/0xb0 [ 225.961703] ? sched_clock_cpu+0x15/0x120 [ 225.962232] ? lock_release+0x27a/0x6c0 [ 225.962746] ? do_wait_intr_irq+0x130/0x130 [ 225.963302] ? lock_downgrade+0x3a0/0x3a0 [ 225.963815] ? lock_release+0x6c0/0x6c0 [ 225.964348] md_handle_request+0x342/0x530 [ 225.964888] ? set_in_sync+0x170/0x170 [ 225.965397] ? blk_queue_split+0x133/0x150 [ 225.965988] ? __blk_queue_split+0x8b0/0x8b0 [ 225.966524] ? submit_bio_checks+0x3b2/0x9d0 [ 225.967069] md_submit_bio+0x127/0x1c0 [...] Fix this by moving alloc/free of acct bioset to pers->run and pers->free. While we are on this, properly handle md_integrity_register() error in raid0_run(). Fixes: daee2024715d (md: check level before create and exit io_acct_set) Cc: stable@vger.kernel.org Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: fix spelling of "its"Randy Dunlap
Use the possessive "its" instead of the contraction "it's" in printed messages. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: raid456 add nowait supportVishal Verma
Returns EAGAIN in case the raid456 driver would block waiting for reshape. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: raid10 add nowait supportVishal Verma
This adds nowait support to the RAID10 driver. Very similar to raid1 driver changes. It makes RAID10 driver return with EAGAIN for situations where it could wait for eg: - Waiting for the barrier, - Reshape operation, - Discard operation. wait_barrier() and regular_request_wait() fn are modified to return bool to support error for wait barriers. They returns true in case of wait or if wait is not required and returns false if wait was required but not performed to support nowait. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: raid1 add nowait supportVishal Verma
This adds nowait support to the RAID1 driver. It makes RAID1 driver return with EAGAIN for situations where it could wait for eg: - Waiting for the barrier, wait_barrier() fn is modified to return bool to support error for wait barriers. It returns true in case of wait or if wait is not required and returns false if wait was required but not performed to support nowait. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: add support for REQ_NOWAITVishal Verma
commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added support for checking whether a given bdev supports handling of REQ_NOWAIT or not. Since then commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable it for linear target") added support for REQ_NOWAIT for dm. This uses a similar approach to incorporate REQ_NOWAIT for md based bios. This patch was tested using t/io_uring tool within FIO. A nvme drive was partitioned into 2 partitions and a simple raid 0 configuration /dev/md0 was created. md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0] 937423872 blocks super 1.2 512k chunks Before patch: $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100 Running top while the above runs: $ ps -eL | grep $(pidof io_uring) 38396 38396 pts/2 00:00:00 io_uring 38396 38397 pts/2 00:00:15 io_uring 38396 38398 pts/2 00:00:13 iou-wrk-38397 We can see iou-wrk-38397 io worker thread created which gets created when io_uring sees that the underlying device (/dev/md0 in this case) doesn't support nowait. After patch: $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100 Running top while the above runs: $ ps -eL | grep $(pidof io_uring) 38341 38341 pts/2 00:10:22 io_uring 38341 38342 pts/2 00:10:37 io_uring After running this patch, we don't see any io worker thread being created which indicated that io_uring saw that the underlying device does support nowait. This is the exact behaviour noticed on a dm device which also supports nowait. For all the other raid personalities except raid0, we would need to train pieces which involves make_request fn in order for them to correctly handle REQ_NOWAIT. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md: drop queue limitation for RAID1 and RAID10Mariusz Tkaczyk
As suggested by Neil Brown[1], this limitation seems to be deprecated. With plugging in use, writes are processed behind the raid thread and conf->pending_count is not increased. This limitation occurs only if caller doesn't use plugs. It can be avoided and often it is (with plugging). There are no reports that queue is growing to enormous size so remove queue limitation for non-plugged IOs too. [1] https://lore.kernel.org/linux-raid/162496301481.7211.18031090130574610495@noble.neil.brown.name Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>
2022-01-06md/raid5: play nice with PREEMPT_RTDavidlohr Bueso
raid_run_ops() relies on the implicitly disabled preemption for its percpu ops, although this is really about CPU locality. This breaks RT semantics as it can take regular (and thus sleeping) spinlocks, such as stripe_lock. Add a local_lock such that non-RT does not change and continues to be just map to preempt_disable/enable, but makes RT happy as the region will use a per-CPU spinlock and thus be preemptible and still guarantee CPU locality. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2022-01-06dm sysfs: use default_groups in kobj_typeGreg Kroah-Hartman
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the dm sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2022-01-06dm integrity: Use struct_group() to zero struct journal_sectorKees Cook
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memset(), avoid intentionally writing across neighboring fields. Add struct_group() to mark region of struct journal_sector that should be initialized to zero. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2022-01-04dm space map common: add bounds check to sm_ll_lookup_bitmap()Joe Thornber
Corrupted metadata could warrant returning error from sm_ll_lookup_bitmap(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2022-01-04dm btree: add a defensive bounds check to insert_at()Joe Thornber
Corrupt metadata could trigger an out of bounds write. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>