summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2026-02-05gfs2: fiemap page fault fixAndreas Gruenbacher
In gfs2_fiemap(), we are calling iomap_fiemap() while holding the inode glock. This can lead to recursive glock taking if the fiemap buffer is memory mapped to the same inode and accessing it triggers a page fault. Fix by disabling page faults for iomap_fiemap() and faulting in the buffer by hand if necessary. Fixes xfstest generic/742. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-02-05sysfs: remove exports of sysfs_*change_owner()Greg Kroah-Hartman
Both sysfs_change_owner() and sysfs_file_change_owner() are exported to modules, but there are no in-kernel module users, so remove the exports so that crazy out-of-tree drivers don't get the impression that it is safe to call these functions at all. Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Reported-by: Lee Jones <lee@kernel.org> Reviewed-by: Lee Jones <lee@kernel.org> Link: https://patch.msgid.link/2026020541-energize-graduate-981a@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-05erofs: update compression algorithm statusGao Xiang
The following changes are proposed in the upcoming Linux 7.0: - Enable LZMA support by default, as it's already in use by Fedora 42/43 and some Android vendors for minimal filesystem sizes; - Promote DEFLATE and Zstandard out of EXPERIMENTAL status, given that they have been landed and well-tested for over a year and are already ready for general use. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-05erofs: fix inline data read failure for ztailpacking pclustersGao Xiang
Compressed folios for ztailpacking pclusters must be valid before adding these pclusters to I/O chains. Otherwise, z_erofs_decompress_pcluster() may assume they are already valid and then trigger a NULL pointer dereference. It is somewhat hard to reproduce because the inline data is in the same block as the tail of the compressed indexes, which are usually read just before. However, it may still happen if a fatal signal arrives while read_mapping_folio() is running, as shown below: erofs: (device dm-1): z_erofs_pcluster_begin: failed to get inline data -4 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 ... pc : z_erofs_decompress_queue+0x4c8/0xa14 lr : z_erofs_decompress_queue+0x160/0xa14 sp : ffffffc08b3eb3a0 x29: ffffffc08b3eb570 x28: ffffffc08b3eb418 x27: 0000000000001000 x26: ffffff8086ebdbb8 x25: ffffff8086ebdbb8 x24: 0000000000000001 x23: 0000000000000008 x22: 00000000fffffffb x21: dead000000000700 x20: 00000000000015e7 x19: ffffff808babb400 x18: ffffffc089edc098 x17: 00000000c006287d x16: 00000000c006287d x15: 0000000000000004 x14: ffffff80ba8f8000 x13: 0000000000000004 x12: 00000006589a77c9 x11: 0000000000000015 x10: 0000000000000000 x9 : 0000000000000000 x8 : 0000000000000000 x7 : 0000000000000000 x6 : 000000000000003f x5 : 0000000000000040 x4 : ffffffffffffffe0 x3 : 0000000000000020 x2 : 0000000000000008 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: z_erofs_decompress_queue+0x4c8/0xa14 z_erofs_runqueue+0x908/0x97c z_erofs_read_folio+0x128/0x228 filemap_read_folio+0x68/0x128 filemap_get_pages+0x44c/0x8b4 filemap_read+0x12c/0x5b8 generic_file_read_iter+0x4c/0x15c do_iter_readv_writev+0x188/0x1e0 vfs_iter_read+0xac/0x1a4 backing_file_read_iter+0x170/0x34c ovl_read_iter+0xf0/0x140 vfs_read+0x28c/0x344 ksys_read+0x80/0xf0 __arm64_sys_read+0x24/0x34 invoke_syscall+0x60/0x114 el0_svc_common+0x88/0xe4 do_el0_svc+0x24/0x30 el0_svc+0x40/0xa8 el0t_64_sync_handler+0x70/0xbc el0t_64_sync+0x1bc/0x1c0 Fix this by reading the inline data before allocating and adding the pclusters to the I/O chains. Fixes: cecf864d3d76 ("erofs: support inline data decompression") Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-and-tested-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-04ceph: fix NULL pointer dereference in ceph_mds_auth_match()Viacheslav Dubeyko
The CephFS kernel client has regression starting from 6.18-rc1. We have issue in ceph_mds_auth_match() if fs_name == NULL: const char fs_name = mdsc->fsc->mount_options->mds_namespace; ... if (auth->match.fs_name && strcmp(auth->match.fs_name, fs_name)) { / fsname mismatch, try next one */ return 0; } Patrick Donnelly suggested that: In summary, we should definitely start decoding `fs_name` from the MDSMap and do strict authorizations checks against it. Note that the `-o mds_namespace=foo` should only be used for selecting the file system to mount and nothing else. It's possible no mds_namespace is specified but the kernel will mount the only file system that exists which may have name "foo". This patch reworks ceph_mdsmap_decode() and namespace_equals() with the goal of supporting the suggested concept. Now struct ceph_mdsmap contains m_fs_name field that receives copy of extracted FS name by ceph_extract_encoded_string(). For the case of "old" CephFS file systems, it is used "cephfs" name. [ idryomov: replace redundant %*pE with %s in ceph_mdsmap_decode(), get rid of a series of strlen() calls in ceph_namespace_match(), drop changes to namespace_equals() body to avoid treating empty mds_namespace as equal, drop changes to ceph_mdsc_handle_fsmap() as namespace_equals() isn't an equivalent substitution there ] Cc: stable@vger.kernel.org Fixes: 22c73d52a6d0 ("ceph: fix multifs mds auth caps issue") Link: https://tracker.ceph.com/issues/73886 Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Patrick Donnelly <pdonnell@ibm.com> Tested-by: Patrick Donnelly <pdonnell@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-02-04fsverity: remove inode from fsverity_verification_ctxEric Biggers
This field is no longer used, so remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260202213339.143683-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04fsverity: use a hashtable to find the fsverity_infoChristoph Hellwig
Use the kernel's resizable hash table (rhashtable) to find the fsverity_info. This way file systems that want to support fsverity don't have to bloat every inode in the system with an extra pointer. The trade-off is that looking up the fsverity_info is a bit more expensive now, but the main operations are still dominated by I/O and hashing overhead. The rhashtable implementations requires no external synchronization, and the _fast versions of the APIs provide the RCU critical sections required by the implementation. Because struct fsverity_info is only removed on inode eviction and does not contain a reference count, there is no need for an extended critical section to grab a reference or validate the object state. The file open path uses rhashtable_lookup_get_insert_fast, which can either find an existing object for the hash key or insert a new one in a single atomic operation, so that concurrent opens never instantiate duplicate fsverity_info structure. FS_IOC_ENABLE_VERITY must already be synchronized by a combination of i_rwsem and file system flags and uses rhashtable_lookup_insert_fast, which errors out on an existing object for the hash key as an additional safety check. Because insertion into the hash table now happens before S_VERITY is set, fsverity just becomes a barrier and a flag check and doesn't have to look up the fsverity_info at all, so there is only a single lookup per ->read_folio or ->readahead invocation. For btrfs there is an additional one for each bio completion, while for ext4 and f2fs the fsverity_info is stored in the per-I/O context and reused for the completion workqueue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/r/20260202060754.270269-12-hch@lst.de [EB: folded in fix for missing fsverity_free_info()] Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04btrfs: consolidate fsverity_info lookupChristoph Hellwig
Look up the fsverity_info once in btrfs_do_readpage, and then use it for all operations performed there, and do the same in end_folio_read for all folios processed there. The latter is also changed to derive the inode from the btrfs_bio - while bbio->inode is optional, it is always set for buffered reads. This amortizes the lookup better once it becomes less efficient. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20260202060754.270269-11-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04f2fs: consolidate fsverity_info lookupChristoph Hellwig
Look up the fsverity_info once in f2fs_mpage_readpages, and then use it for the readahead, local verification of holes and pass it along to the I/O completion workqueue in struct bio_post_read_ctx. Do the same thing in f2fs_get_read_data_folio for reads that come from garbage collection and other background activities. This amortizes the lookup better once it becomes less efficient. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20260202060754.270269-10-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04ext4: consolidate fsverity_info lookupChristoph Hellwig
Look up the fsverity_info once in ext4_mpage_readpages, and then use it for the readahead, local verification of holes and pass it along to the I/O completion workqueue in struct bio_post_read_ctx. This amortizes the lookup better once it becomes less efficient. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/r/20260202060754.270269-9-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-04ext4: specify the free pointer offset for ext4_inode_cacheHarry Yoo
Convert ext4_inode_cache to use the kmem_cache_args interface and specify a free pointer offset. Since ext4_inode_cache uses a constructor, the free pointer would be placed after the object to prevent overwriting fields used by the constructor. However, some fields such as ->i_flags are not used by the constructor and can safely be repurposed for the free pointer. Specify the free pointer offset at i_flags to reduce the object size. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-4-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-03fs/tests: exec: drop duplicate bprm_stack_limits test vectorsTitouan Ameline de Cadeville
Remove duplicate entries from the bprm_stack_limits KUnit test vector table. The duplicates do not add coverage and only increase test size. Signed-off-by: Titouan Ameline de Cadeville <titouan.ameline@gmail.com> Fixes: 60371f43e56b ("exec: Add KUnit test for bprm_stack_limits()") Link: https://patch.msgid.link/20260203175950.43710-1-titouan.ameline@gmail.com Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-04fs/ntfs3: allow explicit boolean acl/prealloc mount optionsKonstantin Komarov
This patch improves mount option parsing by allowing explicit boolean values for acl and prealloc. Previously those options were exposed only as presence/absence flags. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2026-02-03Merge tag 'v6.19rc8-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: "Two small client memory leak fixes" * tag 'v6.19rc8-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb/client: fix memory leak in SendReceive() smb/client: fix memory leak in smb2_open_file()
2026-02-03ceph: fix oops due to invalid pointer for kfree() in parse_longname()Daniel Vogelbacher
This fixes a kernel oops when reading ceph snapshot directories (.snap), for example by simply running `ls /mnt/my_ceph/.snap`. The variable str is guarded by __free(kfree), but advanced by one for skipping the initial '_' in snapshot names. Thus, kfree() is called with an invalid pointer. This patch removes the need for advancing the pointer so kfree() is called with correct memory pointer. Steps to reproduce: 1. Create snapshots on a cephfs volume (I've 63 snaps in my testcase) 2. Add cephfs mount to fstab $ echo "samba-fileserver@.files=/volumes/datapool/stuff/3461082b-ecc9-4e82-8549-3fd2590d3fb6 /mnt/test/stuff ceph acl,noatime,_netdev 0 0" >> /etc/fstab 3. Reboot the system $ systemctl reboot 4. Check if it's really mounted $ mount | grep stuff 5. List snapshots (expected 63 snapshots on my system) $ ls /mnt/test/stuff/.snap Now ls hangs forever and the kernel log shows the oops. Cc: stable@vger.kernel.org Fixes: 101841c38346 ("[ceph] parse_longname(): strrchr() expects NUL-terminated string") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220807 Suggested-by: Helge Deller <deller@gmx.de> Signed-off-by: Daniel Vogelbacher <daniel@chaospixel.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-02-03Merge tag 'for-6.19-rc8-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A regression fix for a memory leak when raid56 is used" * tag 'for-6.19-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
2026-02-03gfs2: fix memory leaks in gfs2_fill_super error pathDeepanshu Kartikey
Fix two memory leaks in the gfs2_fill_super() error handling path when transitioning a filesystem to read-write mode fails. First leak: kthread objects (thread_struct, task_struct, etc.) When gfs2_freeze_lock_shared() fails after init_threads() succeeds, the created kernel threads (logd and quotad) are never destroyed. This occurs because the fail_per_node label doesn't call gfs2_destroy_threads(). Second leak: quota bitmap buffer (8192 bytes) When gfs2_make_fs_rw() fails after gfs2_quota_init() succeeds but before other operations complete, the allocated quota bitmap is never freed. The fix moves thread cleanup to the fail_per_node label to handle all error paths uniformly. gfs2_destroy_threads() is safe to call unconditionally as it checks for NULL pointers. Quota cleanup is added in gfs2_make_fs_rw() to properly handle the withdrawal case where quota initialization succeeds but the filesystem is then withdrawn. Thread leak backtrace (gfs2_freeze_lock_shared failure): unreferenced object 0xffff88801d7bca80 (size 4480): copy_process+0x3a1/0x4670 kernel/fork.c:2422 kernel_clone+0xf3/0x6e0 kernel/fork.c:2779 kthread_create_on_node+0x100/0x150 kernel/kthread.c:478 init_threads+0xab/0x350 fs/gfs2/ops_fstype.c:611 gfs2_fill_super+0xe5c/0x1240 fs/gfs2/ops_fstype.c:1265 Quota leak backtrace (gfs2_make_fs_rw failure): unreferenced object 0xffff88812de7c000 (size 8192): gfs2_quota_init+0xe5/0x820 fs/gfs2/quota.c:1409 gfs2_make_fs_rw+0x7a/0xe0 fs/gfs2/super.c:149 gfs2_fill_super+0xfbb/0x1240 fs/gfs2/ops_fstype.c:1275 Reported-by: syzbot+aac438d7a1c44071e04b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=aac438d7a1c44071e04b Fixes: 6c7410f44961 ("gfs2: gfs2_freeze_lock_shared cleanup") Fixes: b66f723bb552 ("gfs2: Improve gfs2_make_fs_rw error handling") Link: https://lore.kernel.org/all/20260131062509.77974-1-kartikey406@gmail.com/T/ [v1] Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-02-03btrfs: get rid of compressed_bio::compressed_folios[]Qu Wenruo
Now there is no one utilizing that member, we can safely remove it along with compressed_bio::nr_folios member. The size is reduced from 352 to 336 bytes on x86_64. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: get rid of compressed_folios[] usage for encoded writesQu Wenruo
Currently only encoded writes utilized btrfs_submit_compressed_write(), which utilized compressed_bio::compressed_folios[] array. Change the only call site to call the new helper, btrfs_alloc_compressed_write(), to allocate a compressed bio, then queue needed folios into that bio, and finally call btrfs_submit_compressed_write() to submit the compressed bio. This change has one hidden benefit, previously we used btrfs_alloc_folio_array() for the folios of btrfs_submit_compressed_read(), which doesn't utilize the compression page pool for bs == ps cases. Now we call btrfs_alloc_compr_folio() which will benefit from the page pool. The other obvious benefit is that we no longer need to allocate an array to hold all those folios, thus one less error path. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: get rid of compressed_folios[] usage for compressed readQu Wenruo
Currently btrfs_submit_compressed_read() still uses compressed_bio::compressed_folios[] array. Change it to allocate each folio and queue them into the compressed bio so that we do not need to allocate that array. Considering how small each compressed read bio is (less than 128KiB), we do not benefit that much from btrfs_alloc_folio_array() anyway, while we may benefit more from btrfs_alloc_compr_folio() by using the global folio pool. So changing from btrfs_alloc_folio_array() to btrfs_alloc_compr_folio() in a loop should still be fine. This removes one error path, and paves the way to completely remove compressed_folios[] array. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: remove the old btrfs_compress_folios() infrastructureQu Wenruo
Since it's been replaced by btrfs_compress_bio(), remove all involved functions. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: switch to btrfs_compress_bio() interface for compressed writesQu Wenruo
This switch has the following benefits: - A single structure to handle all compression No more extra members like compressed_folios[] nor compress_type, all those members. This means the structure of async_extent is much smaller. - Simpler error handling A single cleanup_compressed_bio() will handle everything, no extra compressed_folios[] array to bother. Some extra notes: - Compressed folios releasing Now we go bio_for_each_folio_all() loop to release the folios of the bio. This will work for both the old compressed_folios[] array and the new pure bio method. For old compressed_folios[], all folios of that array is queued into the bio, thus releasing the folios from the bio is the same as releasing each folio of that array. We just need to be sure no double releasing from the array and bio. For the new pure bio method, that array is NULL, just usual folio releasing of the bio. The only extra note is for end_bbio_compressed_read(), as the folios are allocated using btrfs_alloc_folio_array(), thus the folios should only be released by regular folio_put(), not btrfs_free_compr_folio(). - Rounding up the bio to block size We cannot simply increase bi_size, as that will not increase the length of the last bvec. Thus we have to properly add the last part into the bio. This will be done by the helper, round_up_last_block(). The reason we do not round those bios up at compression time is to get the unaligned compressed size, so that they can be utilized for inline extents. If we round the bios up at *_compress_bio(), then every compressed bio will be larger than or equal to one fs block, resulting no inline compressed extent. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: introduce btrfs_compress_bio() helperQu Wenruo
The helper will allocate a new compressed_bio, do the compression, and return it to the caller. This greatly simplifies the compression path, as we no longer need to allocate a folio array thus no extra error path, furthermore the compressed bio structure can be utilized for submission with very minor modifications (like rounding up the bi_size and populate the bi_sector). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: zlib: introduce zlib_compress_bio() helperQu Wenruo
The new helper has the following enhancements against the existing zlib_compress_folios() - Much smaller parameter list No more shared IN/OUT members, no need to pre-allocate a compressed_folios[] array. Just a workspace and compressed_bio pointer, everything we need can be extracted from that @cb pointer. - Ready-to-be-submitted compressed bio Although the caller still needs to do some common works like rounding up and zeroing the tailing part of the last fs block. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: zstd: introduce zstd_compress_bio() helperQu Wenruo
The new helper has the following enhancements against the existing zstd_compress_folios() - Much smaller parameter list No more shared IN/OUT members, no need to pre-allocate a compressed_folios[] array. Just a workspace and compressed_bio pointer, everything we need can be extracted from that @cb pointer. - Ready-to-be-submitted compressed bio Although the caller still needs to do some common works like rounding up and zeroing the tailing part of the last fs block. Overall the workflow is the same as zstd_compress_folios(), but with some minor changes: - @start/@len is now constant For the current input file offset, use @start + @tot_in instead. The original change of @start and @len makes it pretty hard to know what value we're really comparing to. - No more @cur_len It's only utilized when switching input buffer. Directly use btrfs_calc_input_length() instead. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: lzo: introduce lzo_compress_bio() helperQu Wenruo
The new helper has the following enhancements against the existing lzo_compress_folios() - Much smaller parameter list No more shared IN/OUT members, no need to pre-allocate a compressed_folios[] array. Just a workspace list header and a compressed_bio pointer. Everything else can be fetched from that @cb pointer. - Read-to-be-submitted compressed bio Although the caller still needs to do some common works like rounding up and zeroing the tailing part of the last fs block. Some workloads are specific to lZO that is not needed with other multi-run compression interfaces: - Need to write a LZO header or segment header Use the new write_and_queue_folio() helper to do the bio_add_folio() call and folio switching. - Need to update the LZO header after compression is done Use bio_first_folio_all() to grab the first folio and update the header. - Extra corner case of error handling This can happen when we have queued part of a folio and hit an error. In that case those folios will be released by the bio. Thus we can only release the folio that has no queued part. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: zoned: factor out the zone loading part into a testable functionNaohiro Aota
Separate btrfs_load_block_group_* calling path into a function, so that it can be an entry point of unit test. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: add cleanup function for btrfs_free_chunk_mapNaohiro Aota
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: tests: add cleanup functions for test specific functionsNaohiro Aota
Add auto-cleanup helper functions for btrfs_free_dummy_fs_info and btrfs_free_dummy_block_group. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmapFilipe Manana
We allocate the bitmap but we never free it in free_raid_bio_pointers(). Fix this by adding a bitmap_free() call against the stripe_uptodate_bitmap of a raid bio. Fixes: 1810350b04ef ("btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap") Reported-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-btrfs/20260126045315.GA31641@lst.de/ Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: tests: add unit tests for pending extent walking functionsBoris Burkov
I ran into another sort of trivial bug in v1 of the patch and concluded that these functions really ought to be unit tested. These two functions form the core of searching the chunk allocation pending extent bitmap and have relatively easily definable semantics, so unit testing them can help ensure the correctness of chunk allocation. I also made a minor unrelated fix in volumes.h to properly forward declare btrfs_space_info. Because of the order of the includes in the new test, this was actually hitting a latent build warning. Note: This is an early example for me of a commit authored in part by an AI agent, so I wanted to more clear about what I did. I defined a trivial test and explained the set of tests I wanted to the agent and it produced the large set of test cases seen here. I then checked each test case to make sure it matched the description and simplified the constants and numbers until they looked reasonable to me. I then checked the looping logic to make sure it made sense to the original spirit of the trivial test. Finally, carefully combed over all the lines it wrote to loop over the tests it generated to make sure they followed our code style guide. Assisted-by: Claude:claude-opus-4-5 Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocationBoris Burkov
I have been observing a number of systems aborting at insert_dev_extents() in btrfs_create_pending_block_groups(). The following is a sample stack trace of such an abort coming from forced chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this can theoretically happen to any DUP chunk allocation. [81.801] ------------[ cut here ]------------ [81.801] BTRFS: Transaction aborted (error -17) [81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319 [81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk [81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE [81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 [81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs] [81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282 [81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000 [81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0 [81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007 [81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000 [81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000 [81.809] FS: 00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000 [81.809] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0 [81.810] Call Trace: [81.810] <TASK> [81.810] __btrfs_end_transaction+0x3e/0x2b0 [btrfs] [81.811] btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs] [81.811] kernfs_fop_write_iter+0x15f/0x240 [81.812] vfs_write+0x264/0x500 [81.812] ksys_write+0x6c/0xe0 [81.812] do_syscall_64+0x66/0x770 [81.812] entry_SYSCALL_64_after_hwframe+0x76/0x7e [81.813] RIP: 0033:0x7fec6be66197 [81.814] RSP: 002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197 [81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001 [81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000 [81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002 [81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0 [81.817] </TASK> [81.817] irq event stamp: 20039 [81.818] hardirqs last enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60 [81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60 [81.819] softirqs last enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0 [81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0 [81.820] ---[ end trace 0000000000000000 ]--- [81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists Inspecting these aborts with drgn, I observed a pattern of overlapping chunk_maps. Note how stripe 1 of the first chunk overlaps in physical address with stripe 0 of the second chunk. Physical Start Physical End Length Logical Type Stripe ---------------------------------------------------------------------------------------------------- 0x0000000102500000 0x0000000142500000 1.0G 0x0000000641d00000 META|DUP 0/2 0x0000000142500000 0x0000000182500000 1.0G 0x0000000641d00000 META|DUP 1/2 0x0000000142500000 0x0000000182500000 1.0G 0x0000000601d00000 META|DUP 0/2 0x0000000182500000 0x00000001c2500000 1.0G 0x0000000601d00000 META|DUP 1/2 Now how could this possibly happen? All chunk allocation is protected by the chunk_mutex so racing allocations should see a consistent view of the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree (device->alloc_state as set by chunk_map_device_set_bits()) The tree itself is protected by a spin lock, and clearing/setting the bits is always protected by fs_info->mapping_tree_lock, so no race is apparent. It turns out that there is a subtle bug in the logic regarding chunk allocations that have happened in the current transaction, known as "pending extents". The chunk allocation as defined in find_free_dev_extent() is a loop which searches the commit root of the dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it then checks alloc_state bitmap for any pending extents and adjusts the hole that it finds accordingly. However, the logic in that adjustment assumes that the first pending extent is the only one in that range. e.g., given a layout with two non-consecutive pending extents in a hole passed to dev_extent_hole_check() via *hole_start and *hole_size: |----pending A----| real hole |----pending B----| | candidate hole | *hole_start *hole_start + *hole_size the code incorrectly returns a "hole" from the end of pending extent A until the passed in hole end, failing to account for pending B. However, it is not entirely obvious that it is actually possible to produce such a layout. I was able to reproduce it, but with some contortions: I continued to use the force chunk allocation sysfs file and I introduced a long delay (10 seconds) into the start of the cleaner thread. I also prevented the unused bgs cleaning logic from ever deleting metadata bgs. These help make it easier to deterministically produce the condition but shouldn't really matter if you imagine the conditions happening by race/luck. Allocations/frees can happen concurrently with the cleaner thread preparing to process an unused extent and both create some used chunks with an unused chunk interleaved, all during one transaction. Then btrfs_delete_unused_bgs() sees the unused one and clears it, leaving a range with several pending chunk allocations and a gap in the middle. The basic idea is that the unused_bgs cleanup work happens on a worker so if we allocate 3 block groups in one transaction, then the cleaner work kicked off by the previous transaction comes through and deletes the middle one of the 3, then the commit root shows no dev extents and we have the bad pattern in the extent-io-tree. One final consideration is that the code happens to loop to the next hole if there are no more extents at all, so we need one more dev extent way past the area we are working in. Something like the following demonstrates the technique: # push the BG frontier out to 20G fallocate -l 20G $mnt/foo # allocate one more that will prevent the "no more dev extents" luck fallocate -l 1G $mnt/sticky # sync sync # clear out the allocation area rm $mnt/foo sync _cleaner # let everything quiesce sleep 20 sync # dev tree should have one bg 20G out and the rest at the beginning.. # sort of like an empty FS but with a random sticky chunk. # kick off the cleaner in the background, remember it will sleep 10s # before doing interesting work _cleaner & sleep 3 # create 3 trivial block groups, all empty, all immediately marked as unused. echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc" echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" # let the cleaner thread definitely finish, it will remove the data bg sleep 10 # this allocation sees the non-consecutive pending metadata chunks with # data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC! echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc" As for the fix, it is not that obvious. I could not see a trivial way to do it even by adding backup loops into find_free_dev_extent(), so I opted to change the semantics of dev_extent_hole_check() to not stop looping until it finds a sufficiently big hole. For clarity, this also required changing the helper function contains_pending_extent() into two new helpers which find the first pending extent and the first suitable hole in a range. I attempted to clean up the documentation and range calculations to be as consistent and clear as possible for the future. I also looked at the zoned case and concluded that the loop there is different and not to be unified with this one. As far as I can tell, the zoned check will only further constrain the hole so looping back to find more holes is acceptable. Though given that zoned really only appends, I find it highly unlikely that it is susceptible to this bug. Fixes: 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole") Reported-by: Dimitrios Apostolou <jimis@gmx.net> Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: fix transaction commit blocking during trim of unallocated spacejinbaohong
When trimming unallocated space, btrfs_trim_fs() holds the device_list_mutex for the entire duration while iterating through all devices. On large filesystems with significant unallocated space, this operation can take minutes to hours on large storage systems. This causes a problem because btrfs_run_dev_stats(), which is called during transaction commit, also requires device_list_mutex: btrfs_trim_fs() mutex_lock(&fs_devices->device_list_mutex) list_for_each_entry(device, ...) btrfs_trim_free_extents(device) mutex_unlock(&fs_devices->device_list_mutex) commit_transaction() btrfs_run_dev_stats() mutex_lock(&fs_devices->device_list_mutex) // blocked! ... While trim is running, all transaction commits are blocked waiting for the mutex. Fix this by refactoring btrfs_trim_free_extents() to process devices in bounded chunks (up to 2GB per iteration) and release device_list_mutex between chunks. Signed-off-by: robbieko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: handle user interrupt properly in btrfs_trim_fs()jinbaohong
When a fatal signal is pending or the process is freezing, btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS. Currently this is treated as a regular error: the loops continue to the next iteration and count it as a block group or device failure. Instead, break out of the loops immediately and return -ERESTARTSYS to userspace without counting it as a failure. Also skip the device loop entirely if the block group loop was interrupted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: preserve first error in btrfs_trim_fs()jinbaohong
When multiple block groups or devices fail during trim, preserve the first error encountered rather than the last one. The first error is typically more useful for debugging as it represents the original failure, while subsequent errors may be cascading effects. Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: continue trimming remaining devices on failurejinbaohong
Commit 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle error better") intended to make device trimming continue even if one device fails, tracking failures and reporting them at the end. However, it used 'break' instead of 'continue', causing the loop to exit on the first device failure. Fix this by replacing 'break' with 'continue'. Fixes: 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle error better") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: do not BUG_ON() in btrfs_remove_block_group()Filipe Manana
There's no need to BUG_ON(), we can just abort the transaction and return an error. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: abort transaction on error in btrfs_remove_block_group()Filipe Manana
When btrfs_remove_block_group() fails we abort the transaction in its single caller (btrfs_remove_chunk()). This makes it harder to find out where exactly the failure happened, as several steps inside btrfs_remove_block_group() can fail. So make btrfs_remove_block_group() abort the transaction whenever an error happens, instead of aborting in its caller. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: fix block_group_tree dirty_list corruptionBoris Burkov
When the incompat flag EXTENT_TREE_V2 is set, we unconditionally add the block group tree to the switch_commits list before calling switch_commit_roots, as we do for the tree root and the chunk root. However, the block group tree uses normal root dirty tracking and in any transaction that does an allocation and dirties a block group, the block group root will already be linked to a list by the dirty_list field and this use of list_add_tail() is invalid and corrupts the prev/next members of block_group_root->dirty_list. This is apparent on a subsequent list_del on the prev if we enable CONFIG_DEBUG_LIST: [32.1571] ------------[ cut here ]------------ [32.1572] list_del corruption. next->prev should beffff958890202538, but was ffff9588992bd538. (next=ffff958890201538) [32.1575] WARNING: lib/list_debug.c:65 at 0x0, CPU#3: sync/607 [32.1583] CPU: 3 UID: 0 PID: 607 Comm: sync Not tainted 6.18.0 #24PREEMPT(none) [32.1585] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS1.17.0-4.fc41 04/01/2014 [32.1587] RIP: 0010:__list_del_entry_valid_or_report+0x108/0x120 [32.1593] RSP: 0018:ffffaa288287fdd0 EFLAGS: 00010202 [32.1594] RAX: 0000000000000001 RBX: ffff95889326e800 RCX:ffff958890201538 [32.1596] RDX: ffff9588992bd538 RSI: ffff958890202538 RDI:ffffffff82a41e00 [32.1597] RBP: ffff958890202538 R08: ffffffff828fc1e8 R09:00000000ffffefff [32.1599] R10: ffffffff8288c200 R11: ffffffff828e4200 R12:ffff958890201538 [32.1601] R13: ffff95889326e958 R14: ffff958895c24000 R15:ffff958890202538 [32.1603] FS: 00007f0c28eb5740(0000) GS:ffff958af2bd2000(0000)knlGS:0000000000000000 [32.1605] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [32.1607] CR2: 00007f0c28e8a3cc CR3: 0000000109942005 CR4:0000000000370ef0 [32.1609] Call Trace: [32.1610] <TASK> [32.1611] switch_commit_roots+0x82/0x1d0 [btrfs] [32.1615] btrfs_commit_transaction+0x968/0x1550 [btrfs] [32.1618] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs] [32.1621] __iterate_supers+0xe8/0x190 [32.1622] ? __pfx_sync_fs_one_sb+0x10/0x10 [32.1623] ksys_sync+0x63/0xb0 [32.1624] __do_sys_sync+0xe/0x20 [32.1625] do_syscall_64+0x73/0x450 [32.1626] entry_SYSCALL_64_after_hwframe+0x76/0x7e [32.1627] RIP: 0033:0x7f0c28d05d2b [32.1632] RSP: 002b:00007ffc9d988048 EFLAGS: 00000246 ORIG_RAX:00000000000000a2 [32.1634] RAX: ffffffffffffffda RBX: 00007ffc9d988228 RCX:00007f0c28d05d2b [32.1636] RDX: 00007f0c28e02301 RSI: 00007ffc9d989b21 RDI:00007f0c28dba90d [32.1637] RBP: 0000000000000001 R08: 0000000000000001 R09:0000000000000000 [32.1639] R10: 0000000000000000 R11: 0000000000000246 R12:000055b96572cb80 [32.1641] R13: 000055b96572b19f R14: 00007f0c28dfa434 R15:000055b96572b034 [32.1643] </TASK> [32.1644] irq event stamp: 0 [32.1644] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [32.1646] hardirqs last disabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260 [32.1648] softirqs last enabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260 [32.1650] softirqs last disabled at (0): [<0000000000000000>] 0x0 [32.1652] ---[ end trace 0000000000000000 ]--- Furthermore, this list corruption eventually (when we happen to add a new block group) results in getting the switch_commits and dirty_cowonly_roots lists mixed up and attempting to call update_root on the tree root which can't be found in the tree root, resulting in a transaction abort: [87.8269] BTRFS critical (device nvme1n1): unable to find root key (1 0 0) in tree 1 [87.8272] ------------[ cut here ]------------ [87.8274] BTRFS: Transaction aborted (error -117) [87.8275] WARNING: fs/btrfs/root-tree.c:153 at 0x0, CPU#4: sync/703 [87.8285] CPU: 4 UID: 0 PID: 703 Comm: sync Not tainted 6.18.0 #25 PREEMPT(none) [87.8287] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-4.fc41 04/01/2014 [87.8289] RIP: 0010:btrfs_update_root+0x296/0x790 [btrfs] [87.8295] RSP: 0018:ffffa58d035dfd60 EFLAGS: 00010282 [87.8297] RAX: ffff9a59126ddb68 RBX: ffff9a59126dc000 RCX: 0000000000000000 [87.8299] RDX: 0000000000000000 RSI: 00000000ffffff8b RDI: ffffffffc0b28270 [87.8301] RBP: ffff9a5904aec000 R08: 0000000000000000 R09: 00000000ffffefff [87.8303] R10: ffffffff9ac8c200 R11: ffffffff9ace4200 R12: 0000000000000001 [87.8305] R13: ffff9a59041740e8 R14: ffff9a5904aec1f7 R15: ffff9a590fdefaf0 [87.8307] FS: 00007f54cde6b740(0000) GS:ffff9a5b5a81c000(0000) knlGS:0000000000000000 [87.8309] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [87.8310] CR2: 00007f54cde403cc CR3: 0000000112902004 CR4: 0000000000370ef0 [87.8312] Call Trace: [87.8313] <TASK> [87.8314] ? _raw_spin_unlock+0x23/0x40 [87.8315] commit_cowonly_roots+0x1ad/0x250 [btrfs] [87.8317] ? btrfs_commit_transaction+0x79b/0x1560 [btrfs] [87.8320] btrfs_commit_transaction+0x8aa/0x1560 [btrfs] [87.8322] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs] [87.8325] __iterate_supers+0xf1/0x170 [87.8326] ? __pfx_sync_fs_one_sb+0x10/0x10 [87.8327] ksys_sync+0x63/0xb0 [87.8328] __do_sys_sync+0xe/0x20 [87.8329] do_syscall_64+0x73/0x450 [87.8330] entry_SYSCALL_64_after_hwframe+0x76/0x7e [87.8331] RIP: 0033:0x7f54cdd05d2b [87.8336] RSP: 002b:00007fff1b58ff78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 [87.8338] RAX: ffffffffffffffda RBX: 00007fff1b590158 RCX: 00007f54cdd05d2b [87.8340] RDX: 00007f54cde02301 RSI: 00007fff1b592b66 RDI: 00007f54cddba90d [87.8342] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [87.8344] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e07ca96b80 [87.8346] R13: 000055e07ca9519f R14: 00007f54cddfa434 R15: 000055e07ca95034 [87.8348] </TASK> [87.8348] irq event stamp: 0 [87.8349] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [87.8351] hardirqs last disabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0 [87.8353] softirqs last enabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0 [87.8355] softirqs last disabled at (0): [<0000000000000000>] 0x0 [87.8357] ---[ end trace 0000000000000000 ]--- [87.8358] BTRFS: error (device nvme1n1 state A) in btrfs_update_root:153: errno=-117 Filesystem corrupted [87.8360] BTRFS info (device nvme1n1 state EA): forced readonly [87.8362] BTRFS warning (device nvme1n1 state EA): Skipping commit of aborted transaction. [87.8364] BTRFS: error (device nvme1n1 state EA) in cleanup_transaction:2037: errno=-117 Filesystem corrupted Since the block group tree was pulled out of the extent tree and uses normal root dirty tracking, remove the offending extra list_add. This fixes the list corruption and the resulting fs corruption. Fixes: 14033b08a029 ("btrfs: don't save block group root into super block") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: fix copying the flags of btrfs_bio after splitJohannes Thumshirn
When a btrfs_bio gets split, only 'bbio->csum_search_commit_root' gets copied to the new btrfs_bio, all the other flags don't. When a bio is split in btrfs_submit_chunk(), btrfs_split_bio() creates the new split bio via btrfs_bio_init() which zeroes the struct with memset. Looking at btrfs_split_bio(), it copies csum_search_commit_root from the original but does not copy can_use_append. After the split, the code does: bbio = split; bio = &bbio->bio; This means the split bio (with can_use_append = false) gets submitted, not the original. In btrfs_submit_dev_bio(), the condition: if (btrfs_bio(bio)->can_use_append && btrfs_dev_is_sequential(...)) Will be false for the split bio even when writing to a sequential zone. Does the split bio need to inherit can_use_append from the original? The old code used a local variable use_append which persisted across the split. Copy the rest of the flags as well. Link: https://lore.kernel.org/linux-btrfs/20260125132120.2525146-1-clm@meta.com/ Reported-by: Chris Mason <clm@meta.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: zoned: use local fs_info variable in btrfs_load_block_group_dup()Johannes Thumshirn
btrfs_load_block_group_dup() has a local pointer to fs_info, yet the error prints dereference fs_info from the block_group. Use local fs_info variable to make the code more uniform. Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: zoned: fixup last alloc pointer after extent removal for RAID0/10Naohiro Aota
When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: zoned: fixup last alloc pointer after extent removal for DUPNaohiro Aota
When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: c0d90a79e8e6 ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups") CC: stable@vger.kernel.org # 6.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: zoned: fixup last alloc pointer after extent removal for RAID1Naohiro Aota
When a block group is composed of a sequential write zone and a conventional zone, we recover the (pseudo) write pointer of the conventional zone using the end of the last allocated position. However, if the last extent in a block group is removed, the last extent position will be smaller than the other real write pointer position. Then, that will cause an error due to mismatch of the write pointers. We can fixup this case by moving the alloc_offset to the corresponding write pointer position. Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: remove out label in btrfs_wait_for_commit()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: remove out label in btrfs_init_space_info()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: remove out label in btrfs_check_rw_degradable()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: remove out label in finish_verity()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: remove out label in scrub_find_fill_first_stripe()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: remove out label in lzo_decompress()Filipe Manana
There is no point in having the label since all it does is return the value in the 'ret' variable. Instead make every goto return directly and remove the label. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>