kernel/fs/btrfs/extent_io.c, branch linux-6.18.y

btrfs: add missing RCU unlock in error path in try_release_subpage_extent_buffer()

2026-03-19T15:08:45Z

commit b2840e33127ce0eea880504b7f133e780f567a9b upstream. Call rcu_read_lock() before exiting the loop in try_release_subpage_extent_buffer() because there is a rcu_read_unlock() call past the loop. This has been detected by the Clang thread-safety analyzer. Fixes: ad580dfa388f ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()") CC: stable@vger.kernel.org # 6.18+ Reviewed-by: Qu Wenruo Reviewed-by: Boris Burkov Signed-off-by: Bart Van Assche Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman

btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode

2026-02-26T22:58:58Z

[ Upstream commit 81cea6cd7041ebd42281e0517f856d88527d3326 ] Currently there is only one caller which doesn't populate btrfs_bio::inode, and that's scrub. The idea is scrub doesn't want any automatic csum verification nor read-repair, as everything will be handled by scrub itself. However that behavior is really no different than metadata inode, thus we can reuse btree_inode as btrfs_bio::inode for scrub. The only exception is in btrfs_submit_chunk() where if a bbio is from scrub or data reloc inode, we set rst_search_commit_root to true. This means we still need a way to distinguish scrub from metadata, but that can be done by a new flag inside btrfs_bio. Now btrfs_bio::inode is a mandatory parameter, we can extract fs_info from that inode thus can remove btrfs_bio::fs_info to save 8 bytes from btrfs_bio structure. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Stable-dep-of: b39b26e017c7 ("btrfs: zoned: don't zone append to conventional zone") Signed-off-by: Sasha Levin

btrfs: do not strictly require dirty metadata threshold for metadata writepages

2026-02-06T15:57:40Z

commit 4e159150a9a56d66d247f4b5510bed46fe58aa1c upstream. [BUG] There is an internal report that over 1000 processes are waiting at the io_schedule_timeout() of balance_dirty_pages(), causing a system hang and trigger a kernel coredump. The kernel is v6.4 kernel based, but the root problem still applies to any upstream kernel before v6.18. [CAUSE] From Jan Kara for his wisdom on the dirty page balance behavior first. This cgroup dirty limit was what was actually playing the role here because the cgroup had only a small amount of memory and so the dirty limit for it was something like 16MB. Dirty throttling is responsible for enforcing that nobody can dirty (significantly) more dirty memory than there's dirty limit. Thus when a task is dirtying pages it periodically enters into balance_dirty_pages() and we let it sleep there to slow down the dirtying. When the system is over dirty limit already (either globally or within a cgroup of the running task), we will not let the task exit from balance_dirty_pages() until the number of dirty pages drops below the limit. So in this particular case, as I already mentioned, there was a cgroup with relatively small amount of memory and as a result with dirty limit set at 16MB. A task from that cgroup has dirtied about 28MB worth of pages in btrfs btree inode and these were practically the only dirty pages in that cgroup. So that means the only way to reduce the dirty pages of that cgroup is to writeback the dirty pages of btrfs btree inode, and only after that those processes can exit balance_dirty_pages(). Now back to the btrfs part, btree_writepages() is responsible for writing back dirty btree inode pages. The problem here is, there is a btrfs internal threshold that if the btree inode's dirty bytes are below the 32M threshold, it will not do any writeback. This behavior is to batch as much metadata as possible so we won't write back those tree blocks and then later re-COW them again for another modification. This internal 32MiB is higher than the existing dirty page size (28MiB), meaning no writeback will happen, causing a deadlock between btrfs and cgroup: - Btrfs doesn't want to write back btree inode until more dirty pages - Cgroup/MM doesn't want more dirty pages for btrfs btree inode Thus any process touching that btree inode is put into sleep until the number of dirty pages is reduced. Thanks Jan Kara a lot for the analysis of the root cause. [ENHANCEMENT] Since kernel commit b55102826d7d ("btrfs: set AS_KERNEL_FILE on the btree_inode"), btrfs btree inode pages will only be charged to the root cgroup which should have a much larger limit than btrfs' 32MiB threshold. So it should not affect newer kernels. But for all current LTS kernels, they are all affected by this problem, and backporting the whole AS_KERNEL_FILE may not be a good idea. Even for newer kernels I still think it's a good idea to get rid of the internal threshold at btree_writepages(), since for most cases cgroup/MM has a better view of full system memory usage than btrfs' fixed threshold. For internal callers using btrfs_btree_balance_dirty() since that function is already doing internal threshold check, we don't need to bother them. But for external callers of btree_writepages(), just respect their requests and write back whatever they want, ignoring the internal btrfs threshold to avoid such deadlock on btree inode dirty page balancing. CC: stable@vger.kernel.org CC: Jan Kara Reviewed-by: Boris Burkov Signed-off-by: Qu Wenruo Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman

btrfs: fix beyond-EOF write handling

2026-01-17T15:35:31Z

[ Upstream commit e9e3b22ddfa760762b696ac6417c8d6edd182e49 ] [BUG] For the following write sequence with 64K page size and 4K fs block size, it will lead to file extent items to be inserted without any data checksum: mkfs.btrfs -s 4k -f $dev > /dev/null mount $dev $mnt xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \ -c "truncate 16k" $mnt/foobar umount $mnt This will result the following 2 file extent items to be inserted (extra trace point added to insert_ordered_extent_file_extent()): btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0 btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384 Note for file offset 60K, we're inserting a file extent without any data checksum. Also note that range [32K, 36K) didn't reach insert_ordered_extent_file_extent(), which is the correct behavior as that OE is fully truncated, should not result any file extent. Although file extent at 60K will be later dropped by btrfs_truncate(), if the transaction got committed after file extent inserted but before the file extent dropping, we will have a small window where we have a file extent beyond EOF and without any data checksum. That will cause "btrfs check" to report error. [CAUSE] The sequence happens like this: - Buffered write dirtied the page cache and updated isize Now the inode size is 64K, with the following page cache layout: 0 16K 32K 48K 64K |/////////////| |//| |//| - Truncate the inode to 16K Which will trigger writeback through: btrfs_setsize() |- truncate_setsize() | Now the inode size is set to 16K | |- btrfs_truncate() |- btrfs_wait_ordered_range() for [16K, u64(-1)] |- btrfs_fdatawrite_range() for [16K, u64(-1)} |- extent_writepage() for folio 0 |- writepage_delalloc() | Generated OE for [0, 16K), [32K, 36K] and [60K, 64K) | |- extent_writepage_io() Then inside extent_writepage_io(), the dirty fs blocks are handled differently: - Submit write for range [0, 16K) As they are still inside the inode size (16K). - Mark OE [32K, 36K) as truncated Since we only call btrfs_lookup_first_ordered_range() once, which returned the first OE after file offset 16K. - Mark all OEs inside range [16K, 64K) as finished Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished. For OE [32K, 36K) since it's already marked as truncated, and its truncated length is 0, no file extent will be inserted. For OE [60K, 64K) it has never been submitted thus has no data checksum, and we insert the file extent as usual. This is the root cause of file extent at 60K to be inserted without any data checksum. - Clear dirty flags for range [16K, 64K) It is the function btrfs_folio_clear_dirty() which searches and clears any dirty blocks inside that range. [FIX] The bug itself was introduced a long time ago, way before subpage and large folio support. At that time, fs block size must match page size, thus the range [cur, end) is just one fs block. But later with subpage and large folios, the same range [cur, end) can have multiple blocks and ordered extents. Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping writeback past i_size") was fixing a bug related to subpage/large folios, but it's still utilizing the old range [cur, end), meaning only the first OE will be marked as truncated. The proper fix here is to make EOF handling block-by-block, not trying to handle the whole range to @end. By this we always locate and truncate the OE for every dirty block. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Signed-off-by: David Sterba Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

btrfs: use variable for end offset in extent_writepage_io()

2026-01-17T15:35:30Z

[ Upstream commit 46a23908598f4b8e61483f04ea9f471b2affc58a ] Instead of repeating the expression "start + len" multiple times, store it in a variable and use it where needed. Reviewed-by: Qu Wenruo Reviewed-by: Anand Jain Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

btrfs: truncate ordered extent when skipping writeback past i_size

2026-01-17T15:35:30Z

[ Upstream commit 18de34daa7c62c830be533aace6b7c271e8e95cf ] While running test case btrfs/192 from fstests with support for large folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic btrfs check failures reporting that csum items were missing. Looking into the issue it turned out that btrfs check searches for csum items of a file extent item with a range that spans beyond the i_size of a file and we don't have any, because the kernel's writeback code skips submitting bios for ranges beyond eof. It's not expected however to find a file extent item that crosses the rounded up (by the sector size) i_size value, but there is a short time window where we can end up with a transaction commit leaving this small inconsistency between the i_size and the last file extent item. Example btrfs check output when this happens: $ btrfs check /dev/sdc Opening filesystem to check... Checking filesystem on /dev/sdc UUID: 69642c61-5efb-4367-aa31-cdfd4067f713 [1/8] checking log skipped (none written) [2/8] checking root items [3/8] checking extents [4/8] checking free space tree [5/8] checking fs roots root 5 inode 332 errors 1000, some csum missing ERROR: errors found in fs roots (...) Looking at a tree dump of the fs tree (root 5) for inode 332 we have: $ btrfs inspect-internal dump-tree -t 5 /dev/sdc (...) item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160 generation 17 transid 19 size 610969 nbytes 86016 block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0 sequence 11 flags 0x0(none) atime 1759851068.391327881 (2025-10-07 16:31:08) ctime 1759851068.410098267 (2025-10-07 16:31:08) mtime 1759851068.410098267 (2025-10-07 16:31:08) otime 1759851068.391327881 (2025-10-07 16:31:08) item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13 index 2 namelen 3 name: f1f item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53 generation 19 type 1 (regular) extent data disk byte 21745664 nr 65536 extent data offset 0 nr 65536 ram 65536 extent compression 0 (none) (...) We can see that the file extent item for file offset 589824 has a length of 64K and its number of bytes is 64K. Looking at the inode item we see that its i_size is 610969 bytes which falls within the range of that file extent item [589824, 655360[. Looking into the csum tree: $ btrfs inspect-internal dump-tree /dev/sdc (...) item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200 range start 21565440 end 21770240 length 204800 item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8 range start 1104576512 end 1104584704 length 8192 (..) We see that the csum item number 15 covers the first 24K of the file extent item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664, so we have: 21770240 - 21745664 = 24K We see that the next csum item (number 16) is completely outside the range, so the remaining 40K of the extent doesn't have csum items in the tree. If we round up the i_size to the sector size, we get: round_up(610969, 4096) = 614400 If we subtract from that the file offset for the extent item we get: 614400 - 589824 = 24K So the missing 40K corresponds to the end of the file extent item's range minus the rounded up i_size: 655360 - 614400 = 40K Normally we don't expect a file extent item to span over the rounded up i_size of an inode, since when truncating, doing hole punching and other operations that trim a file extent item, the number of bytes is adjusted. There is however a short time window where the kernel can end up, temporarily,persisting an inode with an i_size that falls in the middle of the last file extent item and the file extent item was not yet trimmed (its number of bytes reduced so that it doesn't cross i_size rounded up by the sector size). The steps (in the kernel) that lead to such scenario are the following: 1) We have inode I as an empty file, no allocated extents, i_size is 0; 2) A buffered write is done for file range [589824, 655360[ (length of 64K) and the i_size is updated to 655360. Note that we got a single large folio for the range (64K); 3) A truncate operation starts that reduces the inode's i_size down to 610969 bytes. The truncate sets the inode's new i_size at btrfs_setsize() by calling truncate_setsize() and before calling btrfs_truncate(); 4) At btrfs_truncate() we trigger writeback for the range starting at 610304 (which is the new i_size rounded down to the sector size) and ending at (u64)-1; 5) During the writeback, at extent_write_cache_pages(), we get from the call to filemap_get_folios_tag(), the 64K folio that starts at file offset 589824 since it contains the start offset of the writeback range (610304); 6) At writepage_delalloc() we find the whole range of the folio is dirty and therefore we run delalloc for that 64K range ([589824, 655360[), reserving a 64K extent, creating an ordered extent, etc; 7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[ because the inode's i_size is 610969 bytes (rounded up by sector size is 614400). There, in the while loop we intentionally skip IO beyond i_size to avoid any unnecessay work and just call btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which has a 40K length); 8) Once the IO finishes we finish the ordered extent by ending up at btrfs_finish_one_ordered(), join transaction N, insert a file extent item in the inode's subvolume tree for file offset 589824 with a number of bytes of 64K, and update the inode's delayed inode item or directly the inode item with a call to btrfs_update_inode_fallback(), which results in storing the new i_size of 610969 bytes; 9) Transaction N is committed either by the transaction kthread or some other task committed it (in response to a sync or fsync for example). At this point we have inode I persisted with an i_size of 610969 bytes and file extent item that starts at file offset 589824 and has a number of bytes of 64K, ending at an offset of 655360 which is beyond the i_size rounded up to the sector size (614400). --> So after a crash or power failure here, the btrfs check program reports that error about missing checksum items for this inode, as it tries to lookup for checksums covering the whole range of the extent; 10) Only after transaction N is committed that at btrfs_truncate() the call to btrfs_start_transaction() starts a new transaction, N + 1, instead of joining transaction N. And it's with transaction N + 1 that it calls btrfs_truncate_inode_items() which updates the file extent item at file offset 589824 to reduce its number of bytes from 64K down to 24K, so that the file extent item's range ends at the i_size rounded up to the sector size (614400 bytes). Fix this by truncating the ordered extent at extent_writepage_io() when we skip writeback because the current offset in the folio is beyond i_size. This ensures we don't ever persist a file extent item with a number of bytes beyond the rounded up (by sector size) value of the i_size. Reviewed-by: Qu Wenruo Reviewed-by: Anand Jain Signed-off-by: Filipe Manana Signed-off-by: David Sterba Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

btrfs: ensure no dirty metadata is written back for an fs with errors

2025-10-30T18:16:01Z

[BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers(). It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free. [CAUSE] If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state. But there are some metadata modifications before that error, and they are still in the btree inode page cache. Since there will be no real transaction commit, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty. And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata. And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free. [FIX] Add a special handling for write_one_eb(), that if the fs is already in an error state, immediately mark the bbio as failure, instead of really submitting them. Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, thus no more new jobs for already stopped-and-freed workqueues. The extra discard in write_one_eb() also acts as an extra safenet. E.g. the transaction abort is triggered by some extent/free space tree corruptions, and since extent/free space tree is already corrupted some tree blocks may be allocated where they shouldn't be (overwriting existing tree blocks). In that case writing them back will further corrupting the fs. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Signed-off-by: David Sterba

btrfs: fix incorrect readahead expansion length

2025-10-13T20:34:08Z

The intent of btrfs_readahead_expand() was to expand to the length of the current compressed extent being read. However, "ram_bytes" is *not* that, in the case where a single physical compressed extent is used for multiple file extents. Consider this case with a large compressed extent C and then later two non-compressed extents N1 and N2 written over C, leaving C1 and C2 pointing to offset/len pairs of C: [ C ] [ N1 ][ C1 ][ N2 ][ C2 ] In such a case, ram_bytes for both C1 and C2 is the full uncompressed length of C. So starting readahead in C1 will expand the readahead past the end of C1, past N2, and into C2. This will then expand readahead again, to C2_start + ram_bytes, way past EOF. First of all, this is totally undesirable, we don't want to read the whole file in arbitrary chunks of the large underlying extent if it happens to exist. Secondly, it results in zeroing the range past the end of C2 up to ram_bytes. This is particularly unpleasant with fs-verity as it can zero and set uptodate pages in the verity virtual space past EOF. This incorrect readahead behavior can lead to verity verification errors, if we iterate in a way that happens to do the wrong readahead. Fix this by using em->len for readahead expansion, not em->ram_bytes, resulting in the expected behavior of stopping readahead at the extent boundary. Reported-by: Max Chernoff Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898 Fixes: 9e9ff875e417 ("btrfs: use readahead_expand() on compressed extents") CC: stable@vger.kernel.org # 6.17 Reviewed-by: Filipe Manana Signed-off-by: Boris Burkov Signed-off-by: David Sterba

btrfs: add unlikely annotations to branches leading to EIO

2025-09-23T06:49:26Z

The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana Signed-off-by: David Sterba

btrfs: prepare compression folio alloc/free for bs > ps cases

2025-09-23T06:49:24Z

This includes the following preparation for bs > ps cases: - Always alloc/free the folio directly if bs > ps This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus affecting all compression algorithms. For btrfs_free_compr_folio() it needs no parameter for now, as we can use the folio size to skip the caching part. For now the change is just to passing a @fs_info into the function, all the folio size assumption is still based on page size. - Properly zero the last folio in compress_file_range() Since the compressed folios can be larger than a page, we need to properly zero the whole folio. - Use correct folio size for btrfs_add_compressed_bio_folios() Instead of page size, use the correct folio size. - Use correct folio size/shift for btrfs_compress_filemap_get_folio() As we are not only using simple page sized folios anymore. - Use correct folio size for btrfs_decompress() There is an ASSERT() making sure the decompressed range is no larger than a page, which will be triggered for bs > ps cases. - Skip readahead for compressed pages Similar to subpage cases. - Make btrfs_alloc_folio_array() to accept a new @order parameter - Add a helper to calculate the minimal folio size All those changes should not affect the existing bs <= ps handling. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba