summaryrefslogtreecommitdiff
path: root/fs/erofs
AgeCommit message (Collapse)Author
2026-03-25erofs: fix .fadvise() for page cache sharingGao Xiang
Currently, .fadvise() doesn't work well if page cache sharing is on since shared inodes belong to a pseudo fs generated with init_pseudo(), and sb->s_bdi is the default one &noop_backing_dev_info. Then, generic_fadvise() will just behave as a no-op if sb->s_bdi is &noop_backing_dev_info, but as the bdev fs (the bdev fs changes inode_to_bdi() instead), it's actually NOT a pure memfs. Let's generate a real bdi for erofs_ishare_mnt instead. Fixes: d86d7817c042 ("erofs: implement .fadvise for page cache share") Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-03-25erofs: update the Kconfig descriptionGao Xiang
Refine the description to better highlight its features and use cases. In addition, add instructions for building it as a module and clarify the compression option. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-03-19erofs: add GFP_NOIO in the bio completion if neededJiucheng Xu
The bio completion path in the process context (e.g. dm-verity) will directly call into decompression rather than trigger another workqueue context for minimal scheduling latencies, which can then call vm_map_ram() with GFP_KERNEL. Due to insufficient memory, vm_map_ram() may generate memory swapping I/O, which can cause submit_bio_wait to deadlock in some scenarios. Trimmed down the call stack, as follows: f2fs_submit_read_io submit_bio //bio_list is initialized. mmc_blk_mq_recovery z_erofs_endio vm_map_ram __pte_alloc_kernel __alloc_pages_direct_reclaim shrink_folio_list __swap_writepage submit_bio_wait //bio_list is non-NULL, hang!!! Use memalloc_noio_{save,restore}() to wrap up this path. Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Jiucheng Xu <jiucheng.xu@amlogic.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-03-17erofs: set fileio bio failed in short read caseSheng Yong
For file-backed mount, IO requests are handled by vfs_iocb_iter_read(). However, it can be interrupted by SIGKILL, returning the number of bytes actually copied. Unused folios in bio are unexpectedly marked as uptodate. vfs_read filemap_read filemap_get_pages filemap_readahead erofs_fileio_readahead erofs_fileio_rq_submit vfs_iocb_iter_read filemap_read filemap_get_pages <= detect signal erofs_fileio_ki_complete <= set all folios uptodate This patch addresses this by setting short read bio with an error directly. Fixes: bc804a8d7e86 ("erofs: handle end of filesystem properly for file-backed mounts") Reported-by: chenguanyou <chenguanyou@xiaomi.com> Signed-off-by: Yunlei He <heyunlei@xiaomi.com> Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-25erofs: fix interlaced plain identification for encoded extentsGao Xiang
Only plain data whose start position and on-disk physical length are both aligned to the block size should be classified as interlaced plain extents. Otherwise, it must be treated as shifted plain extents. This issue was found by syzbot using a crafted compressed image containing plain extents with unaligned physical lengths, which can cause OOB read in z_erofs_transform_plain(). Reported-and-tested-by: syzbot+d988dc155e740d76a331@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/699d5714.050a0220.cdd3c.03e7.GAE@google.com Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-24erofs: remove more unnecessary #ifdefsFerry Meng
Many #ifdefs can be replaced with IS_ENABLED() to improve code readability. No functional changes. Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-23erofs: allow sharing page cache with the same aops onlyHongbo Li
Inode with identical data but different @aops cannot be mixed because the page cache is managed by different subsystems (e.g., @aops for compressed on-disk inodes cannot handle plain on-disk inodes). In this patch, we never allow inodes to share the page cache among plain, compressed, and fileio cases. When a shared inode is created, we initialize @aops that is the same as the initial real inode, and subsequent inodes cannot share the page cache if the inferred @aops differ from the corresponding shared inode. This is reasonable as a first step because, in typical use cases, if an inode is compressible, it will fall into compressed inodes across different filesystem images unless users use plain filesystems. However, in that cases, users will use plain filesystems all the time. Fixes: 5ef3208e3be5 ("erofs: introduce the page cache share feature") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-18Merge tag 'mm-stable-2026-02-18-19-48' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes (Bing Jiao) - "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare mapledtree race and performs a number of cleanups (Liam Howlett) - "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap (Lorenzo Stoakes) - "support batch checking of references and unmapping for large folios" implements batching to greatly improve the performance of reclaiming clean file-backed large folios (Baolin Wang) - "selftests/mm: add memory failure selftests" does as claimed (Miaohe Lin) * tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits) mm/page_alloc: clear page->private in free_pages_prepare() selftests/mm: add memory failure dirty pagecache test selftests/mm: add memory failure clean pagecache test selftests/mm: add memory failure anonymous page test mm: rmap: support batched unmapping for file large folios arm64: mm: implement the architecture-specific clear_flush_young_ptes() arm64: mm: support batch clearing of the young flag for large folios arm64: mm: factor out the address and ptep alignment into a new helper mm: rmap: support batched checks of the references for large folios tools/testing/vma: add VMA userland tests for VMA flag functions tools/testing/vma: separate out vma_internal.h into logical headers tools/testing/vma: separate VMA userland tests into separate files mm: make vm_area_desc utilise vma_flags_t only mm: update all remaining mmap_prepare users to use vma_flags_t mm: update shmem_[kernel]_file_*() functions to use vma_flags_t mm: update secretmem to use VMA flags on mmap_prepare mm: update hugetlbfs to use VMA flags on mmap_prepare mm: add basic VMA flag operation helper functions tools: bitmap: add missing bitmap_[subset(), andnot()] mm: add mk_vma_flags() bitmap flag macro helper ...
2026-02-12mm: update all remaining mmap_prepare users to use vma_flags_tLorenzo Stoakes
We will be shortly removing the vm_flags_t field from vm_area_desc so we need to update all mmap_prepare users to only use the dessc->vma_flags field. This patch achieves that and makes all ancillary changes required to make this possible. This lays the groundwork for future work to eliminate the use of vm_flags_t in vm_area_desc altogether and more broadly throughout the kernel. While we're here, we take the opportunity to replace VM_REMAP_FLAGS with VMA_REMAP_FLAGS, the vma_flags_t equivalent. No functional changes intended. Link: https://lkml.kernel.org/r/fb1f55323799f09fe6a36865b31550c9ec67c225.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Damien Le Moal <dlemoal@kernel.org> [zonefs] Acked-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-09Merge tag 'erofs-for-7.0-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, inode page cache sharing among filesystems on the same machine is now supported, which is particularly useful for high-density hosts running tens of thousands of containers. In addition, we fully isolate the EROFS core on-disk format from other optional encoded layouts since the core on-disk part is designed to be simple, effective, and secure. Users can use the core format to build unique golden immutable images and import their filesystem trees directly from raw block devices via DMA, page-mapped DAX devices, and/or file-backed mounts without having to worry about unnecessary intrinsic consistency issues found in other generic filesystems by design. However, the full vision is still working in progress and will spend more time to achieve final goals. There are other improvements and bug fixes as usual, as listed below: - Support inode page cache sharing among filesystems - Formally separate optional encoded (aka compressed) inode layouts (and the implementations) from the EROFS core on-disk aligned plain format for future zero-trust security usage - Improve performance by caching the fact that an inode does not have a POSIX ACL - Improve LZ4 decompression error reporting - Enable LZMA by default and promote DEFLATE and Zstandard algorithms out of EXPERIMENTAL status - Switch to inode_set_cached_link() to cache symlink lengths - random bugfixes and minor cleanups" * tag 'erofs-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (31 commits) erofs: fix UAF issue for file-backed mounts w/ directio option erofs: update compression algorithm status erofs: fix inline data read failure for ztailpacking pclusters erofs: avoid some unnecessary #ifdefs erofs: handle end of filesystem properly for file-backed mounts erofs: separate plain and compressed filesystems formally erofs: use inode_set_cached_link() erofs: mark inodes without acls in erofs_read_inode() erofs: implement .fadvise for page cache share erofs: support compressed inodes for page cache share erofs: support unencoded inodes for page cache share erofs: pass inode to trace_erofs_read_folio erofs: introduce the page cache share feature erofs: using domain_id in the safer way erofs: add erofs_inode_set_aops helper to set the aops erofs: support user-defined fingerprint name erofs: decouple `struct erofs_anon_fs_type` fs: Export alloc_empty_backing_file erofs: tidy up erofs_init_inode_xattrs() erofs: add missing documentation about `directio` mount option ...
2026-02-09Merge tag 'vfs-7.0-rc1.iomap' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs iomap updates from Christian Brauner: - Erofs page cache sharing preliminaries: Plumb a void *private parameter through iomap_read_folio() and iomap_readahead() into iomap_iter->private, matching iomap DIO. Erofs uses this to replace a bogus kmap_to_page() call, as preparatory work for page cache sharing. - Fix for invalid folio access: Fix an invalid folio access when a folio without iomap_folio_state is fully submitted to the IO helper — the helper may call folio_end_read() at any time, so ctx->cur_folio must be invalidated after full submission. * tag 'vfs-7.0-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: fix invalid folio access after folio_end_read() erofs: hold read context in iomap_iter if needed iomap: stash iomap read ctx in the private field of iomap_iter
2026-02-09Merge tag 'vfs-7.0-rc1.fserror' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs error reporting updates from Christian Brauner: "This contains the changes to support generic I/O error reporting. Filesystems currently have no standard mechanism for reporting metadata corruption and file I/O errors to userspace via fsnotify. Each filesystem (xfs, ext4, erofs, f2fs, etc.) privately defines EFSCORRUPTED, and error reporting to fanotify is inconsistent or absent entirely. This introduces a generic fserror infrastructure built around struct super_block that gives filesystems a standard way to queue metadata and file I/O error reports for delivery to fsnotify. Errors are queued via mempools and queue_work to avoid holding filesystem locks in the notification path; unmount waits for pending events to drain. A new super_operations::report_error callback lets filesystem drivers respond to file I/O errors themselves (to be used by an upcoming XFS self-healing patchset). On the uapi side, EFSCORRUPTED and EUCLEAN are promoted from private per-filesystem definitions to canonical errno.h values across all architectures" * tag 'vfs-7.0-rc1.fserror' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: convert to new fserror helpers xfs: translate fsdax media errors into file "data lost" errors when convenient xfs: report fs metadata errors via fsnotify iomap: report file I/O errors to the VFS fs: report filesystem and file I/O errors to fsnotify uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
2026-02-06erofs: fix UAF issue for file-backed mounts w/ directio optionChao Yu
[ 9.269940][ T3222] Call trace: [ 9.269948][ T3222] ext4_file_read_iter+0xac/0x108 [ 9.269979][ T3222] vfs_iocb_iter_read+0xac/0x198 [ 9.269993][ T3222] erofs_fileio_rq_submit+0x12c/0x180 [ 9.270008][ T3222] erofs_fileio_submit_bio+0x14/0x24 [ 9.270030][ T3222] z_erofs_runqueue+0x834/0x8ac [ 9.270054][ T3222] z_erofs_read_folio+0x120/0x220 [ 9.270083][ T3222] filemap_read_folio+0x60/0x120 [ 9.270102][ T3222] filemap_fault+0xcac/0x1060 [ 9.270119][ T3222] do_pte_missing+0x2d8/0x1554 [ 9.270131][ T3222] handle_mm_fault+0x5ec/0x70c [ 9.270142][ T3222] do_page_fault+0x178/0x88c [ 9.270167][ T3222] do_translation_fault+0x38/0x54 [ 9.270183][ T3222] do_mem_abort+0x54/0xac [ 9.270208][ T3222] el0_da+0x44/0x7c [ 9.270227][ T3222] el0t_64_sync_handler+0x5c/0xf4 [ 9.270253][ T3222] el0t_64_sync+0x1bc/0x1c0 EROFS may encounter above panic when enabling file-backed mount w/ directio mount option, the root cause is it may suffer UAF in below race condition: - z_erofs_read_folio wq s_dio_done_wq - z_erofs_runqueue - erofs_fileio_submit_bio - erofs_fileio_rq_submit - vfs_iocb_iter_read - ext4_file_read_iter - ext4_dio_read_iter - iomap_dio_rw : bio was submitted and return -EIOCBQUEUED - dio_aio_complete_work - dio_complete - dio->iocb->ki_complete (erofs_fileio_ki_complete()) - kfree(rq) : it frees iocb, iocb.ki_filp can be UAF in file_accessed(). - file_accessed : access NULL file point Introduce a reference count in struct erofs_fileio_rq, and initialize it as two, both erofs_fileio_ki_complete() and erofs_fileio_rq_submit() will decrease reference count, the last one decreasing the reference count to zero will free rq. Cc: stable@kernel.org Fixes: fb176750266a ("erofs: add file-backed mount support") Fixes: 6422cde1b0d5 ("erofs: use buffered I/O for file-backed mounts by default") Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-05erofs: update compression algorithm statusGao Xiang
The following changes are proposed in the upcoming Linux 7.0: - Enable LZMA support by default, as it's already in use by Fedora 42/43 and some Android vendors for minimal filesystem sizes; - Promote DEFLATE and Zstandard out of EXPERIMENTAL status, given that they have been landed and well-tested for over a year and are already ready for general use. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-05erofs: fix inline data read failure for ztailpacking pclustersGao Xiang
Compressed folios for ztailpacking pclusters must be valid before adding these pclusters to I/O chains. Otherwise, z_erofs_decompress_pcluster() may assume they are already valid and then trigger a NULL pointer dereference. It is somewhat hard to reproduce because the inline data is in the same block as the tail of the compressed indexes, which are usually read just before. However, it may still happen if a fatal signal arrives while read_mapping_folio() is running, as shown below: erofs: (device dm-1): z_erofs_pcluster_begin: failed to get inline data -4 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 ... pc : z_erofs_decompress_queue+0x4c8/0xa14 lr : z_erofs_decompress_queue+0x160/0xa14 sp : ffffffc08b3eb3a0 x29: ffffffc08b3eb570 x28: ffffffc08b3eb418 x27: 0000000000001000 x26: ffffff8086ebdbb8 x25: ffffff8086ebdbb8 x24: 0000000000000001 x23: 0000000000000008 x22: 00000000fffffffb x21: dead000000000700 x20: 00000000000015e7 x19: ffffff808babb400 x18: ffffffc089edc098 x17: 00000000c006287d x16: 00000000c006287d x15: 0000000000000004 x14: ffffff80ba8f8000 x13: 0000000000000004 x12: 00000006589a77c9 x11: 0000000000000015 x10: 0000000000000000 x9 : 0000000000000000 x8 : 0000000000000000 x7 : 0000000000000000 x6 : 000000000000003f x5 : 0000000000000040 x4 : ffffffffffffffe0 x3 : 0000000000000020 x2 : 0000000000000008 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: z_erofs_decompress_queue+0x4c8/0xa14 z_erofs_runqueue+0x908/0x97c z_erofs_read_folio+0x128/0x228 filemap_read_folio+0x68/0x128 filemap_get_pages+0x44c/0x8b4 filemap_read+0x12c/0x5b8 generic_file_read_iter+0x4c/0x15c do_iter_readv_writev+0x188/0x1e0 vfs_iter_read+0xac/0x1a4 backing_file_read_iter+0x170/0x34c ovl_read_iter+0xf0/0x140 vfs_read+0x28c/0x344 ksys_read+0x80/0xf0 __arm64_sys_read+0x24/0x34 invoke_syscall+0x60/0x114 el0_svc_common+0x88/0xe4 do_el0_svc+0x24/0x30 el0_svc+0x40/0xa8 el0t_64_sync_handler+0x70/0xbc el0t_64_sync+0x1bc/0x1c0 Fix this by reading the inline data before allocating and adding the pclusters to the I/O chains. Fixes: cecf864d3d76 ("erofs: support inline data decompression") Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-and-tested-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-03erofs: avoid some unnecessary #ifdefsFerry Meng
They can either be removed or replaced with IS_ENABLED(). Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-03erofs: handle end of filesystem properly for file-backed mountsGao Xiang
I/O requests beyond the end of the filesystem should be zeroed out, similar to loopback devices and that is what we expect. Fixes: ce63cb62d794 ("erofs: support unencoded inodes for fileio") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-03erofs: separate plain and compressed filesystems formallyGao Xiang
The EROFS on-disk format uses a tiny, plain metadata design that prioritizes performance and minimizes complex inconsistencies against common writable disk filesystems (almost all serious metadata inconsistency cannot happen in well-designed immutable filesystems like EROFS). EROFS deliberately avoids artificial design flaws to eliminate serious security risks from untrusted remote sources by design, although human-made implementation bugs can still happen sometimes. Currently, there is no strict check to prevent compressed inodes, especially LZ4-compressed inodes, from being read in plain filesystems. Starting with erofs-utils 1.0 and Linux 5.3, LZ4_0PADDING sb feature is automatically enabled for LZ4-compressed EROFS images to support in-place decompression. Furthermore, since Linux 5.4 LTS is no longer supported, we no longer need to handle ancient LZ4-compressed EROFS images generated by erofs-utils prior to 1.0. To formally distinguish different filesystem types for improved security: - Use the presence of LZ4_0PADDING or a non-zero `dsb->u1.lz4_max_distance` as a marker for compressed filesystems containing LZ4-compressed inodes only; - For other algorithms, use `dsb->u1.available_compr_algs` bitmap. Note: LZ4_0PADDING has been supported since Linux 5.4 (the first formal kernel version), so exposing it via sysfs is no longer necessary and is now deprecated (but remain it for five more years until 2031): `dsb->u1` has been strictly non-zero for all EROFS images containing compressed inodes starting with erofs-utils v1.3 and it is actually a much better marker for compressed filesystems. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-02-03erofs: use inode_set_cached_link()Gao Xiang
Symlink lengths are now cached in in-memory inodes directly so that readlink can be sped up. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-28erofs: mark inodes without acls in erofs_read_inode()Gao Xiang
Similar to commit 91ef18b567da ("ext4: mark inodes without acls in __ext4_iget()"), the ACL state won't be read when the file owner performs a lookup, and the RCU fast path for lookups won't work because the ACL state remains unknown. If there are no extended attributes, or if the xattr filter indicates that no ACL xattr is present, call cache_no_acl() directly. Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: implement .fadvise for page cache shareHongzhen Luo
This patch implements the .fadvise interface for page cache share. Similar to overlayfs, it drops those clean, unused pages through vfs_fadvise(). Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: support compressed inodes for page cache shareHongzhen Luo
This patch adds page cache sharing functionality for compressed inodes. Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: support unencoded inodes for page cache shareHongbo Li
This patch adds inode page cache sharing functionality for unencoded files. I conducted experiments in the container environment. Below is the memory usage for reading all files in two different minor versions of container images: +-------------------+------------------+-------------+---------------+ | Image | Page Cache Share | Memory (MB) | Memory | | | | | Reduction (%) | +-------------------+------------------+-------------+---------------+ | | No | 241 | - | | redis +------------------+-------------+---------------+ | 7.2.4 & 7.2.5 | Yes | 163 | 33% | +-------------------+------------------+-------------+---------------+ | | No | 872 | - | | postgres +------------------+-------------+---------------+ | 16.1 & 16.2 | Yes | 630 | 28% | +-------------------+------------------+-------------+---------------+ | | No | 2771 | - | | tensorflow +------------------+-------------+---------------+ | 2.11.0 & 2.11.1 | Yes | 2340 | 16% | +-------------------+------------------+-------------+---------------+ | | No | 926 | - | | mysql +------------------+-------------+---------------+ | 8.0.11 & 8.0.12 | Yes | 735 | 21% | +-------------------+------------------+-------------+---------------+ | | No | 390 | - | | nginx +------------------+-------------+---------------+ | 7.2.4 & 7.2.5 | Yes | 219 | 44% | +-------------------+------------------+-------------+---------------+ | tomcat | No | 924 | - | | 10.1.25 & 10.1.26 +------------------+-------------+---------------+ | | Yes | 474 | 49% | +-------------------+------------------+-------------+---------------+ Additionally, the table below shows the runtime memory usage of the container: +-------------------+------------------+-------------+---------------+ | Image | Page Cache Share | Memory (MB) | Memory | | | | | Reduction (%) | +-------------------+------------------+-------------+---------------+ | | No | 35 | - | | redis +------------------+-------------+---------------+ | 7.2.4 & 7.2.5 | Yes | 28 | 20% | +-------------------+------------------+-------------+---------------+ | | No | 149 | - | | postgres +------------------+-------------+---------------+ | 16.1 & 16.2 | Yes | 95 | 37% | +-------------------+------------------+-------------+---------------+ | | No | 1028 | - | | tensorflow +------------------+-------------+---------------+ | 2.11.0 & 2.11.1 | Yes | 930 | 10% | +-------------------+------------------+-------------+---------------+ | | No | 155 | - | | mysql +------------------+-------------+---------------+ | 8.0.11 & 8.0.12 | Yes | 132 | 15% | +-------------------+------------------+-------------+---------------+ | | No | 25 | - | | nginx +------------------+-------------+---------------+ | 7.2.4 & 7.2.5 | Yes | 20 | 20% | +-------------------+------------------+-------------+---------------+ | tomcat | No | 186 | - | | 10.1.25 & 10.1.26 +------------------+-------------+---------------+ | | Yes | 98 | 48% | +-------------------+------------------+-------------+---------------+ Co-developed-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: pass inode to trace_erofs_read_folioHongbo Li
The trace_erofs_read_folio accesses inode information through folio, but this method fails if the real inode is not associated with the folio(such as in the upcoming page cache sharing case). Therefore, we pass the real inode to it so that the inode information can be printed out in that case. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: introduce the page cache share featureHongzhen Luo
Currently, reading files with different paths (or names) but the same content will consume multiple copies of the page cache, even if the content of these page caches is the same. For example, reading identical files (e.g., *.so files) from two different minor versions of container images will cost multiple copies of the same page cache, since different containers have different mount points. Therefore, sharing the page cache for files with the same content can save memory. This introduces the page cache share feature in erofs. It allocate a shared inode and use its page cache as shared. Reads for files with identical content will ultimately be routed to the page cache of the shared inode. In this way, a single page cache satisfies multiple read requests for different files with the same contents. We introduce new mount option `inode_share` to enable the page sharing mode during mounting. This option is used in conjunction with `domain_id` to share the page cache within the same trusted domain. Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: using domain_id in the safer wayHongbo Li
Either the existing fscache usecase or the upcoming page cache sharing case, the `domain_id` should be protected as sensitive information, so we use the safer helpers to allocate, free and display domain_id. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: add erofs_inode_set_aops helper to set the aopsHongbo Li
Add erofs_inode_set_aops helper to set the inode->i_mapping->a_ops and use IS_ENABLED to make it cleaner. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: support user-defined fingerprint nameHongzhen Luo
When creating the EROFS image, users can specify the fingerprint name. This is to prepare for the upcoming inode page cache share. Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: decouple `struct erofs_anon_fs_type`Gao Xiang
- Move the `struct erofs_anon_fs_type` to super.c and expose it in preparation for the upcoming page cache share feature; - Remove the `.owner` field, as they are all internal mounts and fully managed by EROFS. Retaining `.owner` would unnecessarily increment module reference counts, preventing the EROFS kernel module from being unloaded. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23Merge branch 'vfs-7.0.iomap' of ↵Gao Xiang
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull 'vfs-7.0.iomap' to allow iomap page cache users to set `iomap_iter::private` for the upcoming page cache sharing support. It also includes a patch to avoid triggering inline data reads for the FIEMAP operation. Signed-off-by: Gao Xiang <xiang@kernel.org>
2026-01-23erofs: tidy up erofs_init_inode_xattrs()Gao Xiang
Mainly get rid of the use of `struct erofs_xattr_iter`, as it is no longer needed now that meta buffers are used. This also simplifies the code and uses an early return when there are no xattrs. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: tidy up synchronous decompressionGao Xiang
- Get rid of `sbi->opt.max_sync_decompress_pages` since it's fixed as 3 all the time; - Add Z_EROFS_MAX_SYNC_DECOMPRESS_BYTES in bytes instead of in pages, since for non-4K pages, 3-page limitation makes no sense; - Move `sync_decompress` to sbi to avoid unexpected remount impact; - Fold z_erofs_is_sync_decompress() into its caller; - Better description of sysfs entry `sync_decompress`. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: remove useless src in erofs_xattr_copy_to_buffer()Ferry Meng
Use it->kaddr directly. Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: unexport erofs_xattr_prefix()Gao Xiang
It can be simply in xattr.c due to no external users. Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: unexport erofs_getxattr()Gao Xiang
No external users other than those in xattr.c. Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: fix incorrect early exits in volume label handlingGao Xiang
Crafted EROFS images containing valid volume labels can trigger incorrect early returns, leading to folio reference leaks. However, this does not cause system crashes or other severe issues. Fixes: 1cf12c717741 ("erofs: Add support for FS_IOC_GETFSLABEL") Cc: stable@kernel.org Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: fix incorrect early exits for invalid metabox-enabled imagesGao Xiang
Crafted EROFS images with metadata compression enabled can trigger incorrect early returns, leading to folio reference leaks. However, this does not cause system crashes or other severe issues. Fixes: 414091322c63 ("erofs: implement metadata compression") Cc: stable@kernel.org Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: avoid noisy messages for transient -ENOMEMGao Xiang
EROFS may allocate temporary pages using GFP_NOWAIT | GFP_NORETRY when pcl->besteffort is off (e.g., for readahead requests). If the allocation fails, the original request will fall back to synchronous read, so the failure is transient. Such fallback can frequently happen in low memory scenarios, but since these failures are expected and temporary, avoid printing error messages like below: [ 7425.184264] erofs (device sr0): failed to decompress (lz4) -ENOMEM @ pa 148447232 size 28672 => 26788 [ 7426.244267] erofs (device sr0): failed to decompress (lz4) -ENOMEM @ pa 149422080 size 28672 => 15903 [ 7426.245508] erofs (device sr0): failed to decompress (lz4) -ENOMEM @ pa 138440704 size 28672 => 39294 ... [ 7504.258373] erofs (device sr0): failed to decompress (lz4) -ENOMEM @ pa 93581312 size 20480 => 47366 Fixes: 831faabed812 ("erofs: improve decompression error reporting") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: improve LZ4 error stringsGao Xiang
Just like what was done for other algorithms, let's propagate detailed error reasons for LZ4 instead of just -EFSCORRUPTED to users: "corrupted compressed data": the compressed data is malformed or destination buffer is not large enough "unexpected end of stream": the compressed stream ends normally, but without producing enough decompressed data. "compressed data start not found": can be returned by z_erofs_fixup_insize(). Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: simplify the code using for_each_set_bitYuwen Chen
When mounting the EROFS file system, it is necessary to check the available compression algorithms. At this time, the for_each_set_bit function can be used to simplify the code logic. Signed-off-by: Yuwen Chen <ywen.chen@foxmail.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: make z_erofs_crypto[] staticFerry Meng
Reduce the scope of 'z_erofs_crypto[]' that is not used outside of 'decompressor_crypto.c'. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512102025.4mWeBSsf-lkp@intel.com/ Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23erofs: Use %pe format specifier for error pointersFerry Meng
%pe will print a symbolic error name (e.g,. -ENOMEM), opposed to the raw errno (e.g,. -12) produced by PTR_ERR(). Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-14erofs: hold read context in iomap_iter if neededHongbo Li
Introduce `struct erofs_iomap_iter_ctx` to hold both `struct page *` and `void *base`, avoiding bogus use of `kmap_to_page()` in `erofs_iomap_end()`. With this change, fiemap and bmap no longer need to read inline data. Additionally, the upcoming page cache sharing mechanism requires passing the backing inode pointer to `erofs_iomap_{begin,end}()`, as I/O accesses must apply to backing inodes rather than anon inodes. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://patch.msgid.link/20260109102856.598531-3-lihongbo22@huawei.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-13uapi: promote EFSCORRUPTED and EUCLEAN to errno.hDarrick J. Wong
Stop definining these privately and instead move them to the uapi errno.h so that they become canonical instead of copy pasta. Cc: linux-api@vger.kernel.org Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/176826402587.3490369.17659117524205214600.stgit@frogsfrogsfrogs Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12erofs: add setlease file operationJeff Layton
Add the setlease file_operation to erofs_file_fops and erofs_dir_fops, pointing to generic_setlease. A future patch will change the default behavior to reject lease attempts with -EINVAL when there is no setlease file operation defined. Add generic_setlease to retain the ability to set leases on this filesystem. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260108-setlease-6-20-v1-4-ea4dec9b67fa@kernel.org Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-10erofs: fix file-backed mounts no longer working on EROFS partitionsGao Xiang
Sheng Yong reported [1] that Android APEX images didn't work with commit 072a7c7cdbea ("erofs: don't bother with s_stack_depth increasing for now") because "EROFS-formatted APEX file images can be stored within an EROFS-formatted Android system partition." In response, I sent a quick fat-fingered [PATCH v3] to address the report. Unfortunately, the updated condition was incorrect: if (erofs_is_fileio_mode(sbi)) { - sb->s_stack_depth = - file_inode(sbi->dif0.file)->i_sb->s_stack_depth + 1; - if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) { - erofs_err(sb, "maximum fs stacking depth exceeded"); + inode = file_inode(sbi->dif0.file); + if ((inode->i_sb->s_op == &erofs_sops && !sb->s_bdev) || + inode->i_sb->s_stack_depth) { The condition `!sb->s_bdev` is always true for all file-backed EROFS mounts, making the check effectively a no-op. The real fix tested and confirmed by Sheng Yong [2] at that time was [PATCH v3 RESEND], which correctly ensures the following EROFS^2 setup works: EROFS (on a block device) + EROFS (file-backed mount) But sadly I screwed it up again by upstreaming the outdated [PATCH v3]. This patch applies the same logic as the delta between the upstream [PATCH v3] and the real fix [PATCH v3 RESEND]. Reported-by: Sheng Yong <shengyong1@xiaomi.com> Closes: https://lore.kernel.org/r/3acec686-4020-4609-aee4-5dae7b9b0093@gmail.com [1] Fixes: 072a7c7cdbea ("erofs: don't bother with s_stack_depth increasing for now") Link: https://lore.kernel.org/r/243f57b8-246f-47e7-9fb1-27a771e8e9e8@gmail.com [2] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>