summaryrefslogtreecommitdiff
path: root/include/linux/fs.h
AgeCommit message (Collapse)Author
2025-07-31Merge tag 'fsnotify_for_v6.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "A couple of small improvements for fsnotify subsystem. The most interesting is probably Amir's change modifying the meaning of fsnotify fmode bits (and I spell it out specifically because I know you care about those). There's no change for the common cases of no fsnotify watches or no permission event watches. But when there are permission watches (either for open or for pre-content events) but no FAN_ACCESS_PERM watch (which nobody uses in practice) we are now able optimize away unnecessary cache loads from the read path" * tag 'fsnotify_for_v6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: optimize FMODE_NONOTIFY_PERM for the common cases fsnotify: merge file_set_fsnotify_mode_from_watchers() with open perm hook samples: fix building fs-monitor on musl systems fanotify: sanitize handle_type values when reporting fid
2025-07-28Merge tag 'vfs-6.17-rc1.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock callback update from Christian Brauner: "Currently all filesystems which implement super_operations::shutdown() can not afford losing a device. Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for the involved filesystem. But it will no longer be the case, as multi-device filesystems like btrfs can handle certain device loss without the need to shutdown the whole filesystem. To allow those multi-device filesystems to be integrated to use fs_holder_ops: - Add a new super_operations::remove_bdev() callback - Try ->remove_bdev() callback first inside fs_bdev_mark_dead(). If the callback returned 0, meaning the fs can handling the device loss, then exit without doing anything else. If there is no such callback or the callback returned non-zero value, continue to shutdown the filesystem as usual. This means the new remove_bdev() should only do the check on whether the operation can continue, and if so do the fs specific handlings. The shutdown handling should still be handled by the existing ->shutdown() callback. For all existing filesystems with shutdown callback, there is no change to the code nor behavior. Btrfs is going to implement both the ->remove_bdev() and ->shutdown() callbacks soon" * tag 'vfs-6.17-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add a new remove_bdev() callback
2025-07-28Merge tag 'vfs-6.17-rc1.fileattr' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fileattr updates from Christian Brauner: "This introduces the new file_getattr() and file_setattr() system calls after lengthy discussions. Both system calls serve as successors and extensible companions to the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have started to show their age in addition to being named in a way that makes it easy to conflate them with extended attribute related operations. These syscalls allow userspace to set filesystem inode attributes on special files. One of the usage examples is the XFS quota projects. XFS has project quotas which could be attached to a directory. All new inodes in these directories inherit project ID set on parent directory. The project is created from userspace by opening and calling FS_IOC_FSSETXATTR on each inode. This is not possible for special files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left with empty project ID. Those inodes then are not shown in the quota accounting but still exist in the directory. This is not critical but in the case when special files are created in the directory with already existing project quota, these new inodes inherit extended attributes. This creates a mix of special files with and without attributes. Moreover, special files with attributes don't have a possibility to become clear or change the attributes. This, in turn, prevents userspace from re-creating quota project on these existing files. In addition, these new system calls allow the implementation of additional attributes that we couldn't or didn't want to fit into the legacy ioctls anymore" * tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: tighten a sanity check in file_attr_to_fileattr() tree-wide: s/struct fileattr/struct file_kattr/g fs: introduce file_getattr and file_setattr syscalls fs: prepare for extending file_get/setattr() fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP selinux: implement inode_file_[g|s]etattr hooks lsm: introduce new hooks for setting/getting inode fsxattr fs: split fileattr related helpers into separate file
2025-07-28Merge tag 'vfs-6.17-rc1.mmap_prepare' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull mmap_prepare updates from Christian Brauner: "Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). This is preferred to the existing f_op->mmap() hook as it does require a VMA to be established yet, thus allowing the mmap logic to invoke this hook far, far earlier, prior to inserting a VMA into the virtual address space, or performing any other heavy handed operations. This allows for much simpler unwinding on error, and for there to be a single attempt at merging a VMA rather than having to possibly reattempt a merge based on potentially altered VMA state. Far more importantly, it prevents inappropriate manipulation of incompletely initialised VMA state, which is something that has been the cause of bugs and complexity in the past. The intent is to gradually deprecate f_op->mmap, and in that vein this series coverts the majority of file systems to using f_op->mmap_prepare. Prerequisite steps are taken - firstly ensuring all checks for mmap capabilities use the file_has_valid_mmap_hooks() helper rather than directly checking for f_op->mmap (which is now not a valid check) and secondly updating daxdev_mapping_supported() to not require a VMA parameter to allow ext4 and xfs to be converted. Commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for nested file systems") handles the nasty edge-case of nested file systems like overlayfs, which introduces a compatibility shim to allow f_op->mmap_prepare() to be invoked from an f_op->mmap() callback. This allows for nested filesystems to continue to function correctly with all file systems regardless of which callback is used. Once we finally convert all file systems, this shim can be removed. As a result, ecryptfs, fuse, and overlayfs remain unaltered so they can nest all other file systems. We additionally do not update resctl - as this requires an update to remap_pfn_range() (or an alternative to it) which we defer to a later series, equally we do not update cramfs which needs a mixed mapping insertion with the same issue, nor do we update procfs, hugetlbfs, syfs or kernfs all of which require VMAs for internal state and hooks. We shall return to all of these later" * tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: update porting, vfs documentation to describe mmap_prepare() fs: replace mmap hook with .mmap_prepare for simple mappings fs: convert most other generic_file_*mmap() users to .mmap_prepare() fs: convert simple use of generic_file_*_mmap() to .mmap_prepare() mm/filemap: introduce generic_file_*_mmap_prepare() helpers fs/xfs: transition from deprecated .mmap hook to .mmap_prepare fs/ext4: transition from deprecated .mmap hook to .mmap_prepare fs/dax: make it possible to check dev dax support without a VMA fs: consistently use can_mmap_file() helper mm/nommu: use file_has_valid_mmap_hooks() helper mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
2025-07-28Merge tag 'vfs-6.17-rc1.ovl' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull overlayfs updates from Christian Brauner: "This contains overlayfs updates for this cycle. The changes for overlayfs in here are primarily focussed on preparing for some proposed changes to directory locking. Overlayfs currently will sometimes lock a directory on the upper filesystem and do a few different things while holding the lock. This is incompatible with the new potential scheme. This series narrows the region of code protected by the directory lock, taking it multiple times when necessary. This theoretically opens up the possibilty of other changes happening on the upper filesytem between the unlock and the lock. To some extent the patches guard against that by checking the dentries still have the expect parent after retaking the lock. In general, concurrent changes to the upper and lower filesystems aren't supported properly anyway" * tag 'vfs-6.17-rc1.ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits) ovl: properly print correct variable ovl: rename ovl_cleanup_unlocked() to ovl_cleanup() ovl: change ovl_create_real() to receive dentry parent ovl: narrow locking in ovl_check_rename_whiteout() ovl: narrow locking in ovl_whiteout() ovl: change ovl_cleanup_and_whiteout() to take rename lock as needed ovl: narrow locking on ovl_remove_and_whiteout() ovl: change ovl_workdir_cleanup() to take dir lock as needed. ovl: narrow locking in ovl_workdir_cleanup_recurse() ovl: narrow locking in ovl_indexdir_cleanup() ovl: narrow locking in ovl_workdir_create() ovl: narrow locking in ovl_cleanup_index() ovl: narrow locking in ovl_cleanup_whiteouts() ovl: narrow locking in ovl_rename() ovl: simplify gotos in ovl_rename() ovl: narrow locking in ovl_create_over_whiteout() ovl: narrow locking in ovl_clear_empty() ovl: narrow locking in ovl_create_upper() ovl: narrow the locked region in ovl_copy_up_workdir() ovl: Call ovl_create_temp() without lock held. ...
2025-07-28Merge tag 'vfs-6.17-rc1.coredump' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull coredump updates from Christian Brauner: "This contains an extension to the coredump socket and a proper rework of the coredump code. - This extends the coredump socket to allow the coredump server to tell the kernel how to process individual coredumps. This allows for fine-grained coredump management. Userspace can decide to just let the kernel write out the coredump, or generate the coredump itself, or just reject it. * COREDUMP_KERNEL The kernel will write the coredump data to the socket. * COREDUMP_USERSPACE The kernel will not write coredump data but will indicate to the parent that a coredump has been generated. This is used when userspace generates its own coredumps. * COREDUMP_REJECT The kernel will skip generating a coredump for this task. * COREDUMP_WAIT The kernel will prevent the task from exiting until the coredump server has shutdown the socket connection. The flexible coredump socket can be enabled by using the "@@" prefix instead of the single "@" prefix for the regular coredump socket: @@/run/systemd/coredump.socket - Cleanup the coredump code properly while we have to touch it anyway. Split out each coredump mode in a separate helper so it's easy to grasp what is going on and make the code easier to follow. The core coredump function should now be very trivial to follow" * tag 'vfs-6.17-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits) cleanup: add a scoped version of CLASS() coredump: add coredump_skip() helper coredump: avoid pointless variable coredump: order auto cleanup variables at the top coredump: add coredump_cleanup() coredump: auto cleanup prepare_creds() cred: add auto cleanup method coredump: directly return coredump: auto cleanup argv coredump: add coredump_write() coredump: use a single helper for the socket coredump: move pipe specific file check into coredump_pipe() coredump: split pipe coredumping into coredump_pipe() coredump: move core_pipe_count to global variable coredump: prepare to simplify exit paths coredump: split file coredumping into coredump_file() coredump: rename do_coredump() to vfs_coredump() selftests/coredump: make sure invalid paths are rejected coredump: validate socket path in coredump_parse() coredump: don't allow ".." in coredump socket path ...
2025-07-28Merge tag 'vfs-6.17-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc VFS updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add ext4 IOCB_DONTCACHE support This refactors the address_space_operations write_begin() and write_end() callbacks to take const struct kiocb * as their first argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate to the filesystem's buffered I/O path. Ext4 is updated to implement handling of the IOCB_DONTCACHE flag and advertises support via the FOP_DONTCACHE file operation flag. Additionally, the i915 driver's shmem write paths are updated to bypass the legacy write_begin/write_end interface in favor of directly calling write_iter() with a constructed synchronous kiocb. Another i915 change replaces a manual write loop with kernel_write() during GEM shmem object creation. Cleanups: - don't duplicate vfs_open() in kernel_file_open() - proc_fd_getattr(): don't bother with S_ISDIR() check - fs/ecryptfs: replace snprintf with sysfs_emit in show function - vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() - filelock: add new locks_wake_up_waiter() helper - fs: Remove three arguments from block_write_end() - VFS: change old_dir and new_dir in struct renamedata to dentrys - netfs: Remove unused declaration netfs_queue_write_request() Fixes: - eventpoll: Fix semi-unbounded recursion - eventpoll: fix sphinx documentation build warning - fs/read_write: Fix spelling typo - fs: annotate data race between poll_schedule_timeout() and pollwake() - fs/pipe: set FMODE_NOWAIT in create_pipe_files() - docs/vfs: update references to i_mutex to i_rwsem - fs/buffer: remove comment about hard sectorsize - fs/buffer: remove the min and max limit checks in __getblk_slow() - fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable - fs_context: fix parameter name in infofc() macro - fs: Prevent file descriptor table allocations exceeding INT_MAX" * tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits) netfs: Remove unused declaration netfs_queue_write_request() eventpoll: fix sphinx documentation build warning ext4: support uncached buffered I/O mm/pagemap: add write_begin_get_folio() helper function fs: change write_begin/write_end interface to take struct kiocb * drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter drm/i915: Use kernel_write() in shmem object create eventpoll: Fix semi-unbounded recursion vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable fs/buffer: remove the min and max limit checks in __getblk_slow() fs: Prevent file descriptor table allocations exceeding INT_MAX fs: Remove three arguments from block_write_end() fs/ecryptfs: replace snprintf with sysfs_emit in show function fs: annotate suspected data race between poll_schedule_timeout() and pollwake() docs/vfs: update references to i_mutex to i_rwsem fs/buffer: remove comment about hard sectorsize fs_context: fix parameter name in infofc() macro VFS: change old_dir and new_dir in struct renamedata to dentrys proc_fd_getattr(): don't bother with S_ISDIR() check ...
2025-07-28Merge tag 'pull-rpc_pipefs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull rpc_pipefs updates from Al Viro: "Massage rpc_pipefs to use saner primitives and clean up the APIs provided to the rest of the kernel" * tag 'pull-rpc_pipefs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: rpc_create_client_dir(): return 0 or -E... rpc_create_client_dir(): don't bother with rpc_populate() rpc_new_dir(): the last argument is always NULL rpc_pipe: expand the calls of rpc_mkdir_populate() rpc_gssd_dummy_populate(): don't bother with rpc_populate() rpc_mkpipe_dentry(): switch to simple_start_creating() rpc_pipe: saner primitive for creating regular files rpc_pipe: saner primitive for creating subdirectories rpc_pipe: don't overdo directory locking rpc_mkpipe_dentry(): saner calling conventions rpc_unlink(): saner calling conventions rpc_populate(): lift cleanup into callers rpc_unlink(): use simple_recursive_removal() rpc_{rmdir_,}depopulate(): use simple_recursive_removal() instead rpc_pipe: clean failure exits in fill_super new helper: simple_start_creating()
2025-07-28Merge tag 'pull-simple_recursive_removal' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull simple_recursive_removal() update from Al Viro: "Removing subtrees of kernel filesystems is done in quite a few places; unfortunately, it's easy to get wrong. A number of open-coded attempts are out there, with varying amount of bogosities. simple_recursive_removal() had been introduced for doing that with all precautions needed; it does an equivalent of rm -rf, with sufficient locking, eviction of anything mounted on top of the subtree, etc. This series converts a bunch of open-coded instances to using that" * tag 'pull-simple_recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: functionfs, gadgetfs: use simple_recursive_removal() kill binderfs_remove_file() fuse_ctl: use simple_recursive_removal() pstore: switch to locked_recursive_removal() binfmt_misc: switch to locked_recursive_removal() spufs: switch to locked_recursive_removal() add locked_recursive_removal() better lockdep annotations for simple_recursive_removal() simple_recursive_removal(): saner interaction with fsnotify
2025-07-28Merge tag 'pull-dcache' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dentry d_flags updates from Al Viro: "The current exclusion rules for dentry->d_flags stores are rather unpleasant. The basic rules are simple: - stores to dentry->d_flags are OK under dentry->d_lock - stores to dentry->d_flags are OK in the dentry constructor, before becomes potentially visible to other threads Unfortunately, there's a couple of exceptions to that, and that's where the headache comes from. The main PITA comes from d_set_d_op(); that primitive sets ->d_op of dentry and adjusts the flags that correspond to presence of individual methods. It's very easy to misuse; existing uses _are_ safe, but proof of correctness is brittle. Use in __d_alloc() is safe (we are within a constructor), but we might as well precalculate the initial value of 'd_flags' when we set the default ->d_op for given superblock and set 'd_flags' directly instead of messing with that helper. The reasons why other uses are safe are bloody convoluted; I'm not going to reproduce it here. See [1] for gory details, if you care. The critical part is using d_set_d_op() only just prior to d_splice_alias(), which makes a combination of d_splice_alias() with setting ->d_op, etc a natural replacement primitive. Better yet, if we go that way, it's easy to take setting ->d_op and modifying 'd_flags' under ->d_lock, which eliminates the headache as far as 'd_flags' exclusion rules are concerned. Other exceptions are minor and easy to deal with. What this series does: - d_set_d_op() is no longer available; instead a new primitive (d_splice_alias_ops()) is provided, equivalent to combination of d_set_d_op() and d_splice_alias(). - new field of struct super_block - 's_d_flags'. This sets the default value of 'd_flags' to be used when allocating dentries on this filesystem. - new primitive for setting 's_d_op': set_default_d_op(). This replaces stores to 's_d_op' at mount time. All in-tree filesystems converted; out-of-tree ones will get caught by the compiler ('s_d_op' is renamed, so stores to it will be caught). 's_d_flags' is set by the same primitive to match the 's_d_op'. - a lot of filesystems had sb->s_d_op->d_delete equal to always_delete_dentry; that is equivalent to setting DCACHE_DONTCACHE in 'd_flags', so such filesystems can bloody well set that bit in 's_d_flags' and drop 'd_delete()' from dentry_operations. In quite a few cases that results in empty dentry_operations, which means that we can get rid of those. - kill simple_dentry_operations - not needed anymore - massage d_alloc_parallel() to get rid of the other exception wrt 'd_flags' stores - we can set DCACHE_PAR_LOOKUP as soon as we allocate the new dentry; no need to delay that until we commit to using the sucker. As the result, 'd_flags' stores are all either under ->d_lock or done before the dentry becomes visible in any shared data structures" Link: https://lore.kernel.org/all/20250224010624.GT1977892@ZenIV/ [1] * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits) configfs: use DCACHE_DONTCACHE debugfs: use DCACHE_DONTCACHE efivarfs: use DCACHE_DONTCACHE instead of always_delete_dentry() 9p: don't bother with always_delete_dentry ramfs, hugetlbfs, mqueue: set DCACHE_DONTCACHE kill simple_dentry_operations devpts, sunrpc, hostfs: don't bother with ->d_op shmem: no dentry retention past the refcount reaching zero d_alloc_parallel(): set DCACHE_PAR_LOOKUP earlier make d_set_d_op() static simple_lookup(): just set DCACHE_DONTCACHE tracefs: Add d_delete to remove negative dentries set_default_d_op(): calculate the matching value for ->d_flags correct the set of flags forbidden at d_set_d_op() time split d_flags calculation out of d_set_d_op() new helper: set_default_d_op() fuse: no need for special dentry_operations for root dentry switch procfs from d_set_d_op() to d_splice_alias_ops() new helper: d_splice_alias_ops() procfs: kill ->proc_dops ...
2025-07-28fsnotify: optimize FMODE_NONOTIFY_PERM for the common casesAmir Goldstein
The most unlikely watched permission event is FAN_ACCESS_PERM, because at the time that it was introduced there were no evictable ignore mark, so subscribing to FAN_ACCESS_PERM would have incured a very high overhead. Yet, when we set the fmode to FMODE_NOTIFY_HSM(), we never skip trying to send FAN_ACCESS_PERM, which is almost always a waste of cycles. We got to this logic because of bundling FAN_OPEN*_PERM and FAN_ACCESS_PERM in the same category and because FAN_OPEN_PERM is a commonly used event. By open coding fsnotify_open_perm() in fsnotify_open_perm_and_set_mode(), we no longer need to regard FAN_OPEN*_PERM when calculating fmode. This leaves the case of having pre-content events and not having any other permission event in the object masks a more likely case than the other way around. Rework the fmode macros and code so that their meaning now refers only to hooks on an already open file: - FMODE_NOTIFY_NONE() skip all events - FMODE_NOTIFY_ACCESS_PERM() send all permission events including FAN_ACCESS_PERM - FMODE_NOTIFY_HSM() send pre-content permission events Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250708143641.418603-3-amir73il@gmail.com
2025-07-18fs: constify file ptr in backing_file accessor helpersAmir Goldstein
Add internal helper backing_file_set_user_path() for the only two cases that need to modify backing_file fields. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250607115304.2521155-2-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16fs: change write_begin/write_end interface to take struct kiocb *Taotao Chen
Change the address_space_operations callbacks write_begin() and write_end() to take struct kiocb * as the first argument instead of struct file *. Update all affected function prototypes, implementations, call sites, and related documentation across VFS, filesystems, and block layer. Part of a series refactoring address_space_operations write_begin and write_end callbacks to use struct kiocb for passing write context and flags. Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-15fs: add a new remove_bdev() callbackQu Wenruo
Currently all filesystems which implement super_operations::shutdown() can not afford losing a device. Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for the involved filesystem. But it will no longer be the case, as multi-device filesystems like btrfs and bcachefs can handle certain device loss without the need to shutdown the whole filesystem. To allow those multi-device filesystems to be integrated to use fs_holder_ops: - Add a new super_operations::remove_bdev() callback - Try ->remove_bdev() callback first inside fs_bdev_mark_dead() If the callback returned 0, meaning the fs can handling the device loss, then exit without doing anything else. If there is no such callback or the callback returned non-zero value, continue to shutdown the filesystem as usual. This means the new remove_bdev() should only do the check on whether the operation can continue, and if so do the fs specific handlings. The shutdown handling should still be handled by the existing ->shutdown() callback. For all existing filesystems with shutdown callback, there is no change to the code nor behavior. Btrfs is going to implement both the ->remove_bdev() and ->shutdown() callbacks soon. Signed-off-by: Qu Wenruo <wqu@suse.com> Link: https://lore.kernel.org/09909fcff7f2763cc037fec97ac2482bdc0a12cb.1752470276.git.wqu@suse.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04Merge tag 'vfs-6.16-rc5.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a regression caused by the anonymous inode rework. Making them regular files causes various places in the kernel to tip over starting with io_uring. Revert to the former status quo and port our assertion to be based on checking the inode so we don't lose the valuable VFS_*_ON_*() assertions that have already helped discover weird behavior our outright bugs. - Fix the the upper bound calculation in fuse_fill_write_pages() - Fix priority inversion issues in the eventpoll code - Make secretmen use anon_inode_make_secure_inode() to avoid bypassing the LSM layer - Fix a netfs hang due to missing case in final DIO read result collection - Fix a double put of the netfs_io_request struct - Provide some helpers to abstract out NETFS_RREQ_IN_PROGRESS flag wrangling - Fix infinite looping in netfs_wait_for_pause/request() - Fix a netfs ref leak on an extra subrequest inserted into a request's list of subreqs - Fix various cifs RPC callbacks to set NETFS_SREQ_NEED_RETRY if a subrequest fails retriably - Fix a cifs warning in the workqueue code when reconnecting a channel - Fix the updating of i_size in netfs to avoid a race between testing if we should have extended the file with a DIO write and changing i_size - Merge the places in netfs that update i_size on write - Fix coredump socket selftests * tag 'vfs-6.16-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: anon_inode: rework assertions netfs: Update tracepoints in a number of ways netfs: Renumber the NETFS_RREQ_* flags to make traces easier to read netfs: Merge i_size update functions netfs: Fix i_size updating smb: client: set missing retry flag in cifs_writev_callback() smb: client: set missing retry flag in cifs_readv_callback() smb: client: set missing retry flag in smb2_writev_callback() netfs: Fix ref leak on inserted extra subreq in write retry netfs: Fix looping in wait functions netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling netfs: Fix double put of request netfs: Fix hang due to missing case in final DIO read result collection eventpoll: Fix priority inversion problem fuse: fix fuse_fill_write_pages() upper bound calculation fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass selftests/coredump: Fix "socket_detect_userspace_client" test failure
2025-07-04tree-wide: s/struct fileattr/struct file_kattr/gChristian Brauner
Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02new helper: simple_start_creating()Al Viro
Set the things up for kernel-initiated creation of object in a tree-in-dcache filesystem. With respect to locking it's an equivalent of filename_create() - we either get a negative dentry with locked parent, or ERR_PTR() and no locks taken. tracefs and debugfs had that open-coded as part of their object creation machinery; switched to calling new helper. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-02add locked_recursive_removal()Al Viro
simple_recursive_removal() assumes that parent is not locked and locks it when it finally gets to removing the victim itself. Usually that's what we want, but there are places where the parent is *already* locked and we need it to stay that way. In those cases simple_recursive_removal() would, of course, deadlock, so we have to play racy games with unlocking/relocking the parent around the call or open-code the entire thing. A better solution is to provide a variant that expects to be called with the parent already locked by the caller. Parent should be locked with I_MUTEX_PARENT, to avoid false positives from lockdep. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-23fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypassShivank Garg
Export anon_inode_make_secure_inode() to allow KVM guest_memfd to create anonymous inodes with proper security context. This replaces the current pattern of calling alloc_anon_inode() followed by inode_init_security_anon() for creating security context manually. This change also fixes a security regression in secretmem where the S_PRIVATE flag was not cleared after alloc_anon_inode(), causing LSM/SELinux checks to be bypassed for secretmem file descriptors. As guest_memfd currently resides in the KVM module, we need to export this symbol for use outside the core kernel. In the future, guest_memfd might be moved to core-mm, at which point the symbols no longer would have to be exported. When/if that happens is still unclear. Fixes: 2bfe15c52612 ("mm: create security context for memfd_secret inodes") Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/20250620070328.803704-3-shivankg@amd.com Acked-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23docs/vfs: update references to i_mutex to i_rwsemJunxuan Liao
VFS has switched to i_rwsem for ten years now (9902af79c01a: parallel lookups actual switch to rwsem), but the VFS documentation and comments still has references to i_mutex. Signed-off-by: Junxuan Liao <ljx@cs.wisc.edu> Link: https://lore.kernel.org/72223729-5471-474a-af3c-f366691fba82@cs.wisc.edu Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-17mm/filemap: introduce generic_file_*_mmap_prepare() helpersLorenzo Stoakes
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"), the f_op->mmap() hook has been deprecated in favour of f_op->mmap_prepare(). The generic mmap handlers are very simple, so we can very easily convert these in advance of converting file systems which use them. This patch does so. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/30622c1f0b98c66840bc8c02668bda276a810b70.1750099179.git.lorenzo.stoakes@oracle.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-17fs: consistently use can_mmap_file() helperLorenzo Stoakes
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"), the f_op->mmap() hook has been deprecated in favour of f_op->mmap_prepare(). Additionally, commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for nested file systems") permits the use of the .mmap_prepare() hook even in nested filesystems like overlayfs. There are a number of places where we check only for f_op->mmap - this is incorrect now mmap_prepare exists, so update all of these to use the general helper can_mmap_file(). Most notably, this updates the elf logic to allow for the ability to execute binaries on filesystems which have the .mmap_prepare hook, but additionally we update nested filesystems. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/b68145b609532e62bab603dd9686faa6562046ec.1750099179.git.lorenzo.stoakes@oracle.com Acked-by: Kees Cook <kees@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-17mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepareLorenzo Stoakes
The call_mmap() function violates the existing convention in include/linux/fs.h whereby invocations of virtual file system hooks is performed by functions prefixed with vfs_xxx(). Correct this by renaming call_mmap() to vfs_mmap(). This also avoids confusion as to the fact that f_op->mmap_prepare may be invoked here. Also rename __call_mmap_prepare() function to vfs_mmap_prepare() and adjust to accept a file parameter, this is useful later for nested file systems. Finally, fix up the VMA userland tests and ensure the mmap_prepare -> mmap shim is implemented there. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/8d389f4994fa736aa8f9172bef8533c10a9e9011.1750099179.git.lorenzo.stoakes@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16Merge tag 'vfs-6.16-rc3.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a regression in overlayfs caused by reworking the lookup_one*() set of helpers - Make sure that the name of the dentry is printed in overlayfs' mkdir() helper - Add missing iocb values to TRACE_IOCB_STRINGS define - Unlock the superblock during iterate_supers_type(). This was an accidental internal api change - Drop a misleading assert in file_seek_cur_needs_f_lock() helper - Never refuse to return PIDFD_GET_INGO when parent pid is zero That can trivially happen in container scenarios where the parent process might be located in an ancestor pid namespace - Don't revalidate in try_lookup_noperm() as that causes regression for filesystems such as cifs - Fix simple_xattr_list() and reset the err variable after security_inode_listsecurity() got called so as not to confuse userspace about the length of the xattr * tag 'vfs-6.16-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: drop assert in file_seek_cur_needs_f_lock fs: unlock the superblock during iterate_supers_type ovl: fix debug print in case of mkdir error VFS: change try_lookup_noperm() to skip revalidation fs: add missing values to TRACE_IOCB_STRINGS fs/xattr.c: fix simple_xattr_list() ovl: fix regression caused by lookup helpers API changes pidfs: never refuse ppid == 0 in PIDFD_GET_INFO
2025-06-16fs: move name_contains_dotdot() to headerChristian Brauner
Move the helper from the firmware specific code to a header so we can reuse it for coredump sockets. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-5-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-16VFS: change old_dir and new_dir in struct renamedata to dentrysNeilBrown
all users of 'struct renamedata' have the dentry for the old and new directories, and often have no use for the inode except to store it in the renamedata. This patch changes struct renamedata to hold the dentry, rather than the inode, for the old and new directories, and changes callers to match. The names are also changed from a _dir suffix to _parent. This is consistent with other usage in namei.c and elsewhere. This results in the removal of several local variables and several dereferences of ->d_inode at the cost of adding ->d_inode dereferences to vfs_rename(). Acked-by: Miklos Szeredi <miklos@szeredi.hu> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/174977089072.608730.4244531834577097454@noble.neil.brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-12mm: add mmap_prepare() compatibility layer for nested file systemsLorenzo Stoakes
Nested file systems, that is those which invoke call_mmap() within their own f_op->mmap() handlers, may encounter underlying file systems which provide the f_op->mmap_prepare() hook introduced by commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). We have a chicken-and-egg scenario here - until all file systems are converted to using .mmap_prepare(), we cannot convert these nested handlers, as we can't call f_op->mmap from an .mmap_prepare() hook. So we have to do it the other way round - invoke the .mmap_prepare() hook from an .mmap() one. in order to do so, we need to convert VMA state into a struct vm_area_desc descriptor, invoking the underlying file system's f_op->mmap_prepare() callback passing a pointer to this, and then setting VMA state accordingly and safely. This patch achieves this via the compat_vma_mmap_prepare() function, which we invoke from call_mmap() if f_op->mmap_prepare() is specified in the passed in file pointer. We place the fundamental logic into mm/vma.h where VMA manipulation belongs. We also update the VMA userland tests to accommodate the changes. The compat_vma_mmap_prepare() function and its associated machinery is temporary, and will be removed once the conversion of file systems is complete. We carefully place this code so it can be used with CONFIG_MMU and also with cutting edge nommu silicon. [akpm@linux-foundation.org: export compat_vma_mmap_prepare tp fix build] [lorenzo.stoakes@oracle.com: remove unused declarations] Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/ Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-11kill simple_dentry_operationsAl Viro
No users left and anything that wants it would be better off just setting DCACHE_DONTCACHE in their ->s_d_flags. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-11set_default_d_op(): calculate the matching value for ->d_flagsAl Viro
... and store it in ->s_d_flags, to be used by __d_alloc() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-11fs: add missing values to TRACE_IOCB_STRINGSChristoph Hellwig
Make sure all values are covered. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250610140020.2227932-1-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-10new helper: set_default_d_op()Al Viro
... to be used instead of manually assigning to ->s_d_op. All in-tree filesystem converted (and field itself is renamed, so any out-of-tree ones in need of conversion will be caught by compiler). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-02Merge tag 'vfs-6.16-rc2.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix the AT_HANDLE_CONNECTABLE option so filesystems that don't know how to decode a connected non-dir dentry fail the request - Use repr(transparent) to ensure identical layout between the C and Rust implementation of struct file - Add a missing xas_pause() into the dax code employing wait_entry_unlocked_exclusive() - Fix FOP_DONTCACHE which we disabled for v6.15. A folio could get redirtied and/or scheduled for writeback after the initial dropbehind test. Change the test accordingly to handle these cases so we can re-enable FOP_DONTCACHE again * tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exportfs: require ->fh_to_parent() to encode connectable file handles rust: file: improve safety comments rust: file: mark `LocalFile` as `repr(transparent)` fs/dax: Fix "don't skip locked entries when scanning entries" iomap: don't lose folio dropbehind state for overwrites mm/filemap: unify dropbehind flag testing and clearing mm/filemap: unify read/write dropbehind naming Revert "Disable FOP_DONTCACHE for now due to bugs" mm/filemap: use filemap_end_dropbehind() for read invalidation mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
2025-05-31Merge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
2025-05-30Merge tag 'pull-automount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull automount updates from Al Viro: "Automount wart removal A bunch of odd boilerplate gone from instances - the reason for those was the need to protect the yet-to-be-attched mount from mark_mounts_for_expiry() deciding to take it out. But that's easy to detect and take care of in mark_mounts_for_expiry() itself; no need to have every instance simulate mount being busy by grabbing an extra reference to it, with finish_automount() undoing that once it attaches that mount. Should've done it that way from the very beginning... This is a flagday change, thankfully there are very few instances. vfs_submount() is gone - its sole remaining user (trace_automount) had been switched to saner primitives" * tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill vfs_submount() saner calling conventions for ->d_automount()
2025-05-27Revert "Disable FOP_DONTCACHE for now due to bugs"Jens Axboe
This reverts commit 478ad02d6844217cc7568619aeb0809d93ade43d. Both the read and write side dirty && writeback races should be resolved now, revert the commit that disabled FOP_DONTCACHE for filesystems. Link: https://lore.kernel.org/linux-fsdevel/20250525083209.GS2023217@ZenIV/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/20250527133255.452431-4-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-26Merge tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: - Atomic writes for XFS - Remove experimental warnings for pNFS, scrub and parent pointers * tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) xfs: add inode to zone caching for data placement xfs: free the item in xfs_mru_cache_insert on failure xfs: remove the EXPERIMENTAL warning for pNFS xfs: remove some EXPERIMENTAL warnings xfs: Remove deprecated xfs_bufd sysctl parameters xfs: stop using set_blocksize xfs: allow sysadmins to specify a maximum atomic write limit at mount time xfs: update atomic write limits xfs: add xfs_calc_atomic_write_unit_max() xfs: add xfs_file_dio_write_atomic() xfs: commit CoW-based atomic writes atomically xfs: add large atomic writes checks in xfs_direct_write_iomap_begin() xfs: add xfs_atomic_write_cow_iomap_begin() xfs: refine atomic write size check in xfs_file_write_iter() xfs: refactor xfs_reflink_end_cow_extent() xfs: allow block allocator to take an alignment hint xfs: ignore HW which cannot atomic write a single block xfs: add helpers to compute transaction reservation for finishing intent items xfs: add helpers to compute log item overhead xfs: separate out setting buftarg atomic writes limits ...
2025-05-26Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-26Merge tag 'vfs-6.16-rc1.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs freezing updates from Christian Brauner: "This contains various filesystem freezing related work for this cycle: - Allow the power subsystem to support filesystem freeze for suspend and hibernate. Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. If the filesystem is already frozen by the time we've frozen all userspace processes we don't care to freeze it again. That's userspace's job once the process resumes. We only actually freeze filesystems if we absolutely have to and we ignore other failures to freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. - Allow efivars to support freeze and thaw Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. If userspace initiated a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user namespace (as that's where efivarfs is mounted) so it can't be triggered by random userspace. IOW, we really really don't care" * tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: f2fs: fix freezing filesystem during resize kernfs: add warning about implementing freeze/thaw efivarfs: support freeze/thaw power: freeze filesystems during suspend/resume libfs: export find_next_child() super: add filesystem freezing helpers for suspend and hibernate gfs2: pass through holder from the VFS for freeze/thaw super: use common iterator (Part 2) super: use a common iterator (Part 1) super: skip dying superblocks early super: simplify user_get_super() super: remove pointless s_root checks fs: allow all writers to be frozen locking/percpu-rwsem: add freezable alternative to down_read
2025-05-26Merge tag 'vfs-6.16-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Use folios for symlinks in the page cache FUSE already uses folios for its symlinks. Mirror that conversion in the generic code and the NFS code. That lets us get rid of a few folio->page->folio conversions in this path, and some of the few remaining users of read_cache_page() / read_mapping_page() - Try and make a few filesystem operations killable on the VFS inode->i_mutex level - Add sysctl vfs_cache_pressure_denom for bulk file operations Some workloads need to preserve more dentries than we currently allow through out sysctl interface A HDFS servers with 12 HDDs per server, on a HDFS datanode startup involves scanning all files and caching their metadata (including dentries and inodes) in memory. Each HDD contains approximately 2 million files, resulting in a total of ~20 million cached dentries after initialization To minimize dentry reclamation, they set vfs_cache_pressure to 1. Despite this configuration, memory pressure conditions can still trigger reclamation of up to 50% of cached dentries, reducing the cache from 20 million to approximately 10 million entries. During the subsequent cache rebuild period, any HDFS datanode restart operation incurs substantial latency penalties until full cache recovery completes To maintain service stability, more dentries need to be preserved during memory reclamation. The current minimum reclaim ratio (1/100 of total dentries) remains too aggressive for such workload. This patch introduces vfs_cache_pressure_denom for more granular cache pressure control The configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000] effectively maintains the full 20 million dentry cache under memory pressure, preventing datanode restart performance degradation - Avoid some jumps in inode_permission() using likely()/unlikely() - Avid a memory access which is most likely a cache miss when descending into devcgroup_inode_permission() - Add fastpath predicts for stat() and fdput() - Anonymous inodes currently don't come with a proper mode causing issues in the kernel when we want to add useful VFS debug assert. Fix that by giving them a proper mode and masking it off when we report it to userspace which relies on them not having any mode - Anonymous inodes currently allow to change inode attributes because the VFS falls back to simple_setattr() if i_op->setattr isn't implemented. This means the ownership and mode for every single user of anon_inode_inode can be changed. Block that as it's either useless or actively harmful. If specific ownership is needed the respective subsystem should allocate anonymous inodes from their own private superblock - Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock - Add proper tests for anonymous inode behavior - Make it easy to detect proper anonymous inodes and to ensure that we can detect them in codepaths such as readahead() Cleanups: - Port pidfs to the new anon_inode_{g,s}etattr() helpers - Try to remove the uselib() system call - Add unlikely branch hint return path for poll - Add unlikely branch hint on return path for core_sys_select - Don't allow signals to interrupt getdents copying for fuse - Provide a size hint to dir_context for during readdir() - Use writeback_iter directly in mpage_writepages - Update compression and mtime descriptions in initramfs documentation - Update main netfs API document - Remove useless plus one in super_cache_scan() - Remove unnecessary NULL-check guards during setns() - Add separate separate {get,put}_cgroup_ns no-op cases Fixes: - Fix typo in root= kernel parameter description - Use KERN_INFO for infof()|info_plog()|infofc() - Correct comments of fs_validate_description() - Mark an unlikely if condition with unlikely() in vfs_parse_monolithic_sep() - Delete macro fsparam_u32hex() - Remove unused and problematic validate_constant_table() - Fix potential unsigned integer underflow in fs_name() - Make file-nr output the total allocated file handles" * tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits) fs: Pass a folio to page_put_link() nfs: Use a folio in nfs_get_link() fs: Convert __page_get_link() to use a folio fs/read_write: make default_llseek() killable fs/open: make do_truncate() killable fs/open: make chmod_common() and chown_common() killable include/linux/fs.h: add inode_lock_killable() readdir: supply dir_context.count as readdir buffer size hint vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations fuse: don't allow signals to interrupt getdents copying Documentation: fix typo in root= kernel parameter description include/cgroup: separate {get,put}_cgroup_ns no-op case kernel/nsproxy: remove unnecessary guards fs: use writeback_iter directly in mpage_writepages fs: remove useless plus one in super_cache_scan() fs: add S_ANON_INODE fs: remove uselib() system call device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission() fs/fs_parse: Remove unused and problematic validate_constant_table() fs: touch up predicts in inode_permission() ...
2025-05-26Merge tag 'vfs-6.16-rc1.writepage' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull final writepage conversion from Christian Brauner: "This converts vboxfs from ->writepage() to ->writepages(). This was the last user of the ->writepage() method. So remove ->writepage() completely and all references to it" * tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Remove aops->writepage mm: Remove swap_writepage() and shmem_writepage() ttm: Call shmem_writeout() from ttm_backup_backup_page() i915: Use writeback_iter() shmem: Add shmem_writeout() writeback: Remove writeback_use_writepage() migrate: Remove call to ->writepage vboxsf: Convert to writepages 9p: Add a migrate_folio method
2025-05-25Disable FOP_DONTCACHE for now due to bugsLinus Torvalds
This is kind of last-minute, but Al Viro reported that the new FOP_DONTCACHE flag causes memory corruption due to use-after-free issues. This was triggered by commit 974c5e6139db ("xfs: flag as supporting FOP_DONTCACHE"), but that is not the underlying bug - it is just the first user of the flag. Vlastimil Babka suspects the underlying problem stems from the folio_end_writeback() logic introduced in commit fb7d3bc414939 ("mm/filemap: drop streaming/uncached pages when writeback completes"). The most straightforward fix would be to just revert the commit that exposed this, but Matthew Wilcox points out that other filesystems are also starting to enable the FOP_DONTCACHE logic, so this instead disables that bit globally for now. The fix will hopefully end up being trivial and we can just re-enable this logic after more testing, but until such a time we'll have to disable the new FOP_DONTCACHE flag. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/all/20250525083209.GS2023217@ZenIV/ Triggered-by: 974c5e6139db ("xfs: flag as supporting FOP_DONTCACHE") Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-05-22mm: remove WARN_ON_ONCE() in file_has_valid_mmap_hooks()Lorenzo Stoakes
Having encountered a trinity report in linux-next (Linked in the 'Closes' tag) it appears that there are legitimate situations where a file-backed mapping can be acquired but no file->f_op->mmap or file->f_op->mmap_prepare is set, at which point do_mmap() should simply error out with -ENODEV. Since previously we did not warn in this scenario and it appears we rely upon this, restore this situation, while retaining a WARN_ON_ONCE() for the case where both are set, which is absolutely incorrect and must be addressed and thus always requires a warning. If further work is required to chase down precisely what is causing this, then we can later restore this, but it makes no sense to hold up this series to do so, as this is existing and apparently expected behaviour. Link: https://lkml.kernel.org/r/20250514084024.29148-1-lorenzo.stoakes@oracle.com Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202505141434.96ce5e5d-lkp@intel.com Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-15include/linux/fs.h: add inode_lock_killable()Max Kellermann
Prepare for making inode operations killable while they're waiting for the lock. Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/20250513150327.1373061-1-max.kellermann@ionos.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15readdir: supply dir_context.count as readdir buffer size hintMiklos Szeredi
This is a preparation for large readdir buffers in fuse. Simply setting the fuse buffer size to the userspace buffer size should work, the record sizes are similar (fuse's is slightly larger than libc's, so no overflow should ever happen). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jaco Kroon <jaco@uls.co.za> Link: https://lore.kernel.org/20250513151012.1476536-1-mszeredi@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15fuse: don't allow signals to interrupt getdents copyingMiklos Szeredi
When getting the directory contents, the entries are first fetched to a kernel buffer, then they are copied to userspace with dir_emit(). This second phase is non-blocking as long as the userspace buffer is not paged out, making it interruptible makes zero sense. Overload d_type as flags, since it only uses 4 bits from 32. Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/20250513112335.1473177-1-mszeredi@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-13mm: introduce new .mmap_prepare() file callbackLorenzo Stoakes
Patch series "eliminate mmap() retry merge, add .mmap_prepare hook", v2. During the mmap() of a file-backed mapping, we invoke the underlying driver file's mmap() callback in order to perform driver/file system initialisation of the underlying VMA. This has been a source of issues in the past, including a significant security concern relating to unwinding of error state discovered by Jann Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") which performed the recent, significant, rework of mmap() as a whole. However, we have had a fly in the ointment remain - drivers have a great deal of freedom in the .mmap() hook to manipulate VMA state (as well as page table state). This can be problematic, as we can no longer reason sensibly about VMA state once the call is complete (the ability to do - anything - here does rather interfere with that). In addition, callers may choose to do odd or unusual things which might interfere with subsequent steps in the mmap() process, and it may do so and then raise an error, requiring very careful unwinding of state about which we can make no assumptions. Rather than providing such an open-ended interface, this series provides an alternative, far more restrictive one - we expose a whitelist of fields which can be adjusted by the driver, along with immutable state upon which the driver can make such decisions: struct vm_area_desc { /* Immutable state. */ struct mm_struct *mm; unsigned long start; unsigned long end; /* Mutable fields. Populated with initial state. */ pgoff_t pgoff; struct file *file; vm_flags_t vm_flags; pgprot_t page_prot; /* Write-only fields. */ const struct vm_operations_struct *vm_ops; void *private_data; }; The mmap logic then updates the state used to either merge with a VMA or establish a new VMA based upon this logic. This is achieved via new file hook .mmap_prepare(), which is, importantly, invoked very early on in the mmap() process. If an error arises, we can very simply abort the operation with very little unwinding of state required. The existing logic contains another, related, peccadillo - since the .mmap() callback might do anything, it may also cause a previously unmergeable VMA to become mergeable with adjacent VMAs. Right now the logic will retry a merge like this only if the driver changes VMA flags, and changes them in such a way that a merge might succeed (that is, the flags are not 'special', that is do not contain any of the flags specified in VM_SPECIAL). This has also been the source of a great deal of pain - it's hard to reason about an .mmap() callback that might do - anything - but it's also hard to reason about setting up a VMA and writing to the maple tree, only to do it again utilising a great deal of shared state. Since .mmap_prepare() sets fields before the first merge is even attempted, the use of this callback obviates the need for this retry merge logic. A driver may only specify .mmap_prepare() or the deprecated .mmap() callback. In future we may add futher callbacks beyond .mmap_prepare() to faciliate all use cass as we convert drivers. In researching this change, I examined every .mmap() callback, and discovered only a very few that set VMA state in such a way that a. the VMA flags changed and b. this would be mergeable. In the majority of cases, it turns out that drivers are mapping kernel memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other unmergeable VM_SPECIAL flags. Of those that remain I identified a number of cases which are only applicable in DAX, setting the VM_HUGEPAGE flag: * dax_mmap() * erofs_file_mmap() * ext4_file_mmap() * xfs_file_mmap() For this remerge to not occur and to impact users, each of these cases would require a user to mmap() files using DAX, in parts, immediately adjacent to one another. This is a very unlikely usecase and so it does not appear to be worthwhile to adjust this functionality accordingly. We can, however, very quickly do so if needed by simply adding an .mmap_prepare() callback to these as required. There are two further non-DAX cases I idenitfied: * orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing with VM_SEQ_READ. * usb_stream_hwdep_mmap() - Sets VM_DONTDUMP. Both of these cases again seem very unlikely to be mmap()'d immediately adjacent to one another in a fashion that would result in a merge. Finally, we are left with a viable case: * secretmem_mmap() - Set VM_LOCKED, VM_DONTDUMP. This is viable enough that the mm selftests trigger the logic as a matter of course. Therefore, this series replace the .secretmem_mmap() hook with .secret_mmap_prepare(). This patch (of 3): Provide a means by which drivers can specify which fields of those permitted to be changed should be altered to prior to mmap()'ing a range (which may either result from a merge or from mapping an entirely new VMA). Doing so is substantially safer than the existing .mmap() calback which provides unrestricted access to the part-constructed VMA and permits drivers and file systems to do 'creative' things which makes it hard to reason about the state of the VMA after the function returns. The existing .mmap() callback's freedom has caused a great deal of issues, especially in error handling, as unwinding the mmap() state has proven to be non-trivial and caused significant issues in the past, for instance those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour"). It also necessitates a second attempt at merge once the .mmap() callback has completed, which has caused issues in the past, is awkward, adds overhead and is difficult to reason about. The .mmap_prepare() callback eliminates this requirement, as we can update fields prior to even attempting the first merge. It is safer, as we heavily restrict what can actually be modified, and being invoked very early in the mmap() process, error handling can be performed safely with very little unwinding of state required. The .mmap_prepare() and deprecated .mmap() callbacks are mutually exclusive, so we permit only one to be invoked at a time. Update vma userland test stubs to account for changes. Link: https://lkml.kernel.org/r/cover.1746792520.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/adb36a7c4affd7393b2fc4b54cc5cfe211e41f71.1746792520.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-09super: add filesystem freezing helpers for suspend and hibernateChristian Brauner
Allow the power subsystem to support filesystem freeze for suspend and hibernate. For some kernel subsystems it is paramount that they are guaranteed that they are the owner of the freeze to avoid any risk of deadlocks. This is the case for the power subsystem. Enable it to recognize whether it did actually freeze the filesystem. If userspace has 10 filesystems and suspend/hibernate manges to freeze 5 and then fails on the 6th for whatever odd reason (current or future) then power needs to undo the freeze of the first 5 filesystems. It can't just walk the list again because while it's unlikely that a new filesystem got added in the meantime it still cannot tell which filesystems the power subsystem actually managed to get a freeze reference count on that needs to be dropped during thaw. There's various ways out of this ugliness. For example, record the filesystems the power subsystem managed to freeze on a temporary list in the callbacks and then walk that list backwards during thaw to undo the freezing or make sure that the power subsystem just actually exclusively freezes things it can freeze and marking such filesystems as being owned by power for the duration of the suspend or resume cycle. I opted for the latter as that seemed the clean thing to do even if it means more code changes. If hibernation races with filesystem freezing (e.g. DM reconfiguration), then hibernation need not freeze a filesystem because it's already frozen but userspace may thaw the filesystem before hibernation actually happens. If the race happens the other way around, DM reconfiguration may unexpectedly fail with EBUSY. So allow FREEZE_EXCL to nest with other holders. An exclusive freezer cannot be undone by any of the other concurrent freezers. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-07fs: add atomic write unit max opt to statxJohn Garry
XFS will be able to support large atomic writes (atomic write > 1x block) in future. This will be achieved by using different operating methods, depending on the size of the write. Specifically a new method of operation based in FS atomic extent remapping will be supported in addition to the current HW offload-based method. The FS method will generally be appreciably slower performing than the HW-offload method. However the FS method will be typically able to contribute to achieving a larger atomic write unit max limit. XFS will support a hybrid mode, where HW offload method will be used when possible, i.e. HW offload is used when the length of the write is supported, and for other times FS-based atomic writes will be used. As such, there is an atomic write length at which the user may experience appreciably slower performance. Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt. When zero, it means that there is no such performance boundary. Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is ok for older kernels which don't support this new field, as they would report 0 in this field (from zeroing in cp_statx()) already. Furthermore those older kernels don't support large atomic writes - apart from block fops, but there would be consistent performance there for atomic writes in range [unit min, unit max]. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-06kill vfs_submount()Al Viro
The last remaining user of vfs_submount() (tracefs) is easy to convert to fs_context_for_submount(); do that and bury that thing, along with SB_SUBMOUNT Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-06fs: add a write stream field to the kiocbChristoph Hellwig
Prepare for io_uring passthrough of write streams. The write stream field in the kiocb structure fits into an existing 2-byte hole, so its size is not changed. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>