| Age | Commit message (Collapse) | Author |
|
svc_set_num_threads() will set the number of running threads for a given
pool. If the pool argument is set to NULL however, it will distribute
the threads among all of the pools evenly.
These divergent codepaths complicate the move to dynamic threading.
Simplify the API by splitting these two cases into different helpers:
Add a new svc_set_pool_threads() function that sets the number of
threads in a single, given pool. Modify svc_set_num_threads() to
distribute the threads evenly between all of the pools and then call
svc_set_pool_threads() for each.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Fix direct I/O on devices that require stable pages by asking iomap
to bounce buffer. To support this, ioends are used for direct reads
in this case to provide a user context for copying data back from the
bounce buffer.
This fixes qemu when used on devices using T10 protection information
and probably other cases like iSCSI using data digests.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new flag that request bounce buffering for direct I/O. This is
needed to provide the stable pages requirement requested by devices
that need to calculate checksums or parity over the data and allows
file systems to properly work with things like T10 protection
information. The implementation just calls out to the new bio bounce
buffering helpers to allocate a bounce buffer, which is used for
I/O and to copy to/from it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Support using the ioend structure to defer I/O completion for direct
reads in addition to writes. This requires a check for the operation
to not merge reads and writes in iomap_ioend_can_merge. This support
will be used for bounce buffered direct I/O reads that need to copy
data back to the user address space on read completion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Match the more descriptive iov_iter terminology instead of encoding
what we do with them for reads only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are good arguments for processing the user completions ASAP vs.
freeing resources ASAP, but freeing the bio first here removes potential
use after free hazards when checking flags, and will simplify the
upcoming bounce buffer support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Refactor the two per-bio completion handlers to share common code using
a new helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Factor out a separate helper that builds and submits a single bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use iov_iter_count to check if we need to continue as that just reads
a field in the iov_iter, and only use bio_iov_vecs_to_alloc to calculate
the actual number of vectors to allocate for the bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The "if (dio->error)" in iomap_dio_bio_iter exists to stop submitting
more bios when a completion already return an error. Commit cfe057f7db1f
("iomap_dio_actor(): fix iov_iter bugs") made it revert the iov by
"copied", which is very wrong given that we've already consumed that
range and submitted a bio for it.
Fixes: cfe057f7db1f ("iomap_dio_actor(): fix iov_iter bugs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge
xfs: syzbot fixes for online fsck [3/3]
Fix various syzbot complaints about scrub that Jiaming Zhang found.
With a bit of luck, this should all go splendidly.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge
xfs: improve shortform attr performance [2/3]
Improve performance of the xattr (and parent pointer) code when the
attr structure is in short format and we can therefore perform all
updates in a single transaction. Avoiding the attr intent code brings
a very nice speedup in those operations.
With a bit of luck, this should all go splendidly.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge
xfs: fix problems in the attr leaf freemap code [1/3]
Running generic/753 for hours revealed data corruption problems in the
attr leaf block space management code. Under certain circumstances,
freemap entries are left with zero size but a nonzero offset. If that
offset happens to be the same offset as the end of the entries array
during an attr set operation, the leaf entry table expansion will push
the freemap record offset upwards without checking for overlap with any
other freemap entries. If there happened to be a second freemap entry
overlapping with the newly allocated leaf entry space, then the next
attr set operation might find that space and overwrite the leaf entry,
thereby corrupting the leaf block.
Fix this by zeroing the freemap offset any time we set the size to zero.
If a subsequent attr set operation finds no space in the freemap, it
will compact the block and regenerate the freemaps.
With a bit of luck, this should all go splendidly.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge
xfs: autonomous self healing of filesystems [v7]
This patchset builds new functionality to deliver live information about
filesystem health events to userspace. This is done by creating an
anonymous file that can be read() for events by userspace programs.
Events are captured by hooking various parts of XFS and iomap so that
metadata health failures, file I/O errors, and major changes in
filesystem state (unmounts, shutdowns, etc.) can be observed by
programs.
When an event occurs, the hook functions queue an event object to each
event anonfd for later processing. Programs must have CAP_SYS_ADMIN
to open the anonfd and there's a maximum event lag to prevent resource
overconsumption. The events themselves can be read() from the anonfd
as C structs for the xfs_healer daemon.
In userspace, we create a new daemon program that will read the event
objects and initiate repairs automatically. This daemon is managed
entirely by systemd and will not block unmounting of the filesystem
unless repairs are ongoing. They are auto-started by a starter
service that uses fanotify.
This patchset depends on the new fserror code that Christian Brauner
has tentatively accepted for Linux 7.0:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.0.fserror
v7: more cleanups of the media verification ioctl, improve comments, and
reuse the bio
v6: fix pi-breaking bugs, make verify failures trigger health reports
and filter bio status flags better
v5: add verify-media ioctl, collapse small helper funcs with only
one caller
v4: drop multiple client support so we can make direct calls into
healthmon instead of chasing pointers and doing indirect calls
v3: drag out of rfc status
With a bit of luck, this should all go splendidly.
Conflicts:
This merge required an update on files:
- fs/xfs/xfs_healthmon.c
- fs/xfs/xfs_verify_media.c
Such change was required because a parallel developement changed
XFS header file xfs.h naming to xfs_platform.h, so the merge
required to update those includes in both files above
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Linux file names are in principle limited only to PATH_MAX (which is
4096) but the code in get_rock_ridge_filename() limits them to 253
characters.
As mentioned by Jan Kara, the Rockridge standard to ECMA119/ISO9660 has
no limit of file name length, but this limits file names to the
traditional 255 NAME_MAX value.
Signed-off-by: Shawn Landden <slandden@gmail.com>
Link: https://patch.msgid.link/CA+49okq0ouJvAx0=txR_gyNKtZj55p3Zw4MB8jXZsGr4bEGjRA@mail.gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Similar to commit 91ef18b567da ("ext4: mark inodes without acls in
__ext4_iget()"), the ACL state won't be read when the file owner
performs a lookup, and the RCU fast path for lookups won't work
because the ACL state remains unknown.
If there are no extended attributes, or if the xattr filter
indicates that no ACL xattr is present, call cache_no_acl() directly.
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
In the 'DeleteIndexEntryRoot' case of the 'do_action' function, the
entry size ('esize') is retrieved from the log record without adequate
bounds checking.
Specifically, the code calculates the end of the entry ('e2') using:
e2 = Add2Ptr(e1, esize);
It then calculates the size for memmove using 'PtrOffset(e2, ...)',
which subtracts the end pointer from the buffer limit. If 'esize' is
maliciously large, 'e2' exceeds the used buffer size. This results in
a negative offset which, when cast to size_t for memmove, interprets
as a massive unsigned integer, leading to a heap buffer overflow.
This commit adds a check to ensure that the entry size ('esize') strictly
fits within the remaining used space of the index header before performing
memory operations.
Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
Corrupted FAT images can leave a directory inode with an incorrect
i_nlink (e.g. 2 even though subdirectories exist). rmdir then
unconditionally calls drop_nlink(dir) and can drive i_nlink to 0,
triggering the WARN_ON in drop_nlink().
Add a sanity check in vfat_rmdir() and msdos_rmdir(): only drop the
parent link count when it is at least 3, otherwise report a filesystem
error.
Link: https://lkml.kernel.org/r/20260101111148.1437-1-zhiyuzhang999@gmail.com
Fixes: 9a53c3a783c2 ("[PATCH] r/o bind mounts: unlink: monitor i_nlink")
Signed-off-by: Zhiyu Zhang <zhiyuzhang999@gmail.com>
Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com>
Closes: https://lore.kernel.org/linux-fsdevel/aVN06OKsKxZe6-Kv@casper.infradead.org/T/#t
Tested-by: Zhiyu Zhang <zhiyuzhang999@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a check to verify the group descriptor has enough free bits before
attempting allocation in ocfs2_move_extent(). This prevents a kernel
BUG_ON crash in ocfs2_block_group_set_bits() when the move_extents ioctl
is called on a crafted or corrupted filesystem.
The existing validation in ocfs2_validate_gd_self() only checks static
metadata consistency (bg_free_bits_count <= bg_bits) when the descriptor
is first read from disk. However, during move_extents operations,
multiple allocations can exhaust the free bits count below the requested
allocation size, triggering BUG_ON(le16_to_cpu(bg->bg_free_bits_count) <
num_bits).
The debug trace shows the issue clearly:
- Block group 32 validated with bg_free_bits_count=427
- Repeated allocations decreased count: 427 -> 171 -> 43 -> ... -> 1
- Final request for 2 bits with only 1 available triggers BUG_ON
By adding an early check in ocfs2_move_extent() right after
ocfs2_find_victim_alloc_group(), we return -ENOSPC gracefully instead of
crashing the kernel. This also avoids unnecessary work in
ocfs2_probe_alloc_group() and __ocfs2_move_extent() when the allocation
will fail.
Link: https://lkml.kernel.org/r/20260104133504.14810-1-kartikey406@gmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+7960178e777909060224@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7960178e777909060224
Link: https://lore.kernel.org/all/20251231115801.293726-1-kartikey406@gmail.com/T/ [v1]
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is no function dlm_mast_regions(). However, dlm_match_regions() is
passed the buffer "local", which it uses internally, so it seems like
dlm_match_regions() was intended.
Link: https://lkml.kernel.org/r/20251230142513.95467-1-Julia.Lawall@inria.fr
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It's rare case that sync_inodes_sb() always skips to flush some drity
datas, so it's enough to give extra three more chances to flush data.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Under stress tests with frequent metadata operations, checkpoint write
time can become excessively long. Analysis shows that the slowdown is
caused by synchronous, one-by-one reads of NAT blocks during checkpoint
processing.
The issue can be reproduced with the following workload:
1. seq 1 650000 | xargs -P 16 -n 1 touch
2. sync # avoid checkpoint write during deleting
3. delete 1 file every 455 files
4. echo 3 > /proc/sys/vm/drop_caches
5. sync # trigger checkpoint write
This patch submits read I/O for all NAT blocks required in the
__flush_nat_entry_set() phase in advance, reducing the overhead of
synchronous waiting for individual NAT block reads.
The NAT block flush latency before and after the change is as below:
| |NAT blocks accessed|NAT blocks read|Flush time (ms)|
|-------------|-------------------|---------------|---------------|
|Before change|1205 |1191 |158 |
|After change |1264 |1242 |11 |
With a similar number of NAT blocks accessed and read from disk, adding
NAT block readahead reduces the total NAT block flush time by more than
90%.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
All callers of __has_cursum_space() pass an unsigned int value as the
size parameter. Change the parameter type to unsigned int accordingly.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch adds separate write latency accounting for NAT and SIT blocks
in f2fs_write_checkpoint().
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
For pinned files, the file mapping is already established before
writing, and since the writes are in IPU, there is no need to acquire
the sbi->writepages lock to guarantee write ordering.
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Commit d36de29f4bb5 ("f2fs: sysfs: introduce inject_lock_timeout")
introduces a bug as below, fix it.
cat /sys/fs/f2fs/vdx/inject_lock_timeout
s/fs/f2fs/vdx/inject_lock_timeout: Invalid argument
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In order to simulate skipped write during enable_checkpoint().
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch introduces sbi->nr_pages[F2FS_SKIPPED_WRITE] to record any
skipped write during data flush in f2fs_enable_checkpoint().
So in the loop of data flush, if there is any skipped write in previous
flush, let's retry sync_inode_sb(), otherwise, all dirty data written
before f2fs_enable_checkpoint() should have been persisted, then break
the retry loop.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix the the buggy conversion of fuse_reverse_inval_entry() introduced
during the creation rework
- Disallow nfs delegation requests for directories by setting
simple_nosetlease()
- Require an opt-in for getting readdir flag bits outside of S_DT_MASK
set in d_type
- Fix scheduling delayed writeback work by only scheduling when the
dirty time expiry interval is non-zero and cancel the delayed work if
the interval is set to zero
- Use rounded_jiffies_interval for dirty time work
- Check the return value of sb_set_blocksize() for romfs
- Wait for batched folios to be stable in __iomap_get_folio()
- Use private naming for fuse hash size
- Fix the stale dentry cleanup to prevent a race that causes a UAF
* tag 'vfs-6.19-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
vfs: document d_dispose_if_unused()
fuse: shrink once after all buckets have been scanned
fuse: clean up fuse_dentry_tree_work()
fuse: add need_resched() before unlocking bucket
fuse: make sure dentry is evicted if stale
fuse: fix race when disposing stale dentries
fuse: use private naming for fuse hash size
writeback: use round_jiffies_relative for dirtytime_work
iomap: wait for batched folios to be stable in __iomap_get_folio
romfs: check sb_set_blocksize() return value
docs: clarify that dirtytime_expire_seconds=0 disables writeback
writeback: fix 100% CPU usage when dirtytime_expire_interval is 0
readdir: require opt-in for d_type flags
vboxsf: don't allow delegations to be set on directories
ceph: don't allow delegations to be set on directories
gfs2: don't allow delegations to be set on directories
9p: don't allow delegations to be set on directories
smb/client: properly disallow delegations on directories
nfs: properly disallow delegation requests on directories
fuse: fix conversion of fuse_reverse_inval_entry() to start_removing()
|
|
XDR enum decoders generated by xdrgen do not verify that incoming
values are valid members of the enum. Incoming out-of-range values
from malicious or buggy peers propagate through the system
unchecked.
Add validation logic to generated enum decoders using a switch
statement that explicitly lists valid enumerator values. The
compiler optimizes this to a simple range check when enum values
are dense (contiguous), while correctly rejecting invalid values
for sparse enums with gaps in their value ranges.
The --no-enum-validation option on the source subcommand disables
this validation when not needed.
The minimum and maximum fields in _XdrEnum, which were previously
unused placeholders for a range-based validation approach, have
been removed since the switch-based validation handles both dense
and sparse enums correctly.
Because the new mechanism results in substantive changes to
generated code, existing .x files are regenerated. Unrelated white
space and semicolon changes in the generated code are due to recent
commit 1c873a2fd110 ("xdrgen: Don't generate unnecessary semicolon")
and commit 38c4df91242b ("xdrgen: Address some checkpatch whitespace
complaints").
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
idmap lookups can time out while the cache is waiting for a userspace
upcall reply. In that case cache_check() returns -ETIMEDOUT to callers.
The nfsd_map_name_to_[ug]id functions currently proceed with attempting
to map the id to a kuid despite a potentially temporary failure to
perform the idmap lookup. This results in the code returning the error
NFSERR_BADOWNER which can cause client operations to return to userspace
with failure.
Fix this by returning the failure status before attempting kuid mapping.
This will return NFSERR_JUKEBOX on idmap lookup timeout so that clients
can retry the operation instead of aborting it.
Fixes: 65e10f6d0ab0 ("nfsd: Convert idmap to use kuids and kgids")
Cc: stable@vger.kernel.org
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
During v4 request compound arg decoding, some ops (e.g. SETATTR)
can trigger idmap lookup upcalls. When those upcall responses get
delayed beyond the allowed time limit, cache_check() will mark the
request for deferral and cause it to be dropped.
This prevents nfs4svc_encode_compoundres from being executed, and
thus the session slot flag NFSD4_SLOT_INUSE never gets cleared.
Subsequent client requests will fail with NFSERR_JUKEBOX, given
that the slot will be marked as in-use, making the SEQUENCE op
fail.
Fix this by making sure that the RQ_USEDEFERRAL flag is always
clear during nfs4svc_decode_compoundargs(), since no v4 request
should ever be deferred.
Fixes: 2f425878b6a7 ("nfsd: don't use the deferral service, return NFS4ERR_DELAY")
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
fstests generic/215 and generic/407 were failing because the server
wasn't updating mtime properly. When deleg attribute support is not
compiled in and thus no attribute delegation was given, the server
was skipping updating mtime and ctime because FMODE_NOCMTIME was
uncoditionally set for the write delegation.
Fixes: e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation")
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Claude pointed out that there is a nfs4_file refcount leak in
nfsd_get_dir_deleg(). Ensure that the reference to "fp" is released
before returning.
Fixes: 8b99f6a8c116 ("nfsd: wire up GET_DIR_DELEGATION handling")
Cc: stable@vger.kernel.org
Cc: Chris Mason <clm@meta.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
"nfsd: provide locking for v4_end_grace" introduced a
client_tracking_active flag protected by nn->client_lock to prevent
the laundromat from being scheduled before client tracking
initialization or after shutdown begins. That commit is suitable for
backporting to LTS kernels that predate commit 86898fa6b8cd
("workqueue: Implement disable/enable for (delayed) work items").
However, the workqueue subsystem in recent kernels provides
enable_delayed_work() and disable_delayed_work_sync() for this
purpose. Using this mechanism enable us to remove the
client_tracking_active flag and associated spinlock operations
while preserving the same synchronization guarantees, which is
a cleaner long-term approach.
Signed-off-by: NeilBrown <neil@brown.name>
Tested-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
A documenting comment in include/uapi/linux/nfs.h claims incorrectly
that NFSv2 defines NFSERR_INVAL. There is no such definition in either
RFC 1094 or https://pubs.opengroup.org/onlinepubs/9629799/chap7.htm
NFS3ERR_INVAL is introduced in RFC 1813.
NFSD returns NFSERR_INVAL for PROC_GETACL, which has no
specification (yet).
However, nfsd_map_status() maps nfserr_symlink and nfserr_wrong_type
to nfserr_inval, which does not align with RFC 1094. This logic was
introduced only recently by commit 438f81e0e92a ("nfsd: move error
choice for incorrect object types to version-specific code."). Given
that we have no INVAL or SERVERFAULT status in NFSv2, probably the
only choice is NFSERR_IO.
Fixes: 438f81e0e92a ("nfsd: move error choice for incorrect object types to version-specific code.")
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Make it distinct that this message comes from nfsd.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
FILE_LOCK_DEFERRED can be returned when creating or removing a lock, but
not when testing for a lock. This support was explicitly removed in
Commit 09802fd2a8ca ("lockd: rip out deferred lock handling from testlock codepath")
However the test in nlmsvc_testlock() suggests that it *can* be returned,
only nlm cannot handle it.
To aid clarity, remove the test and instead put a similar test and
warning in vfs_test_lock(). If the impossible happens, convert
FILE_LOCK_DEFERRED to -EIO.
Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
xdrgen requires a number of Python packages on the build system. We
don't want to add these to the kernel build dependency list, which
is long enough already.
The generated files are generated manually using
$ cd fs/nfsd && make xdrgen
whenever the .x files are modified, then they are checked into the
kernel repo so others do not need to rebuild them.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The @rqstp parameter was introduced in commit 3c8e03166ae2 ("NFSv4:
do exact check about attribute specified") but has never been used.
Reduce indentation in callers to improve readability.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Commit a475c5dd16e5 ("gfs2: Free quota data objects synchronously")
started freeing quota data objects during filesystem shutdown instead of
putting them back onto the LRU list, but it failed to remove these
objects from the LRU list, causing LRU list corruption. This caused
use-after-free when the shrinker (gfs2_qd_shrink_scan) tried to access
already-freed objects on the LRU list.
Fix this by removing qd objects from the LRU list before freeing them in
qd_put().
Initial fix from Deepanshu Kartikey <kartikey406@gmail.com>.
Fixes: a475c5dd16e5 ("gfs2: Free quota data objects synchronously")
Reported-by: syzbot+046b605f01802054bff0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=046b605f01802054bff0
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Introduce glock_type(), glock_number(), and glock_sbd() helpers for
accessing a glock's type, number, and super block pointer more easily.
Created with Coccinelle using the following semantic patch:
@@ struct gfs2_glock *gl; @@
- gl->gl_name.ln_type
+ glock_type(gl)
@@ struct gfs2_glock *gl; @@
- gl->gl_name.ln_number
+ glock_number(gl)
@@ struct gfs2_glock *gl; @@
- gl->gl_name.ln_sbd
+ glock_sbd(gl)
glock_sbd() is a macro because it is used with const as well as
non-const struct gfs2_glock * arguments.
Instances in macro definitions, particularly in tracepoint definitions,
replaced by hand.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Use lockref_get_not_dead() instead of an unguarded __lockref_is_dead()
check.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
GL_GLOCK_MIN_HOLD represents the minimum time (in jiffies) that a glock
should be held before being eligible for release. It is currently
defined as 10, meaning that the duration depends on the timer interrupt
frequency (CONFIG_HZ). Change that time to a constant 10ms independent
of CONFIG_HZ. On CONFIG_HZ=1000 systems, the value remains the same.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Fix the type of gfs2_log_get_bio()'s op argument: callers pass in a
blk_opf_t value and the function passes that value on as a blk_opf_t
value, so the current argument type makes no sense.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Pass the start sector into gfs2_chain_bio(): the new bio isn't
necessarily contiguous with the previous one.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Pass the right blk_opf_t value to bio_alloc() so that ->bi_ops is
initialized correctly and doesn't have to be changed later. Adjust the
call chain to pass that value through to where it is needed (and only
there).
Add a separate blk_opf_t argument to gfs2_chain_bio() instead of copying
the value from the previous bio.
Fixes: 8a157e0a0aa5 ("gfs2: Fix use of bio_chain")
Reported-by: syzbot+f6539d4ce3f775aee0cc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f6539d4ce3f775aee0cc
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Rename gfs2_log_submit_bio() to gfs2_log_submit_write(): this function
isn't used for submitting log reads.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Trying to change the state of a glock may result in a "conversion
deadlock" error, indicating that the requested state transition would
cause a deadlock. In this case, we unlock and retry the state change.
It makes no sense to try canceling those unlock requests, but the
current code is not aware of that.
In addition, if a locking request is canceled shortly after it is made,
the cancelation request can currently overtake the locking request.
This may result in deadlocks.
Fix both of these bugs by repurposing the GLF_PENDING_REPLY flag into a
GLF_MAY_CANCEL flag which is set only when the current locking request
can be canceled. When trying to cancel a locking request in
gfs2_glock_dq(), wait for this flag to be set.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In run_queue(), instead of always setting the GLF_LOCK flag, only set it
when the flag is actually needed. This avoids having to undo the flag
setting later.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|