| Age | Commit message (Collapse) | Author |
|
commit c2c185be5c85d37215397c8e8781abf0a69bec1f upstream.
There's a gap between when the buffer was grabbed and when it
potentially gets recycled, where if the list is empty, someone could've
upgraded it to a ring provided type. This can happen if the request
is forced via io-wq. The legacy recycling is missing checking if the
buffer_list still exists, and if it's of the correct type. Add those
checks.
Cc: stable@vger.kernel.org
Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c36e28becd0586ac98318fd335e5e91d19cd2623 upstream.
IORING_SEND_VECTORIZED with registered buffers is not implemented but
could be. Don't silently ignore the flag in this case but reject it with
an error. It only affects sendzc as normal sends don't support
registered buffers.
Fixes: 6f02527729bd3 ("io_uring/net: Allow to do vectorized send")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 531bb98a030cc1073bd7ed9a502c0a3a781e92ee upstream.
Refill queue entries are shared with the user space, use READ_ONCE when
reading them.
Fixes: 34a3e60821ab9 ("io_uring/zcrx: implement zerocopy receive pp memory provider");
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 003049b1c4fb8aabb93febb7d1e49004f6ad653b ]
The io_zcrx_put_niov_uref() function uses a non-atomic
check-then-decrement pattern (atomic_read followed by separate
atomic_dec) to manipulate user_refs. This is serialized against other
callers by rq_lock, but io_zcrx_scrub() modifies the same counter with
atomic_xchg() WITHOUT holding rq_lock.
On SMP systems, the following race exists:
CPU0 (refill, holds rq_lock) CPU1 (scrub, no rq_lock)
put_niov_uref:
atomic_read(uref) - 1
// window opens
atomic_xchg(uref, 0) - 1
return_niov_freelist(niov) [PUSH #1]
// window closes
atomic_dec(uref) - wraps to -1
returns true
return_niov(niov)
return_niov_freelist(niov) [PUSH #2: DOUBLE-FREE]
The same niov is pushed to the freelist twice, causing free_count to
exceed nr_iovs. Subsequent freelist pushes then perform an out-of-bounds
write (a u32 value) past the kvmalloc'd freelist array into the adjacent
slab object.
Fix this by replacing the non-atomic read-then-dec in
io_zcrx_put_niov_uref() with an atomic_try_cmpxchg loop that atomically
tests and decrements user_refs. This makes the operation safe against
concurrent atomic_xchg from scrub without requiring scrub to acquire
rq_lock.
Fixes: 34a3e60821ab ("io_uring/zcrx: implement zerocopy receive pp memory provider")
Cc: stable@vger.kernel.org
Signed-off-by: Kai Aizen <kai@snailsploit.com>
[pavel: removed a warning and a comment]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 600b665b903733bd60334e86031b157cc823ee55 ]
Attempting SOCKET_URING_OP_SETSOCKOPT on an AF_NETLINK socket resulted
in an -EOPNOTSUPP, as AF_NETLINK doesn't have an ioctl in its struct
proto, but only in struct proto_ops.
Prior to the blamed commit, io_uring_cmd_sock() only had two cmd_op
operations, both requiring ioctl, thus the check was warranted.
Since then, 4 new cmd_op operations have been added, none of which
depend on ioctl. This patch moves the ioctl check, so it only applies
to the original operations.
AFAICT, the ioctl requirement was unintentional, and it wasn't
visible in the blamed patch within 3 lines of context.
Cc: stable@vger.kernel.org
Fixes: a5d2f99aff6b ("io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT")
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a983aae397767e9da931128ff2b5bf9066513ce3 ]
In an unlikely case when io_populate_area_dma() fails, which could only
happen on a PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA machine,
io_zcrx_map_area() will have an initialised and not freed table. It was
supposed to be cleaned up in the error path, but !is_mapped prevents
that.
Fixes: 439a98b972fbb ("io_uring/zcrx: deduplicate area mapping")
Cc: stable@vger.kernel.org
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f4d0668b38d8784f33a9a36c72ed5d0078247538 ]
__io_fixed_fd_install() returns 0 on success for non-alloc mode
(specific slot), not the slot index. io_pipe_fixed() used this return
value directly as the slot index in fds[], which can cause the reported
values returned via copy_to_user() to be incorrect, or the error path
operating on the incorrect direct descriptor.
Fix by computing the actual 0-based slot index (slot - 1) for specific
slot mode, while preserving the existing behavior for auto-alloc mode
where __io_fixed_fd_install() already returns the allocated index.
Cc: stable@vger.kernel.org
Fixes: 53db8a71ecb4 ("io_uring: add support for IORING_OP_PIPE")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a6bded921ed35f21b3f6bd8e629bf488499ca442 ]
Explicit fixed file install/remove operations on slots outside the
configured alloc range can corrupt alloc_hint via io_file_bitmap_set()
and io_file_bitmap_clear(), which unconditionally update alloc_hint to
the bit position. This causes subsequent auto-allocations to fall
outside the configured range.
For example, if the alloc range is [10, 20) and a file is removed at
slot 2, alloc_hint gets set to 2. The next auto-alloc then starts
searching from slot 2, potentially returning a slot below the range.
Fix this by clamping alloc_hint to [file_alloc_start, file_alloc_end)
at the top of io_file_bitmap_get() before starting the search.
Cc: stable@vger.kernel.org
Fixes: 6e73dffbb93c ("io_uring: let to set a range for file slot allocation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 806ae939c41e5da1d94a1e2b31f5702e96b6c3e3 ]
If a send bundle has picked a bunch of buffers, then it needs to send
all of those to be complete. This may require poll arming, if the send
buffer ends up being full. Once a send bundle has been poll armed, no
further bundles should be attempted.
This allows a current bundle to complete even though it needs to go
through polling to do so, but it will not allow another bundle to be
started once that has happened. Ideally we would abort a bundle if it
was only partially sent, but as some parts of it already went out on the
wire, this obviously isn't feasible. Not continuing more bundle attempts
post encountering a full socket buffer is the second best thing.
Cc: stable@vger.kernel.org
Fixes: a05d1f625c7a ("io_uring/net: support bundles for send")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 42b12cb5fd4554679bac06bbdd05dc8b643bcc42 ]
syzbot correctly reports this as a KCSAN race, as ctx->cached_cq_tail
should be read under ->uring_lock. This isn't immediately feasible in
io_flush_timeouts(), but as long as we read a stable value, that should
be good enough. If two io-wq threads compete on this value, then they
will both end up calling io_flush_timeouts() and at least one of them
will see the correct value.
Reported-by: syzbot+6c48db7d94402407301e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit b8201b50e403815f941d1c6581a27fdbfe7d0fd4 upstream.
Get rid of some redundant checks and move the src arg validation to
before the buffer table allocation, which simplifies error handling.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 22dbb0987bd1e0ec3b1e4ad20756a98f99aa4a08 ]
By having them share the same space in struct io_cancel_data, it ends up
disallowing IORING_ASYNC_CANCEL_FD|IORING_ASYNC_CANCEL_USERDATA from
working. Eg you cannot match on both a file and user_data for
cancelation purposes. This obviously isn't a common use case as nobody
has reported this, but it does result in -ENOENT potentially being
returned when trying to match on both, rather than actually doing what
the API says it would.
Fixes: 4bf94615b888 ("io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 56112578c71213a10c995a56835bddb5e9ab1ed0 ]
io_key_has_sqarray static branch can be easily switched on/off by the
user every time patching the kernel. That can be very disruptive as it
might require heavy synchronisation across all CPUs. Use deferred static
keys, which can rate-limit it by deferring, batching and potentially
effectively eliminating dec+inc pairs.
Fixes: 9b296c625ac1d ("io_uring: static_key for !IORING_SETUP_NO_SQARRAY")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 442ae406603a94f1a263654494f425302ceb0445 ]
io_register_pbuf_ring() ignores the return value of io_buffer_add_list(),
which can fail if xa_store() returns an error (e.g., -ENOMEM). When this
happens, the function returns 0 (success) to the caller, but the
io_buffer_list structure is neither added to the xarray nor freed.
In practice this requires failure injection to hit, hence not a real
issue. But it should get fixed up none the less.
Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 649dd18f559891bdafc5532d737c7dfb56060a6d ]
Check if the passed in offset is negative once cast to sync->off. This
ensures that -EINVAL is returned for that case, like it would be for
sync_file_range(2).
Fixes: c992fe2925d7 ("io_uring: add fsync support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7a8737e1132ff07ca225aa7a4008f87319b5b1ca ]
io_uring_enter(), __io_msg_ring_data(), and io_msg_send_fd() read
ctx->flags and ctx->submitter_task without holding the ctx's uring_lock.
This means they may race with the assignment to ctx->submitter_task and
the clearing of IORING_SETUP_R_DISABLED from ctx->flags in
io_register_enable_rings(). Ensure the correct ordering of the
ctx->flags and ctx->submitter_task memory accesses by storing to
ctx->flags using release ordering and loading it using acquire ordering.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 4add705e4eeb ("io_uring: remove io_register_submitter")
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 38cfdd9dd279473a73814df9fd7e6e716951d361 ]
Add cond_resched() in those dump loops, just in case a lot of entries
are being dumped. And detect invalid CQ ring head/tail entries, to avoid
iterating more than what is necessary. Generally not an issue, but can be
if things like KASAN or other debugging metrics are enabled.
Reported-by: 是参差 <shicenci@gmail.com>
Link: https://lore.kernel.org/all/PS1PPF7E1D7501FE5631002D242DD89403FAB9BA@PS1PPF7E1D7501F.apcprd02.prod.outlook.com/
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 91214661489467f8452d34edbf257488d85176e4 upstream.
io_uring keeps a per-task io-wq around, even when the task no longer has
any io_uring instances.
If the task previously used io_uring for file I/O, this can leave an
unrelated iou-wrk-* worker thread behind after the last io_uring
instance is gone.
When the last io_uring ctx is removed from the task context, mark the
io-wq exit-on-idle so workers can go away. Clear the flag on subsequent
io_uring usage.
Signed-off-by: Li Chen <me@linux.beauty>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 38aa434ab9335ce2d178b7538cdf01d60b2014c3 upstream.
io-wq uses an idle timeout to shrink the pool, but keeps the last worker
around indefinitely to avoid churn.
For tasks that used io_uring for file I/O and then stop using io_uring,
this can leave an iou-wrk-* thread behind even after all io_uring
instances are gone. This is unnecessary overhead and also gets in the
way of process checkpoint/restore.
Add an exit-on-idle state that makes all io-wq workers exit as soon as
they become idle, and provide io_wq_set_exit_on_idle() to toggle it.
Signed-off-by: Li Chen <me@linux.beauty>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0ae91d8ab70922fb74c22c20bedcb69459579b1c ]
d9f595b9a65e ("io_uring/zcrx: fix leaking pages on sg init fail") fixed
a page leakage but didn't free the page array, release it as well.
Fixes: b84621d96ee02 ("io_uring/zcrx: allocate sgtable for umem areas")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4b9748055457ac3a0710bf210c229d01ea1b01b9 ]
If a read/write request goes through io_req_rw_cleanup() and has an
allocated iovec attached and fails to put to the rw_cache, then it may
end up with an unaccounted iovec pointer. Have io_rw_recycle() return
whether it recycled the request or not, and use that to gauge whether to
free a potential iovec or not.
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fc5ff2500976cd2710a7acecffd12d95ee4f98fc ]
Allocate the overflowing CQE with GFP_NOWAIT instead of GFP_ATOMIC. This
changes causes allocations to fail earlier in out-of-memory situations,
rather than being deferred. Using GFP_ATOMIC allows a process to exceed
memory limits.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220794
Signed-off-by: Alexandre Negrel <alexandre@negrel.dev>
Link: https://lore.kernel.org/io-uring/20251229201933.515797-1-alexandre@negrel.dev/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 10dc959398175736e495f71c771f8641e1ca1907 upstream.
Currently this is checked before running the pending work. Normally this
is quite fine, as work items either end up blocking (which will create a
new worker for other items), or they complete fairly quickly. But syzbot
reports an issue where io-wq takes seemingly forever to exit, and with a
bit of debugging, this turns out to be because it queues a bunch of big
(2GB - 4096b) reads with a /dev/msr* file. Since this file type doesn't
support ->read_iter(), loop_rw_iter() ends up handling them. Each read
returns 16MB of data read, which takes 20 (!!) seconds. With a bunch of
these pending, processing the whole chain can take a long time. Easily
longer than the syzbot uninterruptible sleep timeout of 140 seconds.
This then triggers a complaint off the io-wq exit path:
INFO: task syz.4.135:6326 blocked for more than 143 seconds.
Not tainted syzkaller #0
Blocked by coredump.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.4.135 state:D stack:26824 pid:6326 tgid:6324 ppid:5957 task_flags:0x400548 flags:0x00080000
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5256 [inline]
__schedule+0x1139/0x6150 kernel/sched/core.c:6863
__schedule_loop kernel/sched/core.c:6945 [inline]
schedule+0xe7/0x3a0 kernel/sched/core.c:6960
schedule_timeout+0x257/0x290 kernel/time/sleep_timeout.c:75
do_wait_for_common kernel/sched/completion.c:100 [inline]
__wait_for_common+0x2fc/0x4e0 kernel/sched/completion.c:121
io_wq_exit_workers io_uring/io-wq.c:1328 [inline]
io_wq_put_and_exit+0x271/0x8a0 io_uring/io-wq.c:1356
io_uring_clean_tctx+0x10d/0x190 io_uring/tctx.c:203
io_uring_cancel_generic+0x69c/0x9a0 io_uring/cancel.c:651
io_uring_files_cancel include/linux/io_uring.h:19 [inline]
do_exit+0x2ce/0x2bd0 kernel/exit.c:911
do_group_exit+0xd3/0x2a0 kernel/exit.c:1112
get_signal+0x2671/0x26d0 kernel/signal.c:3034
arch_do_signal_or_restart+0x8f/0x7e0 arch/x86/kernel/signal.c:337
__exit_to_user_mode_loop kernel/entry/common.c:41 [inline]
exit_to_user_mode_loop+0x8c/0x540 kernel/entry/common.c:75
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline]
do_syscall_64+0x4ee/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa02738f749
RSP: 002b:00007fa0281ae0e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 00007fa0275e6098 RCX: 00007fa02738f749
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fa0275e6098
RBP: 00007fa0275e6090 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa0275e6128 R14: 00007fff14e4fcb0 R15: 00007fff14e4fd98
There's really nothing wrong here, outside of processing these reads
will take a LONG time. However, we can speed up the exit by checking the
IO_WQ_BIT_EXIT inside the io_worker_handle_work() loop, as syzbot will
exit the ring after queueing up all of these reads. Then once the first
item is processed, io-wq will simply cancel the rest. That should avoid
syzbot running into this complaint again.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/68a2decc.050a0220.e29e5.0099.GAE@google.com/
Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit da579f05ef0faada3559e7faddf761c75cdf85e1 upstream.
With IORING_SETUP_DEFER_TASKRUN, task work is queued to ctx->work_llist
(local work) rather than the fallback list. During io_ring_exit_work(),
io_move_task_work_from_local() was called once before the cancel loop,
moving work from work_llist to fallback_llist.
However, task work can be added to work_llist during the cancel loop
itself. There are two cases:
1) io_kill_timeouts() is called from io_uring_try_cancel_requests() to
cancel pending timeouts, and it adds task work via io_req_queue_tw_complete()
for each cancelled timeout:
2) URING_CMD requests like ublk can be completed via
io_uring_cmd_complete_in_task() from ublk_queue_rq() during canceling,
given ublk request queue is only quiesced when canceling the 1st uring_cmd.
Since io_allowed_defer_tw_run() returns false in io_ring_exit_work()
(kworker != submitter_task), io_run_local_work() is never invoked,
and the work_llist entries are never processed. This causes
io_uring_try_cancel_requests() to loop indefinitely, resulting in
100% CPU usage in kworker threads.
Fix this by moving io_move_task_work_from_local() inside the cancel
loop, ensuring any work on work_llist is moved to fallback before
each cancel attempt.
Cc: stable@vger.kernel.org
Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e0392a10c9e80a3991855a81317da3039fcbe32c upstream.
A previous commit added this helper, and had it terminate if false is
returned from the handler. However, that is completely opposite, it
should abort the loop if true is returned.
Fix this up by having io_wq_for_each_worker() keep iterating as long
as false is returned, and only abort if true is returned.
Cc: stable@vger.kernel.org
Fixes: 751eedc4b4b7 ("io_uring/io-wq: move worker lists to struct io_wq_acct")
Reported-by: Lewis Campbell <info@lewiscampbell.tech>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 525916ce496615f531091855604eab9ca573b195 upstream.
When cloning with node replacements (IORING_REGISTER_DST_REPLACE),
destination entries after the cloned range are not copied over.
Add logic to copy them over to the new destination table.
Fixes: c1329532d5aa ("io_uring/rsrc: allow cloning with node replacements")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b14fad555302a2104948feaff70503b64c80ac01 upstream.
__io_openat_prep() allocates a struct filename using getname(). However,
for the condition of the file being installed in the fixed file table as
well as having O_CLOEXEC flag set, the function returns early. At that
point, the request doesn't have REQ_F_NEED_CLEANUP flag set. Due to this,
the memory for the newly allocated struct filename is not cleaned up,
causing a memory leak.
Fix this by setting the REQ_F_NEED_CLEANUP for the request just after the
successful getname() call, so that when the request is torn down, the
filename will be cleaned up, along with other resources needing cleanup.
Reported-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=00e61c43eb5e4740438f
Tested-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com>
Fixes: b9445598d8c6 ("io_uring: openat directly into fixed fd table")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e15cb2200b934e507273510ba6bc747d5cde24a3 upstream.
Using min_wait, two timeouts are given:
1) The min_wait timeout, within which up to 'wait_nr' events are
waited for.
2) The overall long timeout, which is entered if no events are generated
in the min_wait window.
If the min_wait has expired, any event being posted must wake the task.
For SQPOLL, that isn't the case, as it won't trigger the io_has_work()
condition, as it will have already processed the task_work that happened
when an event was posted. This causes any event to trigger post the
min_wait to not always cause the waiting application to wakeup, and
instead it will wait until the overall timeout has expired. This can be
shown in a test case that has a 1 second min_wait, with a 5 second
overall wait, even if an event triggers after 1.5 seconds:
axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring
info: MIN_TIMEOUT supported: true, features: 0x3ffff
info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4
info: 1 cqes in 5000.2ms
where the expected result should be:
axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring
info: MIN_TIMEOUT supported: true, features: 0x3ffff
info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4
info: 1 cqes in 1500.3ms
When the min_wait timeout triggers, reset the number of completions
needed to wake the task. This should ensure that any future events will
wake the task, regardless of how many events it originally wanted to
wait for.
Reported-by: Tip ten Brink <tip@tenbrinkmeijs.com>
Cc: stable@vger.kernel.org
Fixes: 1100c4a2656d ("io_uring: add support for batch wait timeout")
Link: https://github.com/axboe/liburing/issues/1477
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 84230ad2d2afbf0c44c32967e525c0ad92e26b4e upstream.
When the core of io_uring was updated to handle completions
consistently and with fixed return codes, the POLL_REMOVE opcode
with updates got slightly broken. If a POLL_ADD is pending and
then POLL_REMOVE is used to update the events of that request, if that
update causes the POLL_ADD to now trigger, then that completion is lost
and a CQE is never posted.
Additionally, ensure that if an update does cause an existing POLL_ADD
to complete, that the completion value isn't always overwritten with
-ECANCELED. For that case, whatever io_poll_add() set the value to
should just be retained.
Cc: stable@vger.kernel.org
Fixes: 97b388d70b53 ("io_uring: handle completions in the core")
Reported-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com
Tested-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 114ea9bbaf7681c4d363e13b7916e6fef6a4963a ]
io_import_kbuf() calculates nr_segs incorrectly when iov_offset is
non-zero after iov_iter_advance(). It doesn't account for the partial
consumption of the first bvec.
The problem comes when meet the following conditions:
1. Use UBLK_F_AUTO_BUF_REG feature of ublk.
2. The kernel will help to register the buffer, into the io uring.
3. Later, the ublk server try to send IO request using the registered
buffer in the io uring, to read/write to fuse-based filesystem, with
O_DIRECT.
>From a userspace perspective, the ublk server thread is blocked in the
kernel, and will see "soft lockup" in the kernel dmesg.
When ublk registers a buffer with mixed-size bvecs like [4K]*6 + [12K]
and a request partially consumes a bvec, the next request's nr_segs
calculation uses bvec->bv_len instead of (bv_len - iov_offset).
This causes fuse_get_user_pages() to loop forever because nr_segs
indicates fewer pages than actually needed.
Specifically, the infinite loop happens at:
fuse_get_user_pages()
-> iov_iter_extract_pages()
-> iov_iter_extract_bvec_pages()
Since the nr_segs is miscalculated, the iov_iter_extract_bvec_pages
returns when finding that i->nr_segs is zero. Then
iov_iter_extract_pages returns zero. However, fuse_get_user_pages does
still not get enough data/pages, causing infinite loop.
Example:
- Bvecs: [4K, 4K, 4K, 4K, 4K, 4K, 12K, ...]
- Request 1: 32K at offset 0, uses 6*4K + 8K of the 12K bvec
- Request 2: 32K at offset 32K
- iov_offset = 8K (8K already consumed from 12K bvec)
- Bug: calculates using 12K, not (12K - 8K) = 4K
- Result: nr_segs too small, infinite loop in fuse_get_user_pages.
Fix by accounting for iov_offset when calculating the first segment's
available length.
Fixes: b419bed4f0a6 ("io_uring/rsrc: ensure segments counts are correct on kbuf buffers")
Signed-off-by: huang-jl <huang-jl@deepseek.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 78385c7299f7514697d196b3233a91bd5e485591 ]
The struct io_uring_buf elements in a buffer ring are in a memory region
accessible from userspace. A malicious/buggy userspace program could
therefore write to them at any time, so they should be accessed with
READ_ONCE() in the kernel. Commit 98b6fa62c84f ("io_uring/kbuf: always
use READ_ONCE() to read ring provided buffer lengths") already switched
the reads of the len field to READ_ONCE(). Do the same for bid and addr.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
Cc: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b6c5f9454ef34fd2753ba7843ef4d9a295c43eee ]
netdev ops must be called under instance lock or rtnl_lock, but
io_register_zcrx_ifq() isn't doing this for netdev_queue_get_dma_dev().
Fix this by taking the instance lock using netdev_get_by_index_lock().
Extended the instance lock section to include attaching a memory
provider. Could not move io_zcrx_create_area() outside, since the dmabuf
codepath IORING_ZCRX_AREA_DMABUF requires ifq->dev.
Fixes: 59b8b32ac8d4 ("io_uring/zcrx: add support for custom DMA devices")
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 93e197e524b14d185d011813b72773a1a49d932d ]
IORING_SETUP_NO_MMAP rings remain user accessible even before the ctx
setup is finalised, so use WRITE_ONCE consistently when initialising
rings.
Fixes: 03d89a2de25bb ("io_uring: support for user allocated memory for rings/sqes")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
I started to see zcrx data corruptions. That turned out to be due
to CQ tail pointing to a stale entry which happened to be from
a zcrx request. I.e. the tail is incremented without the CQE
memory being changed.
The culprit is __io_cqring_overflow_flush() passing "cqe32=true"
to io_get_cqe_overflow() for non-mixed CQE32 setups, which only
expects it to be set for mixed 32B CQEs and not for SETUP_CQE32.
The fix is slightly hacky, long term it's better to unify mixed and
CQE32 handling.
Fixes: e26dca67fde19 ("io_uring: add support for IORING_SETUP_CQE_MIXED")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When support for vectored registered buffers was added, the import
itself is using 'req' rather than the notification io_kiocb, sr->notif.
For non-vectored imports, sr->notif is correctly used. This is important
as the lifetime of the two may be different. Use the correct io_kiocb
for the vectored buffer import.
Cc: stable@vger.kernel.org
Fixes: 23371eac7d9a ("io_uring/net: implement vectored reg bufs for zctx")
Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-463332873@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If timestamp retriving needs to be retried and the local list of
SKB's already has entries, then it's spliced back into the socket
queue. However, the arguments for the splice helper are transposed,
causing exactly the wrong direction of splicing into the on-stack
list. Fix that up.
Cc: stable@vger.kernel.org
Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-462435176@google.com>
Fixes: 9e4ed359b8ef ("io_uring/netcmd: add tx timestamping cmd support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
io_buffer_register_bvec() currently uses blk_rq_nr_phys_segments() as
the number of bvecs in the request. However, bvecs may be split into
multiple segments depending on the queue limits. Thus, the number of
segments may overestimate the number of bvecs. For ublk devices, the
only current users of io_buffer_register_bvec(), virt_boundary_mask,
seg_boundary_mask, max_segments, and max_segment_size can all be set
arbitrarily by the ublk server process.
Set imu->nr_bvecs based on the number of bvecs the rq_for_each_bvec()
loop actually yields. However, continue using blk_rq_nr_phys_segments()
as an upper bound on the number of bvecs when allocating imu to avoid
needing to iterate the bvecs a second time.
Link: https://lore.kernel.org/io-uring/20251111191530.1268875-1-csander@purestorage.com/
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 27cb27b6d5ea ("io_uring: add support for kernel registered bvecs")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It's useful to know which query opcodes are available. Extend the
structure and return that. It's a trivial change, and even though it can
be painlessly extended later, it'd still require adding a v2 of the
structure.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A previous commit reused the recyling infrastructure for early cleanup,
but this is not enough for the case where our internal caches have
overflowed. If this happens, then the allocated iovec can get leaked if
the request is also aborted early.
Reinstate the previous forced free of the iovec for that situation.
Cc: stable@vger.kernel.org
Reported-by: syzbot+3c93637d7648c24e1fd0@syzkaller.appspotmail.com
Tested-by: syzbot+3c93637d7648c24e1fd0@syzkaller.appspotmail.com
Fixes: 9ac273ae3dc2 ("io_uring/rw: use io_rw_recycle() from cleanup path")
Link: https://lore.kernel.org/io-uring/69122a59.a70a0220.22f260.00fd.GAE@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a report of io_estimate_bvec_size() truncating the calculated
number of segments that leads to corruption issues. Check it doesn't
overflow "int"s used later. Rough but simple, can be improved on top.
Cc: stable@vger.kernel.org
Fixes: 9ef4cbbcb4ac3 ("io_uring: add infra for importing vectored reg buffers")
Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-458654612@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Günther Noack <gnoack@google.com>
Tested-by: Günther Noack <gnoack@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
->nr_pages is int, it needs type extension before calculating the region
size.
Fixes: a90558b36ccee ("io_uring/memmap: helper for pinning region pages")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: style fixup]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a better way to handle the problem IORING_REGISTER_ZCRX_REFILL
solves. The uapi can also be slightly adjusted to accommodate future
extensions. Remove the feature for now, it'll be reworked for the next
release.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 620a50c92700 ("io_uring: uring_cmd: add multishot support") added
multishot uring_cmd support with explicit buffer upfront commit via
io_uring_mshot_cmd_post_cqe(). However, the buffer selection path in
io_ring_buffer_select() was auto-committing buffers for non-pollable files,
which conflicts with uring_cmd's explicit upfront commit model.
This way consumes the whole selected buffer immediately, and causes
failure on the following buffer selection.
Fix this by checking uring_cmd to identify operations that handle buffer
commit explicitly, and skip auto-commit for these operations.
Cc: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 620a50c92700 ("io_uring: uring_cmd: add multishot support")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The __must_hold annotation references &req->ctx->uring_lock, but req
is not in scope in io_install_fixed_file. This change updates the
annotation to reference the correct ctx->uring_lock.
improving code clarity.
Fixes: f110ed8498af ("io_uring: split out fixed file installation and removal")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Fix the indentation to ensure consistent code style and improve
readability and to fix the errors:
ERROR: code indent should use tabs where possible
+ return io_net_import_vec(req, kmsg, sr->buf, sr->len, ITER_SOURCE);$
ERROR: code indent should use tabs where possible
+^I^I^I struct io_big_cqe *big_cqe)$
Tested by running the /scripts/checkpatch.pl
Signed-off-by: Ranganath V N <vnranganath.20@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The current approach is a bit naive, and hence calls the time querying
way too often. Only start the "doing work" timer when there's actual
work to do, and then use that information to terminate (and account) the
work time once done. This greatly reduces the frequency of these calls,
when they cannot have changed anyway.
Running a basic random reader that is setup to use SQPOLL, a profile
before this change shows these as the top cycle consumers:
+ 32.60% iou-sqp-1074 [kernel.kallsyms] [k] thread_group_cputime_adjusted
+ 19.97% iou-sqp-1074 [kernel.kallsyms] [k] thread_group_cputime
+ 12.20% io_uring io_uring [.] submitter_uring_fn
+ 4.13% iou-sqp-1074 [kernel.kallsyms] [k] getrusage
+ 2.45% iou-sqp-1074 [kernel.kallsyms] [k] io_submit_sqes
+ 2.18% iou-sqp-1074 [kernel.kallsyms] [k] __pi_memset_generic
+ 2.09% iou-sqp-1074 [kernel.kallsyms] [k] cputime_adjust
and after this change, top of profile looks as follows:
+ 36.23% io_uring io_uring [.] submitter_uring_fn
+ 23.26% iou-sqp-819 [kernel.kallsyms] [k] io_sq_thread
+ 10.14% iou-sqp-819 [kernel.kallsyms] [k] io_sq_tw
+ 6.52% iou-sqp-819 [kernel.kallsyms] [k] tctx_task_work_run
+ 4.82% iou-sqp-819 [kernel.kallsyms] [k] nvme_submit_cmds.part.0
+ 2.91% iou-sqp-819 [kernel.kallsyms] [k] io_submit_sqes
[...]
0.02% iou-sqp-819 [kernel.kallsyms] [k] cputime_adjust
where it's spending the cycles on things that actually matter.
Reported-by: Fengnan Chang <changfengnan@bytedance.com>
Cc: stable@vger.kernel.org
Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
getrusage() does a lot more than what the SQPOLL accounting needs, the
latter only cares about (and uses) the stime. Rather than do a full
RUSAGE_SELF summation, just query the used stime instead.
Cc: stable@vger.kernel.org
Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The negation operator is incorrectly placed outside the unlikely()
macro:
if (!unlikely(iwa))
This inverts the compiler branch prediction hint, marking the NULL case
as likely instead of unlikely. The intent is to indicate that allocation
failures are rare, consistent with common kernel patterns.
Moving the negation inside unlikely():
if (unlikely(!iwa))
Fixes: 2b4fc4cd43f2 ("io_uring/waitid: setup async data in the prep handler")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Both the read and write side use kiocb_done() to finish a request, and
kiocb_done() will call io_put_kbuf() in case a provided buffer was used
for the request. Provided buffers are not supported for writes, hence
NULL is being passed in. This normally works fine, as io_put_kbuf()
won't actually use the value unless REQ_F_BUFFER_RING or
REQ_F_BUFFER_SELECTED is set in the request flags. But depending on
compiler (or whether or not CONFIG_CC_OPTIMIZE_FOR_SIZE is set), that
may be done even though the value is never used. This will then cause a
NULL pointer dereference.
Make it a bit more obvious and check for a NULL io_br_sel, and don't
even bother calling io_put_kbuf() for that case.
Fixes: 5fda51255439 ("io_uring/kbuf: switch to storing struct io_buffer_list locally")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There might be many reasons why a user is resizing a ring, e.g. moving
to huge pages or for some memory compaction using IORING_SETUP_NO_MMAP.
Don't bypass resizing, the user will definitely be surprised seeing 0
while the rings weren't actually moved to a new place.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|