<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/include/linux/sched/signal.h, branch linux-6.18.y</title>
<subtitle>Hosts the 0x221E linux distro kernel.</subtitle>
<id>https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.18.y</id>
<link rel='self' href='https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.18.y'/>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/'/>
<updated>2025-09-10T17:44:51Z</updated>
<entry>
<title>cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs</title>
<updated>2025-09-10T17:44:51Z</updated>
<author>
<name>Yi Tao</name>
<email>escape@linux.alibaba.com</email>
</author>
<published>2025-09-10T06:59:35Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=0568f89d4fb82d98001baeb870e92f43cd1f7317'/>
<id>urn:sha1:0568f89d4fb82d98001baeb870e92f43cd1f7317</id>
<content type='text'>
The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write
locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
patch.

To avoid affecting other users, the per threadgroup rwsem is only used
when the favordynmods is enabled.

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch modifies lock to take the write
lock on signal_struct-&gt;group_rwsem when writing pid to
cgroup.procs/threads instead of holding a global write lock.

Cgroup process migration has historically relied on
signal_struct-&gt;group_rwsem to protect thread group integrity. In commit
&lt;1ed1328792ff&gt; ("sched, cgroup: replace signal_struct-&gt;group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&amp;g_rwsem)    | cgroup_lock()
down_write(&amp;p_rwsem)        | down_read(&amp;p_rwsem)    | down_write(&amp;g_rwsem)
critical section            | critical section       | critical section
up_write(&amp;p_rwsem)          | up_read(&amp;p_rwsem)      | up_write(&amp;g_rwsem)
cgroup_unlock()             | up_read(&amp;g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct-&gt;group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.

tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to
    pr_warn_once().

Signed-off-by: Yi Tao &lt;escape@linux.alibaba.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>posix-timers: Provide a mechanism to allocate a given timer ID</title>
<updated>2025-03-13T11:07:18Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2025-03-11T22:07:44Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=ec2d0c04624b3c8a7eb1682e006717fa20cfbe24'/>
<id>urn:sha1:ec2d0c04624b3c8a7eb1682e006717fa20cfbe24</id>
<content type='text'>
Checkpoint/Restore in Userspace (CRIU) requires to reconstruct posix timers
with the same timer ID on restore. It uses sys_timer_create() and relies on
the monotonic increasing timer ID provided by this syscall. It creates and
deletes timers until the desired ID is reached. This is can loop for a long
time, when the checkpointed process had a very sparse timer ID range.

It has been debated to implement a new syscall to allow the creation of
timers with a given timer ID, but that's tideous due to the 32/64bit compat
issues of sigevent_t and of dubious value.

The restore mechanism of CRIU creates the timers in a state where all
threads of the restored process are held on a barrier and cannot issue
syscalls. That means the restorer task has exclusive control.

This allows to address this issue with a prctl() so that the restorer
thread can do:

   if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON))
      goto linear_mode;
   create_timers_with_explicit_ids();
   prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF);
   
This is backwards compatible because the prctl() fails on older kernels and
CRIU can fall back to the linear timer ID mechanism. CRIU versions which do
not know about the prctl() just work as before.

Implement the prctl() and modify timer_create() so that it copies the
requested timer ID from userspace by utilizing the existing timer_t
pointer, which is used to copy out the allocated timer ID on success.

If the prctl() is disabled, which it is by default, timer_create() works as
before and does not try to read from the userspace pointer.

There is no problem when a broken or rogue user space application enables
the prctl(). If the user space pointer does not contain a valid ID, then
timer_create() fails. If the data is not initialized, but constains a
random valid ID, timer_create() will create that random timer ID or fail if
the ID is already given out. 
 
As CRIU must use the raw syscall to avoid manipulating the internal state
of the restored process, this has no library dependencies and can be
adopted by CRIU right away.

Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with
the create/delete method. With the prctl() it takes 3 microseconds.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Reviewed-by: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Tested-by: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Link: https://lore.kernel.org/all/87jz8vz0en.ffs@tglx

</content>
</entry>
<entry>
<title>posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t</title>
<updated>2025-03-13T11:07:17Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2025-03-08T16:48:36Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=feb864ee99a2d8a22800342388401f3a3b90d42b'/>
<id>urn:sha1:feb864ee99a2d8a22800342388401f3a3b90d42b</id>
<content type='text'>
The global hash_lock protecting the posix timer hash table can be heavily
contended especially when there is an extensive linear search for a timer
ID.

Timer IDs are handed out by monotonically increasing next_posix_timer_id
and then validating that there is no timer with the same ID in the hash
table. Both operations happen with the global hash lock held.

To reduce the hash lock contention the hash will be reworked to a scaled
hash with per bucket locks, which requires to handle the ID counter
lockless.

Prepare for this by making next_posix_timer_id an atomic_t, which can be
used lockless with atomic_inc_return().

[ tglx: Adopted from Eric's series, massaged change log and simplified it ]

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Acked-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155624.151545978@linutronix.de


</content>
</entry>
<entry>
<title>signal: Provide ignored_posix_timers list</title>
<updated>2024-11-07T01:14:45Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2024-11-05T08:14:49Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=69f032c92cf883ea74a4b69ba3d91317aa6f174e'/>
<id>urn:sha1:69f032c92cf883ea74a4b69ba3d91317aa6f174e</id>
<content type='text'>
To prepare for handling posix timer signals on sigaction(SIG_IGN) properly,
add a list to task::signal.

This list will be used to queue posix timers so their signal can be
requeued when SIG_IGN is lifted later.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/all/20241105064213.920101900@linutronix.de


</content>
</entry>
<entry>
<title>signal: Cleanup unused posix-timer leftovers</title>
<updated>2024-11-07T01:14:44Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2024-11-05T08:14:46Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=c2a4796a154bb952be1106911841aab2c8c17c4d'/>
<id>urn:sha1:c2a4796a154bb952be1106911841aab2c8c17c4d</id>
<content type='text'>
Remove the leftovers of sigqueue preallocation as it's not longer used.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/all/20241105064213.786506636@linutronix.de

</content>
</entry>
<entry>
<title>signal: Refactor send_sigqueue()</title>
<updated>2024-11-07T01:14:44Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2024-11-05T08:14:42Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=0360ed14d9826678a50fa2b873e522a24cd3c018'/>
<id>urn:sha1:0360ed14d9826678a50fa2b873e522a24cd3c018</id>
<content type='text'>
To handle posix timers which have their signal ignored via SIG_IGN properly
it is required to requeue a ignored signal for delivery when SIG_IGN is
lifted so the timer gets rearmed.

Split the required code out of send_sigqueue() so it can be reused in
context of sigaction().

While at it rename send_sigqueue() to posixtimer_send_sigqueue() so its
clear what this is about.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/all/20241105064213.586453412@linutronix.de

</content>
</entry>
<entry>
<title>posix-timers: Cure si_sys_private race</title>
<updated>2024-10-29T10:43:18Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2024-10-01T08:42:03Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=4febce44cfebcb490b196d5d10ae9f403ca4c956'/>
<id>urn:sha1:4febce44cfebcb490b196d5d10ae9f403ca4c956</id>
<content type='text'>
The si_sys_private member of the siginfo which is embedded in the
preallocated sigqueue is used by the posix timer code to decide whether a
timer must be reprogrammed on signal delivery.

The handling of this is racy as a long standing comment in that code
documents. It is modified with the timer lock held, but without sighand
lock being held. The actual signal delivery code checks for it under
sighand lock without holding the timer lock.

Hand the new value to send_sigqueue() as argument and store it with sighand
lock held. This is an intermediate change to address this issue.

The arguments to this function will be cleanup in subsequent changes.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/all/20241001083835.434338954@linutronix.de

</content>
</entry>
<entry>
<title>signal: Remove task argument from dequeue_signal()</title>
<updated>2024-07-29T19:57:35Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2024-06-10T16:42:33Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=a2b80ce87a87fc18c594e74d13031d5e347b69cb'/>
<id>urn:sha1:a2b80ce87a87fc18c594e74d13031d5e347b69cb</id>
<content type='text'>
The task pointer which is handed to dequeue_signal() is always current. The
argument along with the first comment about signalfd in that function is
confusing at best. Remove it and use current internally.

Update the stale comment for dequeue_signal() while at it.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
</content>
</entry>
<entry>
<title>posix-timers: Convert timer list to hlist</title>
<updated>2024-07-29T19:57:35Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2024-06-10T16:42:28Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=52dea0a15cc888e89f0144dca7b712fb56d12a28'/>
<id>urn:sha1:52dea0a15cc888e89f0144dca7b712fb56d12a28</id>
<content type='text'>
No requirement for a real list. Spare a few bytes.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
</content>
</entry>
<entry>
<title>pidfd: exit: kill the no longer used thread_group_exited()</title>
<updated>2024-02-06T13:02:51Z</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2024-02-05T17:43:47Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=e2e8a142fbd988d658ccb3da1d6f4b26a39de0fd'/>
<id>urn:sha1:e2e8a142fbd988d658ccb3da1d6f4b26a39de0fd</id>
<content type='text'>
It was used by pidfd_poll() but now it has no callers.

If it finally finds a modular user we can revert this change, but note
that the comment above this helper and the changelog in 38fd525a4c61
("exit: Factor thread_group_exited out of pidfd_poll") are not accurate,
thread_group_exited() won't return true if all other threads have passed
exit_notify() and are zombies, it returns true only when all other threads
are completely gone. Not to mention that it can only work if the task
identified by @pid is a thread-group leader.

Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Link: https://lore.kernel.org/r/20240205174347.GA31461@redhat.com
Reviewed-by: Tycho Andersen &lt;tandersen@netflix.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
</feed>
