<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/arch/powerpc/lib/qspinlock.c, branch linux-rolling-stable</title>
<subtitle>Hosts the 0x221E Linux distro kernel.</subtitle>
<id>https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-rolling-stable</id>
<link rel='self' href='https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-rolling-stable'/>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/'/>
<updated>2025-09-01T08:08:12Z</updated>
<entry>
<title>powerpc/qspinlock: Add spinlock contention tracepoint</title>
<updated>2025-09-01T08:08:12Z</updated>
<author>
<name>Nysal Jan K.A.</name>
<email>nysal@linux.ibm.com</email>
</author>
<published>2025-07-31T06:18:53Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=4f61d54d2245c15b23ad78a89f854fb2496b6216'/>
<id>urn:sha1:4f61d54d2245c15b23ad78a89f854fb2496b6216</id>
<content type='text'>
Add a lock contention tracepoint in the queued spinlock slowpath.
Also add the __lockfunc annotation so that in_lock_functions()
works as expected.
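
Roughly sketched (the generic lock contention tracepoints are assumed
here; this is not the exact diff), the change has the following shape.
__lockfunc places the function in the .spinlock.text section, which is
what in_lock_functions() tests:

   void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock)
   {
           trace_contention_begin(lock, LCB_F_SPIN);
           /* ... existing queueing and spinning logic ... */
           trace_contention_end(lock, 0);
   }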

Signed-off-by: Nysal Jan K.A. &lt;nysal@linux.ibm.com&gt;
Reviewed-by: Shrikanth Hegde &lt;sshegde@linux.ibm.com&gt;
Signed-off-by: Madhavan Srinivasan &lt;maddy@linux.ibm.com&gt;
Link: https://patch.msgid.link/20250731061856.1858898-1-nysal@linux.ibm.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: Fix deadlock in MCS queue</title>
<updated>2024-08-29T05:12:51Z</updated>
<author>
<name>Nysal Jan K.A.</name>
<email>nysal@linux.ibm.com</email>
</author>
<published>2024-08-29T02:28:27Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=734ad0af3609464f8f93e00b6c0de1e112f44559'/>
<id>urn:sha1:734ad0af3609464f8f93e00b6c0de1e112f44559</id>
<content type='text'>
If an interrupt occurs in queued_spin_lock_slowpath() after we increment
qnodesp-&gt;count and before node-&gt;lock is initialized, another CPU might
see stale lock values in get_tail_qnode(). If the stale lock value happens
to match the lock on that CPU, then we write to the "next" pointer of
the wrong qnode. This causes a deadlock as the former CPU, once it becomes
the head of the MCS queue, will spin indefinitely until its "next" pointer
is set by its successor in the queue.
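
In a rough sketch of the slowpath, the window sits here:

   qnodesp = this_cpu_ptr(&amp;qnodes);
   idx = qnodesp-&gt;count++;
   /*
    * An interrupt here leaves nodes[idx].lock holding the stale
    * value from a previous acquisition, which another CPU's
    * get_tail_qnode() can still match against.
    */
   node = &amp;qnodesp-&gt;nodes[idx];
   node-&gt;lock = lock;
   node-&gt;next = NULL;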

Running stress-ng on a 16-core (16EC/16VP) shared LPAR results in
occasional lockups similar to the following:

   $ stress-ng --all 128 --vm-bytes 80% --aggressive \
               --maximize --oomable --verify  --syslog \
               --metrics  --times  --timeout 5m

   watchdog: CPU 15 Hard LOCKUP
   ......
   NIP [c0000000000b78f4] queued_spin_lock_slowpath+0x1184/0x1490
   LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
   Call Trace:
    0xc000002cfffa3bf0 (unreliable)
    _raw_spin_lock+0x6c/0x90
    raw_spin_rq_lock_nested.part.135+0x4c/0xd0
    sched_ttwu_pending+0x60/0x1f0
    __flush_smp_call_function_queue+0x1dc/0x670
    smp_ipi_demux_relaxed+0xa4/0x100
    xive_muxed_ipi_action+0x20/0x40
    __handle_irq_event_percpu+0x80/0x240
    handle_irq_event_percpu+0x2c/0x80
    handle_percpu_irq+0x84/0xd0
    generic_handle_irq+0x54/0x80
    __do_irq+0xac/0x210
    __do_IRQ+0x74/0xd0
    0x0
    do_IRQ+0x8c/0x170
    hardware_interrupt_common_virt+0x29c/0x2a0
   --- interrupt: 500 at queued_spin_lock_slowpath+0x4b8/0x1490
   ......
   NIP [c0000000000b6c28] queued_spin_lock_slowpath+0x4b8/0x1490
   LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
   --- interrupt: 500
    0xc0000029c1a41d00 (unreliable)
    _raw_spin_lock+0x6c/0x90
    futex_wake+0x100/0x260
    do_futex+0x21c/0x2a0
    sys_futex+0x98/0x270
    system_call_exception+0x14c/0x2f0
    system_call_vectored_common+0x15c/0x2ec

The following code flow illustrates how the deadlock occurs.
For the sake of brevity, assume that both locks (A and B) are
contended and we call the queued_spin_lock_slowpath() function.

        CPU0                                   CPU1
        ----                                   ----
  spin_lock_irqsave(A)                          |
  spin_unlock_irqrestore(A)                     |
    spin_lock(B)                                |
         |                                      |
         ▼                                      |
   id = qnodesp-&gt;count++;                       |
  (Note that nodes[0].lock == A)                |
         |                                      |
         ▼                                      |
      Interrupt                                 |
  (happens before "nodes[0].lock = B")          |
         |                                      |
         ▼                                      |
  spin_lock_irqsave(A)                          |
         |                                      |
         ▼                                      |
   id = qnodesp-&gt;count++                        |
   nodes[1].lock = A                            |
         |                                      |
         ▼                                      |
  Tail of MCS queue                             |
         |                             spin_lock_irqsave(A)
         ▼                                      |
  Head of MCS queue                             ▼
         |                             CPU0 is previous tail
         ▼                                      |
   Spin indefinitely                            ▼
  (until "nodes[1].next != NULL")      prev = get_tail_qnode(A, CPU0)
                                                |
                                                ▼
                                       prev == &amp;qnodes[CPU0].nodes[0]
                                     (as qnodes[CPU0].nodes[0].lock == A)
                                                |
                                                ▼
                                       WRITE_ONCE(prev-&gt;next, node)
                                                |
                                                ▼
                                        Spin indefinitely
                                     (until nodes[0].locked == 1)
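
The wrong match happens in get_tail_qnode(), whose lookup loop
(approximately) compares lock pointers and can therefore hit a stale
entry:

   for (idx = 0; idx &lt; MAX_NODES; idx++) {
           struct qnode *qnode = &amp;qnodesp-&gt;nodes[idx];
           if (qnode-&gt;lock == lock)        /* may match a stale entry */
                   return qnode;
   }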

Thanks to Saket Kumar Bhaskar for help with recreating the issue.

Fixes: 84990b169557 ("powerpc/qspinlock: add mcs queueing for contended waiters")
Cc: stable@vger.kernel.org # v6.2+
Reported-by: Geetika Moolchandani &lt;geetika@linux.ibm.com&gt;
Reported-by: Vaishnavi Bhat &lt;vaish123@in.ibm.com&gt;
Reported-by: Jijo Varghese &lt;vargjijo@in.ibm.com&gt;
Signed-off-by: Nysal Jan K.A. &lt;nysal@linux.ibm.com&gt;
Reviewed-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://msgid.link/20240829022830.1164355-1-nysal@linux.ibm.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: Rename yield_propagate_owner tunable</title>
<updated>2023-10-20T11:43:34Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2023-10-16T12:43:05Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=b629b541702b082d161c307c9d7d9867911fc933'/>
<id>urn:sha1:b629b541702b082d161c307c9d7d9867911fc933</id>
<content type='text'>
Rename yield_propagate_owner to yield_sleepy_owner, which better
describes what it does (what, not how).

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Tested-by: Shrikanth Hegde &lt;sshegde@linux.vnet.ibm.com&gt;
Reviewed-by: "Nysal Jan K.A" &lt;nysal@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://msgid.link/20231016124305.139923-7-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: Propagate sleepy if previous waiter is preempted</title>
<updated>2023-10-20T11:43:34Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2023-10-16T12:43:04Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=1e6d5f725738fff83e1c3cbdb8547142eb8820c2'/>
<id>urn:sha1:1e6d5f725738fff83e1c3cbdb8547142eb8820c2</id>
<content type='text'>
The sleepy (aka lock-owner-is-preempted) condition is propagated down
the queue by each waiter. If a waiter becomes preempted, it can no
longer propagate sleepy. To allow subsequent waiters to yield to the
lock owner, also check the lock owner in this case.
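
A hedged sketch of the idea, from the perspective of a queued waiter
(decode_owner_cpu() is a hypothetical helper standing in for reading
the owner out of the lock word):

   if (vcpu_is_preempted(prev_cpu)) {
           /*
            * prev cannot propagate sleepy while it is preempted,
            * so check the lock owner directly as well.
            */
           u32 val = READ_ONCE(lock-&gt;val);
           int owner = decode_owner_cpu(val);      /* hypothetical */
           if (owner != -1 &amp;&amp; vcpu_is_preempted(owner))
                   node-&gt;sleepy = true;
   }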

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Tested-by: Shrikanth Hegde &lt;sshegde@linux.vnet.ibm.com&gt;
Reviewed-by: "Nysal Jan K.A" &lt;nysal@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://msgid.link/20231016124305.139923-6-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: don't propagate the not-sleepy state</title>
<updated>2023-10-20T11:43:34Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2023-10-16T12:43:03Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=fcf77d44274b96a55cc74043561a4b3151b9ad24'/>
<id>urn:sha1:fcf77d44274b96a55cc74043561a4b3151b9ad24</id>
<content type='text'>
To simplify things, don't propagate the not-sleepy condition back down
the queue. Instead, have the waiters clear their own node-&gt;sleepy when
finding the lock owner is not preempted.
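
Sketched (not the exact diff), the waiter-local clearing amounts to:

   /*
    * No back-propagation of "not sleepy": each waiter clears its
    * own flag once it sees a running owner.
    */
   if (node-&gt;sleepy &amp;&amp; !vcpu_is_preempted(owner_cpu))
           node-&gt;sleepy = false;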

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Tested-by: Shrikanth Hegde &lt;sshegde@linux.vnet.ibm.com&gt;
Reviewed-by: "Nysal Jan K.A" &lt;nysal@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://msgid.link/20231016124305.139923-5-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: propagate owner preemptedness rather than CPU number</title>
<updated>2023-10-20T11:43:34Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2023-10-16T12:43:02Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=fd8fae50c9c6117c4e05954deab2cc515508666b'/>
<id>urn:sha1:fd8fae50c9c6117c4e05954deab2cc515508666b</id>
<content type='text'>
Rather than propagating the CPU number of the preempted lock owner,
just propagate whether the owner was preempted. Waiters must read the
lock value when yielding to it to prevent races anyway, so might as
well always load the owner CPU from the lock.

To further simplify the code, also don't propagate the -1 (or
sleepy=false in the new scheme) down the queue. Instead, have the
waiters clear it themselves when finding the lock owner is not
preempted.
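
Schematically (a sketch using the field names from this series):

   /* before: pass the owner's CPU number down the queue */
   node-&gt;yield_cpu = owner_cpu;

   /* after: pass only the condition; the owner CPU is always
    * (re)loaded from the lock word at the point of yielding */
   node-&gt;sleepy = true;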

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Tested-by: Shrikanth Hegde &lt;sshegde@linux.vnet.ibm.com&gt;
Reviewed-by: "Nysal Jan K.A" &lt;nysal@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://msgid.link/20231016124305.139923-4-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: stop queued waiters trying to set lock sleepy</title>
<updated>2023-10-20T11:43:34Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2023-10-16T12:43:01Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=f6568647382c667912245c8d07aa26c9c6d4d0c8'/>
<id>urn:sha1:f6568647382c667912245c8d07aa26c9c6d4d0c8</id>
<content type='text'>
If a queued waiter notices the lock owner or the previous waiter has
been preempted, it attempts to mark the lock sleepy, but it does this
as a try-set operation using the original lock value it got when
queueing; that value becomes stale as the queue progresses, so the
try-set fails. Drop this and just set the sleepy-seen clock.
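
The dropped try-set had roughly this shape (a sketch; _Q_SLEEPY_VAL
stands in for the real flag bit). Updating the sleepy-seen clock
instead avoids touching the lock word at all:

   u32 old = val;                  /* sampled when we queued; stale */
   u32 new = old | _Q_SLEEPY_VAL;
   /* almost always fails once the queue has progressed */
   cmpxchg(&amp;lock-&gt;val, old, new);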

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Tested-by: Shrikanth Hegde &lt;sshegde@linux.vnet.ibm.com&gt;
Reviewed-by: "Nysal Jan K.A" &lt;nysal@linux.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://msgid.link/20231016124305.139923-3-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: Fix stale propagated yield_cpu</title>
<updated>2023-10-18T10:07:21Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2023-10-16T12:43:00Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=f9bc9bbe8afdf83412728f0b464979a72a3b9ec2'/>
<id>urn:sha1:f9bc9bbe8afdf83412728f0b464979a72a3b9ec2</id>
<content type='text'>
yield_cpu is a sample of a preempted lock holder that gets propagated
back through the queue. Queued waiters use this to yield to the
preempted lock holder without continually sampling the lock word (which
would defeat the purpose of MCS queueing by bouncing the cache line).

The problem is that yield_cpu can become stale. It can take some time to
be passed down the chain, and if any queued waiter gets preempted then
it will cease to propagate the yield_cpu to later waiters.

This can result in yielding to a CPU that no longer holds the lock,
which is bad enough, but if that CPU is currently in H_CEDE (idle) it
appears to be preempted, and some hypervisors (PowerVM) can cause very
long H_CONFER latencies waiting for H_CEDE wakeup. This
results in latency spikes and hard lockups on oversubscribed
partitions with lock contention.

This is a minimal fix. Before yielding to yield_cpu, sample the lock
word to confirm yield_cpu is still the owner, and bail out if it is not.
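
A hedged sketch of the added check (assuming get_owner_cpu() extracts
the owner from the lock word):

   u32 val = READ_ONCE(lock-&gt;val);
   if (get_owner_cpu(val) != yield_cpu)
           return;         /* stale yield_cpu: the owner has changed */
   yield_to_preempted(yield_cpu, yield_count);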

Thanks to a bunch of people who reported this and tracked down the
exact problem using tracepoints and dispatch trace logs.

Fixes: 28db61e207ea ("powerpc/qspinlock: allow propagation of yield CPU down the queue")
Cc: stable@vger.kernel.org # v6.2+
Reported-by: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Reported-by: Laurent Dufour &lt;ldufour@linux.ibm.com&gt;
Reported-by: Shrikanth Hegde &lt;sshegde@linux.vnet.ibm.com&gt;
Debugged-by: "Nysal Jan K.A" &lt;nysal@linux.ibm.com&gt;
Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Tested-by: Shrikanth Hegde &lt;sshegde@linux.vnet.ibm.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://msgid.link/20231016124305.139923-2-npiggin@gmail.com
</content>
</entry>
<entry>
<title>powerpc: qspinlock: Enforce qnode writes prior to publishing to queue</title>
<updated>2023-06-21T05:13:57Z</updated>
<author>
<name>Rohan McLure</name>
<email>rmclure@linux.ibm.com</email>
</author>
<published>2023-05-10T03:31:08Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=6f3136326ee47ae2dd5dac9306c9b08ccbc7e81e'/>
<id>urn:sha1:6f3136326ee47ae2dd5dac9306c9b08ccbc7e81e</id>
<content type='text'>
Annotate the release barrier and memory clobber (in effect, producing a
compiler barrier) in the publish_tail_cpu call. These barriers have the
effect of ensuring that qnode attributes are all written to prior to
publishing the node to the waitqueue.

Even though the initial write to the 'locked' attribute is guaranteed to
complete before the node becomes visible, KCSAN still complains that
the write is reorderable by the compiler. Issue a kcsan_release() to
inform KCSAN of the release barrier contained in publish_tail_cpu().
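
The annotation sits immediately before the publish (a sketch close to
the actual change):

   /*
    * Assign all attributes of a node before it can be published.
    * publish_tail_cpu() issues a release barrier that also acts as
    * a compiler barrier; kcsan_release() tells KCSAN about it.
    */
   kcsan_release();
   old = publish_tail_cpu(lock, tail);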

Signed-off-by: Rohan McLure &lt;rmclure@linux.ibm.com&gt;
Reviewed-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://msgid.link/20230510033117.1395895-3-rmclure@linux.ibm.com

</content>
</entry>
<entry>
<title>powerpc: qspinlock: Mark accesses to qnode lock checks</title>
<updated>2023-06-21T05:13:57Z</updated>
<author>
<name>Rohan McLure</name>
<email>rmclure@linux.ibm.com</email>
</author>
<published>2023-05-10T03:31:07Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=03d44ee80eac980a869ed3d5637ed85de6fb957f'/>
<id>urn:sha1:03d44ee80eac980a869ed3d5637ed85de6fb957f</id>
<content type='text'>
The powerpc implementation of qspinlocks will both poll and spin on the
bitlock guarding a qnode. Mark these accesses with READ_ONCE to convey
to KCSAN that polling is intentional here.
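
Sketched, the annotated polling loop looks like:

   /*
    * Spin until our MCS node's bitlock is released; READ_ONCE marks
    * the racy repeated load as intentional for KCSAN.
    */
   while (!READ_ONCE(node-&gt;locked))
           cpu_relax();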

Signed-off-by: Rohan McLure &lt;rmclure@linux.ibm.com&gt;
Reviewed-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://msgid.link/20230510033117.1395895-2-rmclure@linux.ibm.com

</content>
</entry>
</feed>
