<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/arch/powerpc/include/asm/qspinlock.h, branch linux-6.2.y</title>
<subtitle>Hosts the 0x221E linux distro kernel.</subtitle>
<id>https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.2.y</id>
<link rel='self' href='https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.2.y'/>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/'/>
<updated>2022-12-02T06:48:50Z</updated>
<entry>
<title>powerpc/qspinlock: add compile-time tuning adjustments</title>
<updated>2022-12-02T06:48:50Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2022-11-26T09:59:32Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=0b2199841a7952d01a717b465df028b40b2cf3e9'/>
<id>urn:sha1:0b2199841a7952d01a717b465df028b40b2cf3e9</id>
<content type='text'>
This adds compile-time options that allow the EH lock hint bit to be
enabled or disabled, and adds some new options that may or may not
help matters. To help with experimentation and tuning.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20221126095932.1234527-18-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: allow lock stealing in trylock and lock fastpath</title>
<updated>2022-12-02T06:48:50Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2022-11-26T09:59:27Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=f61ab43cc1a6146d6eef7e0713a452c3677ad13e'/>
<id>urn:sha1:f61ab43cc1a6146d6eef7e0713a452c3677ad13e</id>
<content type='text'>
This change allows trylock to steal the lock. It also allows the initial
lock attempt to steal the lock rather than bailing out and going to the
slow path.

This gives trylock more strength: without this a continually-contended
lock will never permit a trylock to succeed. With this change, the
trylock has a small but non-zero chance.

It also gives the lock fastpath most of the benefit of passing the
reservation back through to the steal loop in the slow path without the
complexity.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20221126095932.1234527-13-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: store owner CPU in lock word</title>
<updated>2022-12-02T06:48:49Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2022-11-26T09:59:21Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=e1a31e7fd7130628cfd229253da2b4630e7a809c'/>
<id>urn:sha1:e1a31e7fd7130628cfd229253da2b4630e7a809c</id>
<content type='text'>
Store the owner CPU number in the lock word so it may be yielded to,
as powerpc's paravirtualised simple spinlocks do.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20221126095932.1234527-7-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: allow new waiters to steal the lock before queueing</title>
<updated>2022-12-02T06:48:49Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2022-11-26T09:59:19Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=6aa42f883c438ea132a28801bef3f86f3883d14c'/>
<id>urn:sha1:6aa42f883c438ea132a28801bef3f86f3883d14c</id>
<content type='text'>
Allow new waiters to "steal" the lock before queueing. That is, to
acquire it while other CPUs have queued.

This particularly helps paravirt performance when physical CPUs are
oversubscribed, by keeping the lock from becoming a strict FIFO and
vCPU preemption causing queue train wrecks.

The new __queued_spin_trylock_steal() function is put in qspinlock.h
to save having to move it, because it will be used there by a later
change.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20221126095932.1234527-5-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: convert atomic operations to assembly</title>
<updated>2022-12-02T06:48:49Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2022-11-26T09:59:18Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=b3a73b7db2b6cb3b2e5bfda5518a0e92230ef673'/>
<id>urn:sha1:b3a73b7db2b6cb3b2e5bfda5518a0e92230ef673</id>
<content type='text'>
This uses more optimal ll/sc style access patterns (rather than
cmpxchg), and also sets the EH=1 lock hint on those operations
which acquire ownership of the lock.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20221126095932.1234527-4-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.</title>
<updated>2022-12-02T06:48:49Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2022-11-26T09:59:17Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=4c93c2e4b9e8988511c06b9c042f23d4b8f593ad'/>
<id>urn:sha1:4c93c2e4b9e8988511c06b9c042f23d4b8f593ad</id>
<content type='text'>
The first 16 bits of the lock are only modified by the owner, and other
modifications always use atomic operations on the entire 32 bits, so
unlocks can use plain stores on the 16 bits. This is the same kind of
optimisation done by core qspinlock code.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20221126095932.1234527-3-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: add mcs queueing for contended waiters</title>
<updated>2022-12-02T06:48:49Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2022-11-26T09:59:16Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=84990b169557428c318df87b7836cd15f65b62dc'/>
<id>urn:sha1:84990b169557428c318df87b7836cd15f65b62dc</id>
<content type='text'>
This forms the basis of the qspinlock slow path.

Like generic qspinlocks and unlike the vanilla MCS algorithm, the lock
owner does not participate in the queue, only waiters. The first waiter
spins on the lock word, then when the lock is released it takes
ownership and unqueues the next waiter. This is how qspinlocks can be
implemented with the spinlock API -- lock owners don't need a node, only
waiters do.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20221126095932.1234527-2-npiggin@gmail.com

</content>
</entry>
<entry>
<title>powerpc/qspinlock: powerpc qspinlock implementation</title>
<updated>2022-12-02T06:48:02Z</updated>
<author>
<name>Nicholas Piggin</name>
<email>npiggin@gmail.com</email>
</author>
<published>2022-11-28T03:11:13Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=9f61521c7a284e799050cd2adacc9a611bd2b491'/>
<id>urn:sha1:9f61521c7a284e799050cd2adacc9a611bd2b491</id>
<content type='text'>
Add a powerpc specific implementation of queued spinlocks. This is the
build framework with a very simple (non-queued) spinlock implementation
to begin with. Later changes add queueing, and other features and
optimisations one-at-a-time. It is done this way to more easily see how
the queued spinlocks are built, and to make performance and correctness
bisects more useful.

Signed-off-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
[mpe: Drop paravirt.h &amp; processor.h changes to fix 32-bit build]
[mpe: Fix 32-bit build of qspinlock.o &amp; disallow GENERIC_LOCKBREAK per Nick]
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/CONLLQB6DCJU.2ZPOS7T6S5GRR@bobo
</content>
</entry>
<entry>
<title>locking/atomic: powerpc: move to ARCH_ATOMIC</title>
<updated>2021-05-26T11:20:52Z</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2021-05-25T14:02:26Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=9eaa82935dccb74a22e3da5045bed1dac59ad2b0'/>
<id>urn:sha1:9eaa82935dccb74a22e3da5045bed1dac59ad2b0</id>
<content type='text'>
We'd like all architectures to convert to ARCH_ATOMIC, as once all
architectures are converted it will be possible to make significant
cleanups to the atomics headers, and this will make it much easier to
generically enable atomic functionality (e.g. debug logic in the
instrumented wrappers).

As a step towards that, this patch migrates powerpc to ARCH_ATOMIC. The
arch code provides arch_{atomic,atomic64,xchg,cmpxchg}*(), and common
code wraps these with optional instrumentation to provide the regular
functions.

While atomic_try_cmpxchg_lock() is not part of the common atomic API, it
is given an `arch_` prefix for consistency.

Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Boqun Feng &lt;boqun.feng@gmail.com&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20210525140232.53872-28-mark.rutland@arm.com
</content>
</entry>
<entry>
<title>powerpc/qspinlock: Use generic smp_cond_load_relaxed</title>
<updated>2021-03-29T01:48:46Z</updated>
<author>
<name>Davidlohr Bueso</name>
<email>dave@stgolabs.net</email>
</author>
<published>2021-03-09T01:59:50Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=deb9b13eb2571fbde164ae012c77985fd14f2f02'/>
<id>urn:sha1:deb9b13eb2571fbde164ae012c77985fd14f2f02</id>
<content type='text'>
49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
busy-waiting pausing with a preferred SMT priority pattern, lowering
the priority (reducing decode cycles) during the whole loop slowpath.

However, data shows that while this pattern works well with simple
spinlocks, queued spinlocks benefit more being kept in medium priority,
with a cpu_relax() instead, being a low+medium combo on powerpc.

Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
2 sockets and 8 threads per core.

1. locktorture.

This is data for the lowest and most artificial/pathological level,
with increasing thread counts pounding on the lock. Metrics are total
ops/minute. Despite some small hits in the 4-8 range, scenarios are
either neutral or favorable to this patch.

+=========+==========+==========+=======+
| # tasks | vanilla  | dirty    | %diff |
+=========+==========+==========+=======+
| 2       | 46718565 | 48751350 | 4.35  |
+---------+----------+----------+-------+
| 4       | 51740198 | 50369082 | -2.65 |
+---------+----------+----------+-------+
| 8       | 63756510 | 62568821 | -1.86 |
+---------+----------+----------+-------+
| 16      | 67824531 | 70966546 | 4.63  |
+---------+----------+----------+-------+
| 32      | 53843519 | 61155508 | 13.58 |
+---------+----------+----------+-------+
| 64      | 53005778 | 53104412 | 0.18  |
+---------+----------+----------+-------+
| 128     | 53331980 | 54606910 | 2.39  |
+=========+==========+==========+=======+

2. sockperf (tcp throughput)

Here a client will do one-way throughput tests to a localhost server, with
increasing message sizes, dealing with the sk_lock. This patch shows to put
the performance of the qspinlock back to par with that of the simple lock:

		     simple-spinlock           vanilla			dirty
Hmean     14        73.50 (   0.00%)       54.44 * -25.93%*       73.45 * -0.07%*
Hmean     100      654.47 (   0.00%)      385.61 * -41.08%*      771.43 * 17.87%*
Hmean     300     2719.39 (   0.00%)     2181.67 * -19.77%*     2666.50 * -1.94%*
Hmean     500     4400.59 (   0.00%)     3390.77 * -22.95%*     4322.14 * -1.78%*
Hmean     850     6726.21 (   0.00%)     5264.03 * -21.74%*     6863.12 * 2.04%*

3. dbench (tmpfs)

Configured to run with up to ncpusx8 clients, it shows both latency and
throughput metrics. For the latency, with the exception of the 64 case,
there is really nothing to go by:
				     vanilla                dirty
Amean     latency-1          1.67 (   0.00%)        1.67 *   0.09%*
Amean     latency-2          2.15 (   0.00%)        2.08 *   3.36%*
Amean     latency-4          2.50 (   0.00%)        2.56 *  -2.27%*
Amean     latency-8          2.49 (   0.00%)        2.48 *   0.31%*
Amean     latency-16         2.69 (   0.00%)        2.72 *  -1.37%*
Amean     latency-32         2.96 (   0.00%)        3.04 *  -2.60%*
Amean     latency-64         7.78 (   0.00%)        8.17 *  -5.07%*
Amean     latency-512      186.91 (   0.00%)      186.41 *   0.27%*

For the dbench4 Throughput (misleading but traditional) there's a small
but rather constant improvement:

			     vanilla                dirty
Hmean     1        849.13 (   0.00%)      851.51 *   0.28%*
Hmean     2       1664.03 (   0.00%)     1663.94 *  -0.01%*
Hmean     4       3073.70 (   0.00%)     3104.29 *   1.00%*
Hmean     8       5624.02 (   0.00%)     5694.16 *   1.25%*
Hmean     16      9169.49 (   0.00%)     9324.43 *   1.69%*
Hmean     32     11969.37 (   0.00%)    12127.09 *   1.32%*
Hmean     64     15021.12 (   0.00%)    15243.14 *   1.48%*
Hmean     512    14891.27 (   0.00%)    15162.11 *   1.82%*

Measuring the dbench4 Per-VFS Operation latency, shows some very minor
differences within the noise level, around the 0-1% ranges.

Fixes: 49a7d46a06c3 ("powerpc: Implement smp_cond_load_relaxed()")
Acked-by: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Signed-off-by: Davidlohr Bueso &lt;dbueso@suse.de&gt;
Signed-off-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Link: https://lore.kernel.org/r/20210318204702.71417-1-dave@stgolabs.net
</content>
</entry>
</feed>
