<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/mm/percpu-km.c, branch linux-6.11.y</title>
<subtitle>Hosts the 0x221E Linux distro kernel.</subtitle>
<id>https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.11.y</id>
<link rel='self' href='https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.11.y'/>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/'/>
<updated>2021-07-04T18:30:17Z</updated>
<entry>
<title>percpu: flush tlb in pcpu_reclaim_populated()</title>
<updated>2021-07-04T18:30:17Z</updated>
<author>
<name>Dennis Zhou</name>
<email>dennis@kernel.org</email>
</author>
<published>2021-07-03T03:49:57Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=93274f1dd6b0a615b299beddf99871fe81f91275'/>
<id>urn:sha1:93274f1dd6b0a615b299beddf99871fe81f91275</id>
<content type='text'>
Prior to "percpu: implement partial chunk depopulation",
pcpu_depopulate_chunk() was called only on the destruction path. This
meant the virtual address range was on its way back to vmalloc which
will handle flushing the tlbs for us.

However, with pcpu_reclaim_populated(), we now call
pcpu_depopulate_chunk() during the active lifecycle of a chunk.
Therefore, we need to flush the TLB as well; otherwise we can end up
accessing the wrong page through a stale TLB mapping, as reported in
[1].

[1] https://lore.kernel.org/lkml/20210702191140.GA3166599@roeck-us.net/
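
The shape of the fix, as a hedged sketch (pcpu_post_unmap_tlb_flush()
wraps flush_tlb_kernel_range() in mm/percpu-vm.c; the exact wiring
into pcpu_reclaim_populated() is elided here):
--
/* After depopulating [page_start, page_end) of an isolated chunk,
 * flush the kernel mapping for that range across all units; vmalloc
 * no longer does this for us on this path. */
static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
                                      int page_start, int page_end)
{
        flush_tlb_kernel_range(
                pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),
                pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
}
--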

Fixes: f183324133ea ("percpu: implement partial chunk depopulation")
Reported-and-tested-by: Guenter Roeck &lt;linux@roeck-us.net&gt;
Signed-off-by: Dennis Zhou &lt;dennis@kernel.org&gt;
</content>
</entry>
<entry>
<title>percpu: rework memcg accounting</title>
<updated>2021-06-05T20:43:15Z</updated>
<author>
<name>Roman Gushchin</name>
<email>guro@fb.com</email>
</author>
<published>2021-06-03T01:09:31Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=faf65dde844affa9e360ccaa4bd231c2a04b87ea'/>
<id>urn:sha1:faf65dde844affa9e360ccaa4bd231c2a04b87ea</id>
<content type='text'>
The current implementation of memcg accounting for percpu memory is
based on the idea of having two separate sets of chunks for accounted
and non-accounted memory. This approach has the advantage of not
wasting any extra memory on memcg data for non-accounted chunks;
however, it complicates the code and leads to a higher number of
chunks due to lower chunk utilization.

Instead of having two chunk types, it's possible to declare all*
chunks memcg-aware unless kernel memory accounting is disabled
globally by a boot option. The size of objcg_array is usually small in
comparison to the chunks themselves (it obviously depends on the
number of CPUs), so even if some chunks have no accounted allocations,
the memory waste isn't significant and will likely be compensated by
higher chunk utilization. Also, with time, more and more percpu
allocations will likely become accounted.

* The first chunk is initialized before the memory cgroup subsystem
  is up, so we don't know for sure whether we'll need to allocate
  obj_cgroups. Because it's small, let's simply exempt it from
  accounting; then we don't need to allocate obj_cgroups for it.
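
As a rough sketch of the data this adds (hedged: the real struct
pcpu_chunk in mm/percpu-internal.h carries much more state; only the
memcg piece is shown):
--
struct pcpu_chunk {
        /* ... bitmap, hints, base_addr, etc. ... */
#ifdef CONFIG_MEMCG_KMEM
        /* one obj_cgroup pointer per allocated object; NULL for the
         * first chunk and when kmem accounting is disabled */
        struct obj_cgroup **obj_cgroups;
#endif
        /* ... */
};
--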

Signed-off-by: Roman Gushchin &lt;guro@fb.com&gt;
Signed-off-by: Dennis Zhou &lt;dennis@kernel.org&gt;
</content>
</entry>
<entry>
<title>percpu: implement partial chunk depopulation</title>
<updated>2021-04-21T18:17:40Z</updated>
<author>
<name>Roman Gushchin</name>
<email>guro@fb.com</email>
</author>
<published>2021-04-08T03:57:36Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=f183324133ea535db4127f9fad3e19725ca88bf3'/>
<id>urn:sha1:f183324133ea535db4127f9fad3e19725ca88bf3</id>
<content type='text'>
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.

The underlying problem is fragmentation: to release an underlying
chunk, all percpu allocations in it must be released first. The percpu
allocator tends to top up chunks to improve utilization, which means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost-filled old-ish chunks, effectively pinning them in memory.

This patchset solves the problem by implementing partial depopulation
of percpu chunks: chunks with many empty pages are asynchronously
depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:
--

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" &gt; percpu_test/cgroup.subtree_control

grep Percpu /proc/meminfo

for i in `seq 1 1000`; do
    mkdir percpu_test/cg_"${i}"
    for j in `seq 1 10`; do
        mkdir percpu_test/cg_"${i}"_"${j}"
    done
done

grep Percpu /proc/meminfo

for i in `seq 1 1000`; do
    for j in `seq 1 10`; do
        rmdir percpu_test/cg_"${i}"_"${j}"
    done
done

sleep 10

grep Percpu /proc/meminfo

for i in `seq 1 1000`; do
    rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes 10 out of every 11 (the
child cgroups). It prints the initial size of the percpu memory, the
size after creating all cgroups, and the size after deleting most of
them.

Results:
  vanilla:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481152 kB
    Percpu:           481152 kB

  with this patchset applied:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481408 kB
    Percpu:           135552 kB

The total size of the percpu memory was reduced by a factor of more
than 3.5.

This patch:

This patch implements partial depopulation of percpu chunks.

As of now, a chunk can be depopulated only as part of its final
destruction, when there are no more outstanding allocations. However,
to minimize memory waste it can be useful to depopulate a partially
filled chunk when a small number of outstanding allocations prevents
the chunk from being fully reclaimed.

This patch implements the following depopulation process: it scans
over the chunk's pages, looks for ranges of pages that are empty but
still populated, and depopulates them. To avoid races with new
allocations, the chunk is isolated beforehand. After the depopulation
the chunk is either sidelined to a special list or freed. New
allocations prefer active chunks over sidelined ones; if a sidelined
chunk is used, it is reintegrated into the active lists.

The depopulation is scheduled on the free path if all of the following
hold:
  1) the chunk has more than 1/4 of its total pages free and populated
  2) the system has enough free percpu pages aside from this chunk
  3) the chunk isn't the reserved chunk
  4) the chunk isn't the first chunk
A chunk that is already depopulated but has gained free populated
pages is a good target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk-&gt;isolated is set, and the balance work
item is scheduled. On isolation, the chunk's empty populated pages are
subtracted from pcpu_nr_empty_pop_pages. The chunk is moved back to
the to_depopulate_slot whenever it meets these qualifications again.

pcpu_reclaim_populated() iterates over the to_depopulate_slot until
it becomes empty. The depopulation is performed back to front to keep
populated pages close to the beginning of the chunk. Depopulated
chunks are sidelined so that new allocations preferentially avoid
them; only when no active chunk can satisfy a new allocation are
sidelined chunks checked, before a new chunk is created.
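
In outline, the reclaim side looks roughly like this (a simplified
sketch, not the exact mm/percpu.c code: locking, the page scan, and
the empty-page accounting are elided, and depopulate_empty_runs() and
chunk_is_unused() are hypothetical helpers standing in for the
back-to-front scan and the outstanding-allocation check):
--
static void pcpu_reclaim_populated(void)
{
        struct pcpu_chunk *chunk;

        /* Drain the isolation slot; new allocations never see it. */
        while ((chunk = list_first_entry_or_null(
                        &amp;pcpu_chunk_lists[pcpu_to_depopulate_slot],
                        struct pcpu_chunk, list))) {
                /* Scan back to front so populated pages stay packed
                 * at the start of the chunk. */
                depopulate_empty_runs(chunk);

                if (chunk_is_unused(chunk))
                        pcpu_destroy_chunk(chunk);
                else
                        list_move(&amp;chunk-&gt;list,
                                  &amp;pcpu_chunk_lists[pcpu_sidelined_slot]);
        }
}
--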

Signed-off-by: Roman Gushchin &lt;guro@fb.com&gt;
Co-developed-by: Dennis Zhou &lt;dennis@kernel.org&gt;
Signed-off-by: Dennis Zhou &lt;dennis@kernel.org&gt;
Tested-by: Pratik Sampat &lt;psampat@linux.ibm.com&gt;
Signed-off-by: Dennis Zhou &lt;dennis@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm: memcg/percpu: account percpu memory to memory cgroups</title>
<updated>2020-08-12T17:57:55Z</updated>
<author>
<name>Roman Gushchin</name>
<email>guro@fb.com</email>
</author>
<published>2020-08-12T01:30:17Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=3c7be18ac9a06bc67196bfdabb7c21e1bbacdc13'/>
<id>urn:sha1:3c7be18ac9a06bc67196bfdabb7c21e1bbacdc13</id>
<content type='text'>
Percpu memory is becoming more and more widely used by various
subsystems, and the total amount of memory controlled by the percpu
allocator can make up a good part of the total memory.

As an example, bpf maps can consume a lot of percpu memory, and they
are created by users. Also, some cgroup internals (e.g. memory
controller statistics) can be quite large. On a machine with many
CPUs and a big number of cgroups, they can consume hundreds of
megabytes.

So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to slab memory, percpu memory should be accounted
by default.

To implement the percpu accounting, it's possible to take the slab
memory accounting as a model to follow. Let's introduce two types of
percpu chunks: root and memcg. What makes memcg chunks different is
the additional space allocated to store memcg membership information.
If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be
used. If it's possible to charge the corresponding size to the target
memory cgroup, the allocation is performed and the memcg ownership
data is recorded. System-wide allocations are performed using root
chunks, so there is no additional memory overhead.
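
From a caller's point of view the split is driven entirely by the gfp
mask (a minimal usage sketch; __alloc_percpu_gfp() and alloc_percpu()
are the existing percpu entry points):
--
/* charged to the current memory cgroup, served from a memcg chunk */
void __percpu *counter = __alloc_percpu_gfp(sizeof(u64),
                __alignof__(u64), GFP_KERNEL | __GFP_ACCOUNT);

/* unaccounted, served from a root chunk */
u64 __percpu *stat = alloc_percpu(u64);
--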

To implement fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use the
obj_cgroup API introduced for slab accounting.

[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
  Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com

Signed-off-by: Roman Gushchin &lt;guro@fb.com&gt;
Signed-off-by: Bixuan Cui &lt;cuibixuan@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Dennis Zhou &lt;dennis@kernel.org&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Tobin C. Harding &lt;tobin@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Waiman Long &lt;longman@redhat.com&gt;
Cc: Bixuan Cui &lt;cuibixuan@huawei.com&gt;
Cc: Michal Koutný &lt;mkoutny@suse.com&gt;
Cc: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428</title>
<updated>2019-06-05T15:37:16Z</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2019-06-01T08:08:42Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=55716d26439f5c4008b0bcb7f17d1f7c0d8fbcfc'/>
<id>urn:sha1:55716d26439f5c4008b0bcb7f17d1f7c0d8fbcfc</id>
<content type='text'>
Based on 1 normalized pattern(s):

  this file is released under the gplv2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 68 file(s).

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Reviewed-by: Armijn Hemel &lt;armijn@tjaldur.nl&gt;
Reviewed-by: Allison Randal &lt;allison@lohutok.net&gt;
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE</title>
<updated>2019-03-13T19:25:31Z</updated>
<author>
<name>Dennis Zhou</name>
<email>dennis@kernel.org</email>
</author>
<published>2019-02-13T19:10:30Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=b239f7daf5530f562000bf55f02cc8028703f507'/>
<id>urn:sha1:b239f7daf5530f562000bf55f02cc8028703f507</id>
<content type='text'>
Previously, the block size was flexible based on the constraint that
GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) &gt; 1. However, this carried
the overhead that keeping a floating number of populated free pages
required scanning over the free regions of a chunk.

Setting the block size to be fixed at PAGE_SIZE lets us know when an
empty page becomes used, as we will break a full contig_hint of a
block. This means we no longer have to scan the whole chunk upon
breaking a contig_hint, which is what empty page management
piggybacked off. A later patch takes advantage of this to optimize the
allocation path by only scanning forward, using the scan_hint it also
introduces.
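
The resulting definitions are short (roughly as they appear in
include/linux/percpu.h; reproduced here as a sketch):
--
/* One bitmap block now describes exactly one page per unit. */
#define PCPU_BITMAP_BLOCK_SIZE          PAGE_SIZE
#define PCPU_BITMAP_BLOCK_BITS          (PCPU_BITMAP_BLOCK_SIZE &gt;&gt;    \
                                         PCPU_MIN_ALLOC_SHIFT)
--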

Signed-off-by: Dennis Zhou &lt;dennis@kernel.org&gt;
Reviewed-by: Peng Fan &lt;peng.fan@nxp.com&gt;
</content>
</entry>
<entry>
<title>percpu: km: no need to consider pcpu_group_offsets[0]</title>
<updated>2019-02-26T21:47:58Z</updated>
<author>
<name>Peng Fan</name>
<email>peng.fan@nxp.com</email>
</author>
<published>2019-02-24T13:13:50Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=1b046b445c0f856c3c1eed38a348bd87cc2dc730'/>
<id>urn:sha1:1b046b445c0f856c3c1eed38a348bd87cc2dc730</id>
<content type='text'>
percpu-km is used on UP systems, which have only one group, so the
group offset will always be 0; there is no need to subtract
pcpu_group_offsets[0] when assigning chunk-&gt;base_addr.
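
The change in pcpu_create_chunk() amounts to dropping a no-op
subtraction (sketched from the description above; percpu-km keeps all
units in one physically contiguous group):
--
/* before: subtracts a group offset that is always 0 on UP */
chunk-&gt;base_addr = page_address(pages) - pcpu_group_offsets[0];

/* after */
chunk-&gt;base_addr = page_address(pages);
--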

Signed-off-by: Peng Fan &lt;peng.fan@nxp.com&gt;
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Dennis Zhou &lt;dennis@kernel.org&gt;
</content>
</entry>
<entry>
<title>percpu: convert spin_lock_irq to spin_lock_irqsave.</title>
<updated>2018-12-18T17:04:08Z</updated>
<author>
<name>Dennis Zhou</name>
<email>dennis@kernel.org</email>
</author>
<published>2018-12-18T16:42:27Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=6ab7d47bcbf0144a8cb81536c2cead4cde18acfe'/>
<id>urn:sha1:6ab7d47bcbf0144a8cb81536c2cead4cde18acfe</id>
<content type='text'>
From Michael Cree:
  "Bisection lead to commit b38d08f3181c ("percpu: restructure
   locking") as being the cause of lockups at initial boot on
   the kernel built for generic Alpha.

   On a suggestion by Tejun Heo that:

   So, the only thing I can think of is that it's calling
   spin_unlock_irq() while irq handling isn't set up yet.
   Can you please try the followings?

   1. Convert all spin_[un]lock_irq() to
      spin_lock_irqsave/unlock_irqrestore()."
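
The conversion pattern, for reference (a generic sketch; pcpu_lock is
the allocator's real lock, and the critical sections are elided):
--
unsigned long flags;

/* before: spin_unlock_irq() unconditionally re-enables interrupts,
 * which is unsafe before irq handling is set up during early boot */
spin_lock_irq(&amp;pcpu_lock);
/* ... critical section ... */
spin_unlock_irq(&amp;pcpu_lock);

/* after: save and restore whatever the irq state actually was */
spin_lock_irqsave(&amp;pcpu_lock, flags);
/* ... critical section ... */
spin_unlock_irqrestore(&amp;pcpu_lock, flags);
--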

Fixes: b38d08f3181c ("percpu: restructure locking")
Reported-and-tested-by: Michael Cree &lt;mcree@orcon.net.nz&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Dennis Zhou &lt;dennis@kernel.org&gt;
</content>
</entry>
<entry>
<title>percpu: allow select gfp to be passed to underlying allocators</title>
<updated>2018-02-18T13:33:01Z</updated>
<author>
<name>Dennis Zhou</name>
<email>dennisszhou@gmail.com</email>
</author>
<published>2018-02-16T18:09:58Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=554fef1c39ee148623a496e04569dabb11463406'/>
<id>urn:sha1:554fef1c39ee148623a496e04569dabb11463406</id>
<content type='text'>
The prior patch added support for passing gfp flags through to the
underlying allocators. This patch allows users to pass along gfp flags
(currently only __GFP_NORETRY and __GFP_NOWARN) to the underlying
allocators. This should allow users to decide whether they are ok with
allocations failing, and to recover in a more graceful way.

Additionally, in the previous patch the gfp flags were passed as
additions on top of the internal flags. Instead, change this to
caller-passed semantics. GFP_KERNEL is also removed as the default
flag; it continues to be used only for internally triggered underlying
percpu allocations.
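
For example, a caller that would rather see a fast, quiet failure than
trigger reclaim can now ask for it directly (a usage sketch; size and
align are whatever the caller needs):
--
void __percpu *ptr;

/* may fail quickly and quietly under memory pressure */
ptr = __alloc_percpu_gfp(size, align,
                         GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
if (!ptr)
        return -ENOMEM;
--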

V2:
Removed gfp_percpu_mask in favor of doing it inline.
Removed GFP_KERNEL as a default flag for __alloc_percpu_gfp.

Signed-off-by: Dennis Zhou &lt;dennisszhou@gmail.com&gt;
Suggested-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>percpu: add __GFP_NORETRY semantics to the percpu balancing path</title>
<updated>2018-02-18T13:33:00Z</updated>
<author>
<name>Dennis Zhou</name>
<email>dennisszhou@gmail.com</email>
</author>
<published>2018-02-16T18:07:19Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=47504ee04b9241548ae2c28be7d0b01cff3b7aa6'/>
<id>urn:sha1:47504ee04b9241548ae2c28be7d0b01cff3b7aa6</id>
<content type='text'>
Percpu memory using the vmalloc area based chunk allocator lazily
populates chunks by first requesting the full virtual address space
required for the chunk and subsequently adding pages as allocations come
through. To ensure atomic allocations can succeed, a workqueue item is
used to maintain a minimum number of empty pages. In certain scenarios,
such as reported in [1], it is possible for physical memory to become
quite scarce, which can result in either a rather long time spent
trying to find free pages or, worse, a kernel panic.

This patch adds support for __GFP_NORETRY and __GFP_NOWARN, passing
them through to the underlying allocators. This should prevent any
unnecessary panics potentially caused by the workqueue item. The gfp
flags are passed as additional flags rather than as a full set of
flags; the next patch will change this to caller-passed semantics.
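
Inside the balance path this looks roughly like pinning the extra
flags once (a sketch of the additional-flags approach; the chunk
scanning and population calls are elided):
--
static void pcpu_balance_workfn(struct work_struct *work)
{
        /* added on top of GFP_KERNEL so a scarce-memory situation
         * fails fast instead of stalling or panicking */
        const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;

        /* ... find chunks short on empty populated pages and call
         * pcpu_populate_chunk(chunk, page_start, page_end, gfp) ... */
}
--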

V2:
Added const modifier to gfp flags in the balance path.
Removed an extra whitespace.

[1] https://lkml.org/lkml/2018/2/12/551

Signed-off-by: Dennis Zhou &lt;dennisszhou@gmail.com&gt;
Suggested-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
