<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/include/linux/damon.h, branch linux-6.17.y</title>
<subtitle>Hosts the 0x221E linux distro kernel.</subtitle>
<id>https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.17.y</id>
<link rel='self' href='https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-6.17.y'/>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/'/>
<updated>2025-09-13T20:05:36Z</updated>
<entry>
<title>mm/damon/core: introduce damon_call_control-&gt;dealloc_on_cancel</title>
<updated>2025-09-13T20:05:36Z</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2025-09-08T20:15:12Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=e6a0deb6fa5b0fc134ee2aa127d1cfc9456d8445'/>
<id>urn:sha1:e6a0deb6fa5b0fc134ee2aa127d1cfc9456d8445</id>
<content type='text'>
Patch series "mm/damon/sysfs: fix refresh_ms control overwriting on
multi-kdamonds usages".

Automatic esssential DAMON/DAMOS status update feature of DAMON sysfs
interface (refresh_ms) is broken [1] for multiple DAMON contexts
(kdamonds) use case, since it uses a global single damon_call_control
object for all created DAMON contexts.  The fields of the object,
particularly the list field is over-written for the contexts and it makes
unexpected results including user-space hangup and kernel crashes [2]. 
Fix it by extending damon_call_control for the use case and updating the
usage on DAMON sysfs interface to use per-context dynamically allocated
damon_call_control object.


This patch (of 2):

When damon_call_control-&gt;repeat is set, damon_call() is executed
asynchronously, and is eventually canceled when kdamond finishes.  If the
damon_call_control object is dynamically allocated, finding the place to
deallocate the object is difficult.  Introduce a new damon_call_control
field, namely dealloc_on_cancel, to ask the kdamond deallocates those
dynamically allocated objects when those are canceled.

Link: https://lkml.kernel.org/r/20250908201513.60802-3-sj@kernel.org
Link: https://lkml.kernel.org/r/20250908201513.60802-2-sj@kernel.org
Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work")
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Cc: Yunjeong Mun &lt;yunjeong.mun@sk.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/damon/core: remove damon_callback</title>
<updated>2025-07-20T01:59:57Z</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2025-07-12T19:50:16Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=5add26c0a18636e8e9fe409d4591c8a36e1bf695'/>
<id>urn:sha1:5add26c0a18636e8e9fe409d4591c8a36e1bf695</id>
<content type='text'>
All damon_callback usages are replicated by damon_call() and damos_walk().
Time to say goodbye.  Remove damon_callback.

Link: https://lkml.kernel.org/r/20250712195016.151108-15-sj@kernel.org
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/damon/core: add cleanup_target() ops callback</title>
<updated>2025-07-20T01:59:56Z</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2025-07-12T19:50:11Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=7114bc5e01cf393e1fdc97e10399eb9451b6af45'/>
<id>urn:sha1:7114bc5e01cf393e1fdc97e10399eb9451b6af45</id>
<content type='text'>
Some DAMON operation sets may need additional cleanup per target.  For
example, [f]vaddr need to put pids of each target.  Each user and core
logic is doing that redundantly.  Add another DAMON ops callback that will
be used for doing such cleanups in operations set layer.

[sj@kernel.org: add kernel-doc comment for damon_operations-&gt;cleanup_target]
  Link: https://lkml.kernel.org/r/20250715185239.89152-2-sj@kernel.org
[sj@kernel.org: remove damon_ctx-&gt;callback kernel-doc comment]
  Link: https://lkml.kernel.org/r/20250715185239.89152-3-sj@kernel.org
Link: https://lkml.kernel.org/r/20250712195016.151108-10-sj@kernel.org
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/damon/core: introduce repeat mode damon_call()</title>
<updated>2025-07-20T01:59:54Z</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2025-07-12T19:50:04Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=43df7676e5508895c33b846d37cff9cf3b52674c'/>
<id>urn:sha1:43df7676e5508895c33b846d37cff9cf3b52674c</id>
<content type='text'>
damon_call() can be useful for reading or writing DAMON internal data for
one time.  A common pattern of DAMON core usage from DAMON modules is
doing such reads and writes repeatedly, for example, to periodically
update the DAMOS stats.  To do that with damon_call(), callers should call
damon_call() repeatedly, with their own delay loop.  Each caller doing
that is repetitive.  Introduce a repeat mode damon_call().  Callers can
use the mode by setting a new field in damon_call_control.  If the mode is
turned on, damon_call() returns success immediately, and DAMON repeats
invoking the callback function inside the kdamond main loop.

Link: https://lkml.kernel.org/r/20250712195016.151108-3-sj@kernel.org
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/damon: accept parallel damon_call() requests</title>
<updated>2025-07-20T01:59:54Z</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2025-07-12T19:50:03Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=004ded6bee11b8ed463cdc54b89a4390f4b64f6d'/>
<id>urn:sha1:004ded6bee11b8ed463cdc54b89a4390f4b64f6d</id>
<content type='text'>
Patch series "mm/damon: remove damon_callback".

damon_callback was the only way for communicating with DAMON for contexts
running on its worker thread.  The interface is flexible and simple.  But
as DAMON evolves with more features, damon_callback has become somewhat
too old.  With runtime parameters update, for example, its lack of
synchronization support was found to be inconvenient.  Arguably it is also
not easy to use correctly since the callers should understand when each
callback is called, and implication of the return values from the
callbacks.

To replace it, damon_call() and damos_walk() are introduced.  And those
replaced a few damon_callback use cases.  Some use cases of damon_callback
such as parallel or repetitive DAMON internal data reading and additional
cleanups cannot simply be replaced by damon_call() and damos_walk(),
though.

To allow those replaceable, extend damon_call() for parallel and/or
repeated callbacks and modify the core/ops layers for additional resources
cleanup.  With the updates, replace the remaining damon_callback usages
and finally say goodbye to damon_callback.


This patch (of 14):

Calling damon_call() while it is serving for another parallel thread
immediately fails with -EBUSY.  The caller should call it again, later. 
Each caller implementing such retry logic would be redundant.  Accept
parallel damon_call() requests and do the wait instead of the caller.

Link: https://lkml.kernel.org/r/20250712195016.151108-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250712195016.151108-2-sj@kernel.org
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/damon/core: add damos-&gt;migrate_dests field</title>
<updated>2025-07-20T01:59:48Z</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2025-07-09T00:59:32Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=aabc85ee33c883243f2c506a5d88963f2456faa6'/>
<id>urn:sha1:aabc85ee33c883243f2c506a5d88963f2456faa6</id>
<content type='text'>
Add a new field to 'struct damos', namely migrate_dests, to allow DAMON
API callers specify multiple migration destination nodes and their
weights.  Also update 'struct damos' creation and destruction functions
accordingly to initialize the new field and free up the API
caller-allocated buffers on those, respectively.

Link: https://lkml.kernel.org/r/20250709005952.17776-3-bijan311@gmail.com
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Bijan Tabatabai &lt;bijantabatab@micron.com&gt;
Reviewed-by: SeongJae Park &lt;sj@kernel.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Ravi Shankar Jonnalagadda &lt;ravis.opensrc@micron.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/damon: add struct damos_migrate_dests</title>
<updated>2025-07-20T01:59:48Z</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2025-07-09T00:59:31Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=a2c24eae5a15f79673eba2913d87d658a04830cf'/>
<id>urn:sha1:a2c24eae5a15f79673eba2913d87d658a04830cf</id>
<content type='text'>
Patch series "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold}
actions", v4.

A recent patchset automatically sets the interleave weight for each node
according to the node's maximum bandwidth [1].  In another thread, the
patch set's author, Joshua Hahn, wondered if/how thes weights should be
changed if the bandwidth utilization of the system changes [2].

This patch set adds the mechanism for dynamically changing how application
data is interleaved across nodes while leaving the policy of what the
interleave weights should be to userspace.  It does this by having the
migrate_{hot,cold} operating schemes interleave application data according
to the list of migration nodes and weights passed in via the DAMON sysfs
interface.  This functionality can be used to dynamically adjust how
folios are interleaved by having a userspace process adjust those weights.
If no specific destination nodes or weights are provided, the
migrate_{hot,cold} actions will only migrate folios to damos-&gt;target_nid
as before.

The algorithm used to interleave the folios is similar to the one used for
the weighted interleave mempolicy [3].  It uses the offset from which a
folio is mapped into a VMA to determine the node the folio should be
placed in.  This method is convenient because for a given set of
interleave weights, a folio has only one valid node it can be placed in,
limitng the amount of unnecessary data movement.  However, finding out how
a folio is mapped inside of a VMA requires a costly rmap walk when using a
paddr scheme.  As such, we have decided that this functionality makes more
sense as a vaddr scheme [4].  To this end, this patch set also adds vaddr
versions of the migrate_{hot,cold}.

Motivation
==========
There have been prior discussions about how changing the interleave
weights in response to the system's bandwidth utilization can be
beneficial [2].  However, currently the interleave weights only are
applied when data is allocated.  Migrating already allocated pages
according to the dynamically changing weights will better help balance the
bandwidth utilization across nodes.

As a toy example, imagine some application that uses 75% of the local
bandwidth.  Assuming sufficient capacity, when running alone, we want to
keep that application's data in local memory.  However, if a second
instance of that application begins, using the same amount of bandwidth,
it would be best to interleave the data of both processes to alleviate the
bandwidth pressure from the local node.  Likewise, when one of the
processes ends, the data should be moves back to local memory.

We imagine there would be a userspace application that would monitor
system performance characteristics, such as bandwidth utilization or
memory access latency, and uses that information to tune the interleave
weights.  Others seem to have come to a similar conclusion in previous
discussions [5].  We are currently working on a userspace program that
does this, but it is not quite ready to be published yet.

After the userspace application tunes the interleave weights, there must
be some mechanism that actually migrates pages to be consistent with those
weights.  This patchset is what provides this mechanism.

We believe DAMON is the correct venue for the interleaving mechanism for a
few reasons.  First, we noticed that we don't have to migrate all of the
application's pages to improve performance.  we just need to migrate the
frequently accessed pages.  DAMON's existing hotness traching is very
useful for this.  Second, DAMON's quota system can be used to ensure we
are not using too much bandwidth for migrations.  Finally, as Ying pointed
out [6], a complete solution must also handle when a memory node is at
capacity.  The existing migrate_cold action can be used in conjunction
with the functionality added in this patch set to provide that complete
solution.

Functionality Test
==================
Below is an example of this new functionality in use to confirm that these
patches behave as intended.

In this example, the user starts an application, alloc_data, which
allocates 1GB using the default memory policy (i.e.  allocate to local
memory) then sleeps.  Afterwards, we start DAMON to interleave the data at
a 1:1 ratio.  Using numastat, we show that DAMON has migrated the
application's data to match the new interleave ratio.

For this example, I modified the userspace damo tool [8] to write to the
migration_dest sysfs files.  I plan to upstream these changes when these
patches are merged.

  $ # Allocate the data initially
  $ ./alloc_data 1G &amp;
  [1] 6587
  $ numastat -c -p alloc_data

  Per-node process memory usage (in MBs) for PID 6587 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private    1027      0  1027
  -------  ------ ------ -----
  Total      1027      0  1027
  $ # Start DAMON to interleave data at a 1:1 ratio
  $ cat ./interleave_vaddr.yaml
  kdamonds:
  - contexts:
    - ops: vaddr
      addr_unit: null
      targets:
      - pid: 6587
        regions: []
      intervals:
        sample_us: 500 ms
        aggr_us: 5 s
        ops_update_us: 20 s
        intervals_goal:
          access_bp: 0 %
          aggrs: '0'
          min_sample_us: 0 ns
          max_sample_us: 0 ns
      nr_regions:
        min: '20'
        max: '50'
      schemes:
      - action: migrate_hot
        dests:
        - nid: 0
          weight: 1
        - nid: 1
          weight: 1
        access_pattern:
          sz_bytes:
            min: 0 B
            max: max
          nr_accesses:
            min: 0 %
            max: 100 %
          age:
            min: 0 ns
            max: max
  $ sudo ./damo/damo interleave_vaddr.yaml
  $ # Verify that DAMON has migrated data to match the 1:1 ratio
  $ numastat -c -p alloc_data

  Per-node process memory usage (in MBs) for PID 6587 (alloc_data)
           Node 0 Node 1 Total
           ------ ------ -----
  Huge          0      0     0
  Heap          0      0     0
  Stack         0      0     0
  Private     514    514  1027
  -------  ------ ------ -----
  Total       514    514  1027

Performance Test
================
Below is a simple example showing that interleaving application data using
these patches can improve application performance.  To do this, we run a
bandwidth intensive embedding reduction application [7].  This workload is
useful for this test because it reports the time it takes each iteration
to run and each iteration reuses the same allocation, allowing us to see
the benefits of the migration.

We evaluate this on a 128 core/256 thread AMD CPU with 72GB/s of local DDR
bandwidth and 26 GB/s of CXL bandwidth.

Before we start the workload, the system bandwidth utilization is low, so
we start with the interleave weights of 1:0, i.e.  allocating all data to
local memory.  When the workload beings, it saturates the local bandwidth,
making the page placement suboptimal.  To alleviate this, we modify the
interleave weights, triggering DAMON to migrate the workload's data.

We use the same interleave_vaddr.yaml file to setup DAMON, except we
configure it to begin with a 1:0 interleave ratio, and attach it to the
shell and its children processes.

  $ sudo ./damo/damo start interleave_vaddr.yaml --include_child_tasks &amp;
  $ &lt;path&gt;/eval_baseline -d amazon_All -c 255 -r 100
  &lt;clip startup output&gt;
  Eval Phase 3: Running Baseline...
  
  REPEAT # 0 Baseline Total time : 7323.54 ms
  REPEAT # 1 Baseline Total time : 7624.56 ms
  REPEAT # 2 Baseline Total time : 7619.61 ms
  REPEAT # 3 Baseline Total time : 7617.12 ms
  REPEAT # 4 Baseline Total time : 7638.64 ms
  REPEAT # 5 Baseline Total time : 7611.27 ms
  REPEAT # 6 Baseline Total time : 7629.32 ms
  REPEAT # 7 Baseline Total time : 7695.63 ms
  # Interleave weights set to 3:1
  REPEAT # 8 Baseline Total time : 7077.5 ms
  REPEAT # 9 Baseline Total time : 5633.23 ms
  REPEAT # 10 Baseline Total time : 5644.6 ms
  REPEAT # 11 Baseline Total time : 5627.66 ms
  REPEAT # 12 Baseline Total time : 5629.76 ms
  REPEAT # 13 Baseline Total time : 5633.05 ms
  REPEAT # 14 Baseline Total time : 5641.24 ms
  REPEAT # 15 Baseline Total time : 5631.18 ms
  REPEAT # 16 Baseline Total time : 5631.33 ms

Updating the interleave weights and having DAMON migrate the workload data
according to the weights resulted in an approximarely 25% speedup.

Patches Sequence
================
Patches 1-7 extend the DAMON API to specify multiple destination nodes and
weights for the migrate_{hot,cold} actions.  These patches are from SJ'S
RFC [8].

Patches 8-10 add a vaddr implementation of the migrate_{hot,cold} schemes.

Patch 11 modifies the vaddr migrate_{hot,cold} schemes to interleave data
according to the weights provided by damos-&gt;migrate_dest.

Patches 12-13 allow the vaddr migrate_{hot,cold} implementation to filter
out folios like the paddr version.


This patch (of 13):

Introduce a new struct, namely damos_migrate_dests, for specifying
multiple DAMOS' migration destination nodes and their weights.

Link: https://lkml.kernel.org/r/20250709005952.17776-1-bijan311@gmail.com
Link: https://lkml.kernel.org/r/20250709005952.17776-2-bijan311@gmail.com
Link: https://lore.kernel.org/linux-mm/20250520141236.2987309-1-joshua.hahnjy@gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/20250313155705.1943522-1-joshua.hahnjy@gmail.com/ [2]
Link: https://elixir.bootlin.com/linux/v6.15.4/source/mm/mempolicy.c#L2015 [3]
Link: https://lore.kernel.org/damon/20250624223310.55786-1-sj@kernel.org/ [4]
Link: https://lore.kernel.org/linux-mm/20250314151137.892379-1-joshua.hahnjy@gmail.com/ [5]
Link: https://lore.kernel.org/linux-mm/87frjfx6u4.fsf@DESKTOP-5N7EMDA/ [6]
Link: https://github.com/SNU-ARC/MERCI [7]
Link: https://lore.kernel.org/damon/20250702051558.54138-1-sj@kernel.org/ [8]
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Bijan Tabatabai &lt;bijantabatab@micron.com&gt;
Reviewed-by: SeongJae Park &lt;sj@kernel.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Ravi Shankar Jonnalagadda &lt;ravis.opensrc@micron.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/damon/sysfs: use DAMON core API damon_is_running()</title>
<updated>2025-07-20T01:59:44Z</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2025-07-05T17:49:58Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=d2b5be741a5045272b9d711908eab017632ac022'/>
<id>urn:sha1:d2b5be741a5045272b9d711908eab017632ac022</id>
<content type='text'>
DAMON core implements a static function to see if a given DAMON context is
running.  DAMON sysfs interface is implementing the same one on its own. 
Make the core function non-static and reuse it from the DAMON sysfs
interface.

Link: https://lkml.kernel.org/r/20250705175000.56259-5-sj@kernel.org
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/damon: fix minor typos in damon header</title>
<updated>2025-07-10T05:42:14Z</updated>
<author>
<name>Nathan Gao</name>
<email>zcgao@amazon.com</email>
</author>
<published>2025-06-18T16:33:31Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=f9550e1fcf3bef802c18fadc5c65485b66b28a63'/>
<id>urn:sha1:f9550e1fcf3bef802c18fadc5c65485b66b28a63</id>
<content type='text'>
Fix typos in include/linux/damon.h.

Link: https://lkml.kernel.org/r/20250618163331.54910-1-sj@kernel.org
Signed-off-by: Nathan Gao &lt;zcgao@amazon.com&gt;
Reviewed-by: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/damon/core: introduce damos quota goal metrics for memory node utilization</title>
<updated>2025-05-13T06:50:29Z</updated>
<author>
<name>SeongJae Park</name>
<email>sj@kernel.org</email>
</author>
<published>2025-04-20T19:40:24Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=0e1c773b501f33437d87b72c7d26080361e224b1'/>
<id>urn:sha1:0e1c773b501f33437d87b72c7d26080361e224b1</id>
<content type='text'>
Patch series "mm/damon: auto-tune DAMOS for NUMA setups including tiered
memory".


Utilizing DAMON for memory tiering usually requires manual tuning and/or
tedious controls.  Let it self-tune hotness and coldness thresholds for
promotion and demotion aiming high utilization of high memory tiers, by
introducing new DAMOS quota goal metrics representing the used and the
free memory ratios of specific NUMA nodes.  And introduce a sample DAMON
module that demonstrates how the new feature can be used for memory
tiering use cases.

Backgrounds
===========

A type of tiered memory system exposes the memory tiers as NUMA nodes.  A
straightforward pages placement strategy for such systems is placing
access-hot and cold pages on upper and lower tiers, reespectively,
pursuing higher utilization of upper tiers.  Since access temperature can
be dynamic, periodically finding and migrating hot pages and cold pages to
proper tiers (promoting and demoting) is also required.  Linux kernel
provides several features for such dynamic and transparent pages
placement.

Page Faults and LRU
-------------------

One widely known way is using NUMA balancing in tiering mode (a.k.a
NUMAB-2) and reclaim-based demotion features.  In the setup, NUMAB-2 finds
hot pages using access check-purpose page faults (a.k.a prot_none) and
promote those inside each process' context, until there is no more pages
to promote, or the upper tier is filled up and memory pressure happens. 
In the latter case, LRU-based reclaim logic wakes up as a response to the
memory pressure and demotes cold pages to lower tiers in asynchronous
(kswapd) and/or synchronous ways (direct reclaim).

DAMON
-----

Yet another available solution is using DAMOS with migrate_hot and
migrate_cold DAMOS actions for promotions and demotions, respectively.  To
make it optimum, users need to specify aggressiveness and access
temperature thresholds for promotions and demotions in a good balance that
results in high utilization of upper tiers.  The number of parameters is
not small, and optimum parameter values depend on characteristics of the
underlying hardware and the workload.  As a result, it often requires
manual, time consuming and repetitive tuning of the DAMOS schemes for
given workloads and systems combinations.

Self-tuned DAMON-based Memory Tiering
=====================================

To solve such manual tuning problems, DAMOS provides aim-oriented
feedback-driven quotas self-tuning.  Using the feature, we design a
self-tuned DAMON-based memory tiering for general multi-tier memory
systems.

For each memory tier node, if it has a lower tier, run a DAMOS scheme that
demotes cold pages of the node, auto-tuning the aggressiveness aiming an
amount of free space of the node.  The free space is for keeping the
headroom that avoids significant memory pressure during upper tier memory
usage spike, and promoting hot pages from the lower tier.

For each memory tier node, if it has an upper tier, run a DAMOS scheme
that promotes hot pages of the current node to the upper tier, auto-tuning
the aggressiveness aiming a high utilization ratio of the upper tier.  The
target ratio is to ensure higher tiers are utilized as much as possible. 
It should match with the headroom for demotion scheme, but have slight
overlap, to ensure promotion and demotion are not entirely stopped.

The aim-oriented aggressiveness auto-tuning of DAMOS is already available.
Hence, to make such tiering solution implementation, only new quota goal
metrics for utilization and free space ratio of specific NUMA node need to
be developed.

Discussions
===========

The design imposes below discussion points.

Expected Behaviors
------------------

The system will let upper tier memory node accommodates as many hot data
as possible.  If total amount of the data is less than the top tier
memory's promotion/demotion target utilization, entire data will be just
placed on the top tier.  Promotion scheme will do nothing since there is
no data to promote.  Demotion scheme will also do nothing since the free
space ratio of the top tier is higher than the goal.

Only if the amount of data is larger than the top tier's utilization
ratio, demotion scheme will demote cold pages and ensure the headroom free
space.  Since the promotion and demotion schemes for a single node has
small overlap at their target utilization and free space goals, promotions
and demotions will continue working with a moderate aggressiveness level. 
It will keep all data is placed on access hotness under dynamic access
pattern, while minimizing the migration overhead.

In any case, each node will keep headroom free space and as many upper
tiers are utilized as possible.

Ease of Use
-----------

Users still need to set the target utilization and free space ratio, but
it will be easier to set.  We argue 99.7 % utilization and 0.5 % free
space ratios can be good default values.  It can be easily adjusted based
on desired headroom size of given use case.  Users are also still required
to answer the minimum coldness and hotness thresholds.  Together with
monitoring intervals auto-tuning[2], DAMON will always show meaningful
amount of hot and cold memory.  And DAMOS quota's prioritization mechanism
will make good decision as long as the source information is that
colorful.  Hence, users can very naively set the minimum criterias.  We
believe any access observation and no access observation within last one
aggregation interval is enough for minimum hot and cold regions criterias.

General Tiered Memory Setup Applicability
-----------------------------------------

The design can be applied to any number of tiers having any performance
characteristics, as long as they can be hierarchical.  Hence, applying the
system to different tiered memory system will be straightforward.  Note
that this assumes only single CPU NUMA node case.  Because today's DAMON
is not aware of which CPU made each access, applying this on systems
having multiple CPU NUMA nodes can be complicated.  We are planning to
extend DAMON for the use case, but that's out of the scope of this patch
series.

How To Use
----------

Users can implement the auto-tuned DAMON-based memory tiering using DAMON
sysfs interface.  It can be easily done using DAMON user-space tool like
user-space tool.  Below evaluation results section shows an example DAMON
user-space tool command for that.

For wider and simpler deployment, having a kernel module that sets up and
run the DAMOS schemes via DAMON kernel API can be useful.  The module can
enable the memory tiering at boot time via kernel command line parameter
or at run time with single command.  This patch series implements a sample
DAMON kernel module that shows how such module can be implemented.

Comparison To Page Faults and LRU-based Approaches
--------------------------------------------------

The existing page faults based promotion (NUMAB-2) does hot pages
detection and migration in the process context.  When there are many pages
to promote, it can block the progress of the application's real works. 
DAMOS works in asynchronous worker thread, so it doesn't block the real
works.

NUMAB-2 doesn't provide a way to control aggressiveness of promotion other
than the maximum amount of pages to promote per given time widnow.  If hot
pages are found, promotions can happen in the upper-bound speed,
regardless of upper tier's memory pressure.  If the maximum speed is not
well set for the given workload, it can result in slow promotion or
unnecessary memory pressure.  Self-tuned DAMON-based memory tiering
alleviates the problem by adjusting the speed based on current utilization
of the upper tier.

LRU-based demotion can be triggered in both asynchronous (kswapd) and
synchronous (direct reclaim) ways.  Other than the way of finding cold
pages, asynchronous LRU-based demotion and DAMON-based demotion has no big
difference.  DAMON-based demotion can make a better balancing with
DAMON-based promotion, though.  The LRU-based demotion can do better than
DAMON-based demotion when the tier is having significant memory pressure. 
It would be wise to use DAMON-based demotion as a proactive and primary
one, but utilizing LRU-based demotions together as a fast backup solution.

Evaluation
==========

In short, under a setup that requires fast and frequent promotions,
self-tuned DAMON-based memory tiering's hot pages promotion improves
performance about 4.42 %.  We believe this shows self-tuned DAMON-based
promotion's effectiveness.  Meanwhile, NUMAB-2's hot pages promotion
degrades the performance about 7.34 %.  We suspect the degradation is
mostly due to NUMAB-2's synchronous nature that can block the
application's progress, which highlights the advantage of DAMON-based
solution's asynchronous nature.

Note that the test was done with the RFC version of this patch series.  We
don't run it again since this patch series got no meaningful change after
the RFC, while the test takes pretty long time.

Setup
-----

Hardware.  Use a machine that equips 250 GiB DRAM memory tier and 50 GiB
CXL memory tier.  The tiers are exposed as NUMA nodes 0 and 1,
respectively.

Kernel.  Use Linux kernel v6.13 that modified as following.  Add all DAMON
patches that available on mm tree of 2025-03-15, and this patch series. 
Also modify it to ignore mempolicy() system calls, to avoid bad effects
from application's traditional NUMA systems assumed optimizations.

Workload.  Use a modified version of Taobench benchmark[3] that available
on DCPerf benchmark suite.  It represents an in-memory caching workload. 
We set its 'memsize', 'warmup_time', and 'test_time' parameter as 340 GiB,
2,500 seconds and 1,440 seconds.  The parameters are chosen to ensure the
workload uses more than DRAM memory tier.  Its RSS under the parameter
grows to 270 GiB within the warmup time.

It turned out the workload has a very static access pattrn.  Only about 13
% of the RSS is frequently accessed from the beginning to end.  Hence
promotion shows no meaningful performance difference regardless of
different design and implementations.  We therefore modify the kernel to
periodically demote up to 10 GiB hot pages and promote up to 10 GiB cold
pages once per minute.  The intention is to simulate periodic access
pattern changes.  The hotness and coldness threshold is very naively set
so that it is more like random access pattern change rather than strict
hot/cold pages exchange.  This is why we call the workload as "modified". 
It is implemented as two DAMOS schemes each running on an asynchronous
thread.  It can be reproduced with DAMON user-space tool like below.

    # ./damo start \
        --ops paddr --numa_node 0 --monitoring_intervals 10s 200s 200s \
            --damos_action migrate_hot 1 \
            --damos_quota_interval 60s --damos_quota_space 10G \
        --ops paddr --numa_node 1 --monitoring_intervals 10s 200s 200s \
            --damos_action migrate_cold 0 \
            --damos_quota_interval 60s --damos_quota_space 10G \
        --nr_schemes 1 1 --nr_targets 1 1 --nr_ctxs 1 1

System configurations.  Use below variant system configurations.

- Baseline.  No memory tiering features are turned on.
- Numab_tiering.  On the baseline, enable NUMAB-2 and relcaim-based
  demotion.  In detail, following command is executed:
  echo 2 &gt; /proc/sys/kernel/numa_balancing;
  echo 1 &gt; /sys/kernel/mm/numa/demotion_enabled;
  echo 7 &gt; /proc/sys/vm/zone_reclaim_mode
- DAMON_tiering.  On the baseline, utilize self-tuned DAMON-based memory
  tiering implementation via DAMON user-space tool.  It utilizes two
  kernel threads, namely promotion thread and demotion thread.  Demotion
  thread monitors access pattern of DRAM node using DAMON with
  auto-tuned monitoring intervals aiming 4% DAMON-observed access ratio,
  and demote coldest pages up to 200 MiB per second aiming 0.5% free
  space of DRAM node.  Promotion thread monitors CXL node using same
  intervals auto-tuning, and promote hot pages in same way but aiming
  for 99.7% utilization of DRAM node.  Because DAMON provides only
  best-effort accuracy, add young page DAMOS filters to allow only and
  reject all young pages at promoting and demoting, respectively.  It
  can be reproduced with DAMON user-space tool like below.

    # ./damo start \
        --numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
            --damos_action migrate_cold 1 --damos_access_rate 0% 0% \
            --damos_apply_interval 1s \
            --damos_quota_interval 1s --damos_quota_space 200MB \
            --damos_quota_goal node_mem_free_bp 0.5% 0 \
            --damos_filter reject young \
        --numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
            --damos_action migrate_hot 0 --damos_access_rate 5% max \
            --damos_apply_interval 1s \
            --damos_quota_interval 1s --damos_quota_space 200MB \
            --damos_quota_goal node_mem_used_bp 99.7% 0 \
            --damos_filter allow young \
            --damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
        --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1

Measurment Results
------------------

On each system configuration, run the modified version of Taobench and
collect 'score'.  'score' is a metric that calculated and provided by
Taobench to represents the performance of the run on the system.  To
handle the measurement errors, repeat the measurement five times.  The
results are as below.

    Config         Score   Stdev   (%)     Normalized
    Baseline       1.6165  0.0319  1.9764  1.0000
    Numab_tiering  1.4976  0.0452  3.0209  0.9264
    DAMON_tiering  1.6881  0.0249  1.4767  1.0443

'Config' column shows the system config of the measurement.  'Score'
column shows the 'score' measurement in average of the five runs on the
system config.  'Stdev' column shows the standsard deviation of the five
measurements of the scores.  '(%)' column shows the 'Stdev' to 'Score'
ratio in percentage.  Finally, 'Normalized' column shows the averaged
score values of the configs that normalized to that of 'Baseline'.

The periodic hot pages demotion and cold pages promotion that was
conducted to simulate dynamic access pattern was started from the
beginning of the workload.  It resulted in the DRAM tier utilization
always under the watermark, and hence no real demotion was happened for
all test runs.  This means the above results show no difference between
LRU-based and DAMON-based demotions.  Only difference between NUMAB-2 and
DAMON-based promotions are represented on the results.

Numab_tiering config degraded the performance about 7.36 %.  We suspect
this happened because NUMAB-2's synchronous promotion was blocking the
Taobench's real work progress.

DAMON_tiering config improved the performance about 4.43 %.  We believe
this shows effectiveness of DAMON-based promotion that didn't block
Taobench's real work progress due to its asynchronous nature.  Also this
means DAMON's monitoring results are accurate enough to provide visible
amount of improvement.

Evaluation Limitations
----------------------

As mentioned above, this evaluation shows only comparison of promotion
mechanisms.  DAMON-based tiering is recommended to be used together with
reclaim-based demotion as a faster backup under significant memory
pressure, though.

From some perspective, the modified version of Taobench may seems making
the picture distorted too much.  It would be better to evaluate with more
realistic workload, or more finely tuned micro benchmarks.

Patch Sequence
==============

The first patch (patch 1) implements two new quota goal metrics on core
layer and expose it to DAMON core kernel API.  The second and third ones
(patches 2 and 3) further link it to DAMON sysfs interface.  Three
following patches (patches 4-6) document the new feature and sysfs file on
design, usage, and ABI documents.  The final one (patch 7) implements a
working version of a self-tuned DAMON-based memory tiering solution in an
incomplete but easy to understand form as a kernel module under
samples/damon/ directory.

References
==========

[1] https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org/
[2] https://lore.kernel.org/20250303221726.484227-1-sj@kernel.org
[3] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md


This patch (of 7):

Used and free space ratios for specific NUMA nodes can be useful inputs
for NUMA-specific DAMOS schemes' aggressiveness self-tuning feedback loop.
Implement DAMOS quota goal metrics for such self-tuned schemes.

Link: https://lkml.kernel.org/r/20250420194030.75838-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250420194030.75838-2-sj@kernel.org
Signed-off-by: SeongJae Park &lt;sj@kernel.org&gt;
Cc: Yunjeong Mun &lt;yunjeong.mun@sk.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
