| author | Baolin Wang <baolin.wang@linux.alibaba.com> | 2026-02-09 22:07:24 +0800 |
|---|---|---|
| committer | Andrew Morton <akpm@linux-foundation.org> | 2026-02-12 15:43:00 -0800 |
| commit | 52e054f7184097bea009963e033cdd54af7bf8a2 (patch) | |
| tree | e58a3466ac379f7b5aa390e97fccc372a258ce3a /mm | |
| parent | f615cc92641a403d354c6ee68263074a86de49c7 (diff) | |
mm: rmap: support batched checks of the references for large folios
Patch series "support batch checking of references and unmapping for large
folios", v6.
Currently, folio_referenced_one() always checks the young flag for each
PTE sequentially, which is inefficient for large folios. This
inefficiency is especially noticeable when reclaiming clean file-backed
large folios, where folio_referenced() is observed as a significant
performance hotspot.
Moreover, the Arm architecture, which supports contiguous PTEs, already has an optimization to clear the young flags for PTEs within a contiguous range. However, this is not sufficient: we can extend it to perform batched operations across the entire large folio (which might exceed a single contiguous range of CONT_PTE_SIZE).
Similar to folio_referenced_one(), we can also apply batched unmapping for
large file folios to optimize the performance of file folio reclamation.
By supporting batched checking of the young flags, flushing of TLB entries, and unmapping, I observed significant performance improvements in my tests of file folio reclamation. Please check the performance data in the commit message of each patch.
This patch (of 5):
Currently, folio_referenced_one() always checks the young flag for each
PTE sequentially, which is inefficient for large folios. This
inefficiency is especially noticeable when reclaiming clean file-backed
large folios, where folio_referenced() is observed as a significant
performance hotspot.
Moreover, the Arm64 architecture, which supports contiguous PTEs, already has an optimization to clear the young flags for PTEs within a contiguous range. However, this is not sufficient: we can extend it to perform batched operations across the entire large folio (which might exceed a single contiguous range of CONT_PTE_SIZE).
Introduce a new API, clear_flush_young_ptes(), to facilitate batched checking of the young flags and flushing of TLB entries, thereby improving performance during large folio reclamation. Architectures that implement a more efficient batched operation will override it in the following patches.
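For context when reading the log, here is a minimal sketch of what the generic fallback could look like; the exact definition and its placement (likely include/linux/pgtable.h) are assumptions based on the description above, not the verbatim patch:

```c
/*
 * Hedged sketch of a generic clear_flush_young_ptes() fallback: loop
 * over nr PTEs, test-and-clear the young bit and flush the TLB entry
 * for each one, and OR the results together.  Architectures with a
 * cheaper batched primitive (e.g. arm64 contiguous PTEs) are expected
 * to override this later in the series.
 */
#ifndef clear_flush_young_ptes
static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
		unsigned long addr, pte_t *ptep, unsigned int nr)
{
	int young = 0;

	for (; nr; nr--, ptep++, addr += PAGE_SIZE)
		young |= ptep_clear_flush_young(vma, addr, ptep);

	return young;
}
#endif
```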
While we are at it, rename ptep_clear_flush_young_notify() to clear_flush_young_ptes_notify() to indicate that it is a batched operation.
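The renamed helper presumably keeps the shape of the old ptep_clear_flush_young_notify() macro in include/linux/mmu_notifier.h, with the notifier range widened from one page to nr pages; a hedged sketch, not the verbatim patch:

```c
/*
 * Sketch of the renamed notify helper, assuming it mirrors the old
 * ptep_clear_flush_young_notify() macro and simply extends the
 * mmu-notifier range to cover all nr pages of the batch.
 */
#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr)	\
({									\
	int __young;							\
	struct vm_area_struct *___vma = __vma;				\
	unsigned long ___address = __address;				\
	__young = clear_flush_young_ptes(___vma, ___address, __ptep,	\
					 __nr);				\
	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
						  ___address,		\
						  ___address +		\
						  (__nr) * PAGE_SIZE);	\
	__young;							\
})
```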
Link: https://lkml.kernel.org/r/cover.1770645603.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/12132694536834262062d1fb304f8f8a064b6750.1770645603.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diffstat (limited to 'mm')
| -rw-r--r-- | mm/rmap.c | 28 |
1 file changed, 25 insertions(+), 3 deletions(-)
```diff
diff --git a/mm/rmap.c b/mm/rmap.c
index ab099405151f..3dbc2c4e02dc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -913,9 +913,11 @@ static bool folio_referenced_one(struct folio *folio,
 	struct folio_referenced_arg *pra = arg;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int ptes = 0, referenced = 0;
+	unsigned int nr;
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
+		nr = 1;
 
 		if (vma->vm_flags & VM_LOCKED) {
 			ptes++;
@@ -960,9 +962,21 @@ static bool folio_referenced_one(struct folio *folio,
 			if (lru_gen_look_around(&pvmw))
 				referenced++;
 		} else if (pvmw.pte) {
-			if (ptep_clear_flush_young_notify(vma, address,
-						pvmw.pte))
+			if (folio_test_large(folio)) {
+				unsigned long end_addr = pmd_addr_end(address, vma->vm_end);
+				unsigned int max_nr = (end_addr - address) >> PAGE_SHIFT;
+				pte_t pteval = ptep_get(pvmw.pte);
+
+				nr = folio_pte_batch(folio, pvmw.pte,
+						     pteval, max_nr);
+			}
+
+			ptes += nr;
+			if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
 				referenced++;
+			/* Skip the batched PTEs */
+			pvmw.pte += nr - 1;
+			pvmw.address += (nr - 1) * PAGE_SIZE;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 			if (pmdp_clear_flush_young_notify(vma, address,
 						pvmw.pmd))
@@ -972,7 +986,15 @@ static bool folio_referenced_one(struct folio *folio,
 			WARN_ON_ONCE(1);
 		}
 
-		pra->mapcount--;
+		pra->mapcount -= nr;
+		/*
+		 * If we are sure that we batched the entire folio,
+		 * we can just optimize and stop right here.
+		 */
+		if (ptes == pvmw.nr_pages) {
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
 	}
 
 	if (referenced)
```
