diff options
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2025-07-31 14:57:54 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2025-07-31 14:57:54 -0700 |
| commit | beace86e61e465dba204a268ab3f3377153a4973 (patch) | |
| tree | 24f90cb26bf39eb7724326cdf3e8bffed7c05e50 /mm/internal.h | |
| parent | cbbf0a759ff96c80dfc32192a2cc427b79447f74 (diff) | |
| parent | af915c3c13b64d196d1c305016092f5da20942c4 (diff) | |
Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top or Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what is claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters. Rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
Diffstat (limited to 'mm/internal.h')
| -rw-r--r-- | mm/internal.h | 150 |
1 files changed, 92 insertions, 58 deletions
diff --git a/mm/internal.h b/mm/internal.h index 0f73ff13c212..1da16d550a45 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -149,7 +149,7 @@ static inline void *folio_raw_mapping(const struct folio *folio) { unsigned long mapping = (unsigned long)folio->mapping; - return (void *)(mapping & ~PAGE_MAPPING_FLAGS); + return (void *)(mapping & ~FOLIO_MAPPING_FLAGS); } /* @@ -202,94 +202,106 @@ static inline void vma_close(struct vm_area_struct *vma) /* Flags for folio_pte_batch(). */ typedef int __bitwise fpb_t; -/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */ -#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0)) +/* Compare PTEs respecting the dirty bit. */ +#define FPB_RESPECT_DIRTY ((__force fpb_t)BIT(0)) -/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */ -#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1)) +/* Compare PTEs respecting the soft-dirty bit. */ +#define FPB_RESPECT_SOFT_DIRTY ((__force fpb_t)BIT(1)) + +/* Compare PTEs respecting the writable bit. */ +#define FPB_RESPECT_WRITE ((__force fpb_t)BIT(2)) + +/* + * Merge PTE write bits: if any PTE in the batch is writable, modify the + * PTE at @ptentp to be writable. + */ +#define FPB_MERGE_WRITE ((__force fpb_t)BIT(3)) + +/* + * Merge PTE young and dirty bits: if any PTE in the batch is young or dirty, + * modify the PTE at @ptentp to be young or dirty, respectively. + */ +#define FPB_MERGE_YOUNG_DIRTY ((__force fpb_t)BIT(4)) static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) { - if (flags & FPB_IGNORE_DIRTY) + if (!(flags & FPB_RESPECT_DIRTY)) pte = pte_mkclean(pte); - if (likely(flags & FPB_IGNORE_SOFT_DIRTY)) + if (likely(!(flags & FPB_RESPECT_SOFT_DIRTY))) pte = pte_clear_soft_dirty(pte); - return pte_wrprotect(pte_mkold(pte)); + if (likely(!(flags & FPB_RESPECT_WRITE))) + pte = pte_wrprotect(pte); + return pte_mkold(pte); } /** - * folio_pte_batch - detect a PTE batch for a large folio + * folio_pte_batch_flags - detect a PTE batch for a large folio * @folio: The large folio to detect a PTE batch for. - * @addr: The user virtual address the first page is mapped at. - * @start_ptep: Page table pointer for the first entry. - * @pte: Page table entry for the first page. + * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL. + * @ptep: Page table pointer for the first entry. + * @ptentp: Pointer to a COPY of the first page table entry whose flags this + * function updates based on @flags if appropriate. * @max_nr: The maximum number of table entries to consider. * @flags: Flags to modify the PTE batch semantics. - * @any_writable: Optional pointer to indicate whether any entry except the - * first one is writable. - * @any_young: Optional pointer to indicate whether any entry except the - * first one is young. - * @any_dirty: Optional pointer to indicate whether any entry except the - * first one is dirty. * * Detect a PTE batch: consecutive (present) PTEs that map consecutive - * pages of the same large folio. + * pages of the same large folio in a single VMA and a single page table. * * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN, - * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and - * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY). + * the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set) + * and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set). + * + * @ptep must map any page of the folio. max_nr must be at least one and + * must be limited by the caller so scanning cannot exceed a single VMA and + * a single page table. * - * start_ptep must map any page of the folio. max_nr must be at least one and - * must be limited by the caller so scanning cannot exceed a single page table. + * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will + * be updated: it's crucial that a pointer to a COPY of the first + * page table entry, obtained through ptep_get(), is provided as @ptentp. + * + * This function will be inlined to optimize based on the input parameters; + * consider using folio_pte_batch() instead if applicable. * * Return: the number of table entries in the batch. */ -static inline int folio_pte_batch(struct folio *folio, unsigned long addr, - pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags, - bool *any_writable, bool *any_young, bool *any_dirty) +static inline unsigned int folio_pte_batch_flags(struct folio *folio, + struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp, + unsigned int max_nr, fpb_t flags) { - pte_t expected_pte, *ptep; - bool writable, young, dirty; - int nr, cur_nr; - - if (any_writable) - *any_writable = false; - if (any_young) - *any_young = false; - if (any_dirty) - *any_dirty = false; + bool any_writable = false, any_young = false, any_dirty = false; + pte_t expected_pte, pte = *ptentp; + unsigned int nr, cur_nr; VM_WARN_ON_FOLIO(!pte_present(pte), folio); VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio); VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio); + /* + * Ensure this is a pointer to a copy not a pointer into a page table. + * If this is a stack value, it won't be a valid virtual address, but + * that's fine because it also cannot be pointing into the page table. + */ + VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp))); /* Limit max_nr to the actual remaining PFNs in the folio we could batch. */ max_nr = min_t(unsigned long, max_nr, folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte)); - nr = pte_batch_hint(start_ptep, pte); + nr = pte_batch_hint(ptep, pte); expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags); - ptep = start_ptep + nr; + ptep = ptep + nr; while (nr < max_nr) { pte = ptep_get(ptep); - if (any_writable) - writable = !!pte_write(pte); - if (any_young) - young = !!pte_young(pte); - if (any_dirty) - dirty = !!pte_dirty(pte); - pte = __pte_batch_clear_ignored(pte, flags); - if (!pte_same(pte, expected_pte)) + if (!pte_same(__pte_batch_clear_ignored(pte, flags), expected_pte)) break; - if (any_writable) - *any_writable |= writable; - if (any_young) - *any_young |= young; - if (any_dirty) - *any_dirty |= dirty; + if (flags & FPB_MERGE_WRITE) + any_writable |= pte_write(pte); + if (flags & FPB_MERGE_YOUNG_DIRTY) { + any_young |= pte_young(pte); + any_dirty |= pte_dirty(pte); + } cur_nr = pte_batch_hint(ptep, pte); expected_pte = pte_advance_pfn(expected_pte, cur_nr); @@ -297,9 +309,19 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, nr += cur_nr; } + if (any_writable) + *ptentp = pte_mkwrite(*ptentp, vma); + if (any_young) + *ptentp = pte_mkyoung(*ptentp); + if (any_dirty) + *ptentp = pte_mkdirty(*ptentp); + return min(nr, max_nr); } +unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, + unsigned int max_nr); + /** * pte_move_swp_offset - Move the swap entry offset field of a swap pte * forward or backward by delta @@ -436,8 +458,7 @@ void zap_page_range_single_batched(struct mmu_gather *tlb, int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio, gfp_t gfp); -void page_cache_ra_order(struct readahead_control *, struct file_ra_state *, - unsigned int order); +void page_cache_ra_order(struct readahead_control *, struct file_ra_state *); void force_page_cache_ra(struct readahead_control *, unsigned long nr); static inline void force_page_cache_readahead(struct address_space *mapping, struct file *file, pgoff_t index, unsigned long nr_to_read) @@ -517,6 +538,16 @@ extern unsigned long highest_memmap_pfn; bool folio_isolate_lru(struct folio *folio); void folio_putback_lru(struct folio *folio); extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason); +#ifdef CONFIG_NUMA +int user_proactive_reclaim(char *buf, + struct mem_cgroup *memcg, pg_data_t *pgdat); +#else +static inline int user_proactive_reclaim(char *buf, + struct mem_cgroup *memcg, pg_data_t *pgdat) +{ + return 0; +} +#endif /* * in mm/rmap.c: @@ -821,7 +852,8 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align, int nid, bool exact_nid); void memmap_init_range(unsigned long, int, unsigned long, unsigned long, - unsigned long, enum meminit_context, struct vmem_altmap *, int); + unsigned long, enum meminit_context, struct vmem_altmap *, int, + bool); #if defined CONFIG_COMPACTION || defined CONFIG_CMA @@ -929,7 +961,7 @@ extern long populate_vma_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, int *locked); extern long faultin_page_range(struct mm_struct *mm, unsigned long start, unsigned long end, bool write, int *locked); -extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags, +extern bool mlock_future_ok(struct mm_struct *mm, vm_flags_t vm_flags, unsigned long bytes); /* @@ -1227,7 +1259,6 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long, unsigned long, unsigned long); extern void set_pageblock_order(void); -struct folio *alloc_migrate_folio(struct folio *src, unsigned long private); unsigned long reclaim_pages(struct list_head *folio_list); unsigned int reclaim_clean_pages_from_list(struct zone *zone, struct list_head *folio_list); @@ -1360,7 +1391,7 @@ int migrate_device_coherent_folio(struct folio *folio); struct vm_struct *__get_vm_area_node(unsigned long size, unsigned long align, unsigned long shift, - unsigned long flags, unsigned long start, + vm_flags_t vm_flags, unsigned long start, unsigned long end, int node, gfp_t gfp_mask, const void *caller); @@ -1605,6 +1636,9 @@ static inline void accept_page(struct page *page) int walk_page_range_mm(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, void *private); +int walk_page_range_debug(struct mm_struct *mm, unsigned long start, + unsigned long end, const struct mm_walk_ops *ops, + pgd_t *pgd, void *private); /* pt_reclaim.c */ bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval); |
