| Age | Commit message (Collapse) | Author |
|
Rename some functions in gaccess.c to add a _gva or _gpa suffix to
indicate whether the function accepts a virtual or a guest-absolute
address.
This makes it easier to understand the code.
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Enable KVM_GENERIC_MMU_NOTIFIER, for now with empty placeholder callbacks.
Also enable KVM_MMU_LOCKLESS_AGING and define KVM_HAVE_MMU_RWLOCK.
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Pass the gmap explicitly as parameter, instead of just using
vsie_page->gmap.
This will be used in upcoming patches.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
If uv_convert_from_secure_pte() fails, the page becomes unusable by the
host. The failure can only occour in case of hardware malfunction or a
serious KVM bug.
When the unusable page is reused, the system can have issues and
hang.
Print a warning to aid debugging such unlikely scenarios.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Export __make_folio_secure() and s390_wiggle_split_folio(), as they will
be needed to be used by KVM.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Introduce import_lock to avoid future races when converting pages to
secure.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Add gmap_helper_set_unused() to mark userspace ptes as unused.
Core mm code will use that information to discard unused pages instead
of attempting to swap them.
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Tested-by: Nico Boehr <nrb@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Move the sske_frame() function to asm/pgtable.h, so it can be used in
other modules too.
Opportunistically convert the .insn opcode specification to the
appropriate mnemonic.
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
uv_destroy_folio() and uv_convert_from_secure_folio() should work on
all pages in the folio, not just the first one.
This was fine until now, but it will become a problem with upcoming
patches.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Add P bit in hardware definition of region 3 and segment table entries.
Move union vaddress from kvm/gaccess.c to asm/dat_bits.h
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
Move the pgste lock and unlock functions back into mm/pgtable.c and
duplicate them in mm/gmap_helpers.c to avoid function name collisions
later on.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux into soc/dt
SoCFPGA DTS updates for v6.20, version 3
- dt-bindings updates:
- Add intel,socfpga-agilex5-socdk-modular for the Agilex5 mod board
- Add intel,socfpga-agilex-emmc for the Agilex eMMC daughter board
- Move entries in intel,socfpga.yaml into altera.yaml
- Add syscon as a fallback for sys-mgr
- Add dma-cohrerent property for Agilex5 NAND and DMA
- Add support for the Agilex5 modular board
- Add IOMMUS property for ethernet nodes for Agilex5
- Use lowercase hex for dts files
- Add #address-cells and #size-cells for sram
- Fix dtbs_check warning for fpga-region
- Move dma controller node for Agilex5 under simple-bus
- Add support for the Agilex eMMC daughter board
* tag 'socfpga_dts_updates_for_v6.20_v3' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
dt-bindings: intel: Add Agilex eMMC support
arm64: dts: socfpga: agilex: add emmc support
arm64: dts: intel: agilex5: Add simple-bus node on top of dma controller node
ARM: dts: socfpga: fix dtbs_check warning for fpga-region
ARM: dts: socfpga: add #address-cells and #size-cells for sram node
dt-bindings: altera: document syscon as fallback for sys-mgr
arm64: dts: altera: Use lowercase hex
dt-bindings: arm: altera: combine Intel's SoCFPGA into altera.yaml
arm64: dts: socfpga: agilex5: Add IOMMUS property for ethernet nodes
arm64: dts: socfpga: agilex5: add support for modular board
dt-bindings: intel: Add Agilex5 SoCFPGA modular board
arm64: dts: socfpga: agilex5: Add dma-coherent property
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
The LOAD_DISPLAY (LDD) X'9F' is still accepted by the Virtual Tape
Server (VTS) but does not perform any action.
Remove all functions and definitions related to this command.
The tape_34xx_ioctl() function is also removed as it was mainly used to
handle additional ioctl functionality. LOAD_DISPLAY was the only left
case. All other ioctls are handled in tapechar_ioctl().
With LOAD_DISPLAY, the remaining definitions in asm/tape390.h are gone.
Delete the file.
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Physical 3590/3592 tape models are not supported anymore for a very long
time. The Virtual Tape Server (VTS) emulates and presents only 3490E
models to the host. This is the only supported model and storage server.
Remove the entire code base for 3590/3592 models as it can be considered
dead code for quite some time already.
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The comment in hyperv_cleanup() became out-of-date as a result of
commit c8ed0812646e ("x86/hyperv: Use direct call to hypercall-page").
Update the comment. No code or functional change.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
When running with a paravisor and SEV-SNP, the GHCB page is provided
by the paravisor instead of being allocated by Linux. The provided page
is normal memory, but is outside of the physical address space seen by
Linux. As such it cannot be accessed via the kernel's direct map, and
must be explicitly mapped to a kernel virtual address.
Current code uses ioremap_cache() and iounmap() to map and unmap the page.
These functions are for use on I/O address space that may not behave as
normal memory, so they generate or expect addresses with the __iomem
attribute. For normal memory, the preferred functions are memremap() and
memunmap(), which operate similarly but without __iomem.
At the time of the original work on CoCo VMs on Hyper-V, memremap() did not
support creating a decrypted mapping, so ioremap_cache() was used instead,
since I/O address space is always mapped decrypted. memremap() has since
been enhanced to allow decrypted mappings, so replace ioremap_cache() with
memremap() when mapping the GHCB page. Similarly, replace iounmap() with
memunmap(). As a side benefit, the replacement cleans up 'sparse' warnings
about __iomem mismatches.
The replacement is done to use the correct functions as long-term goodness
and to clean up the sparse warnings. No runtime bugs are fixed.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311111925.iPGGJik4-lkp@intel.com/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
hv_root_crash_init() is not setting up the hypervisor crash collection
for baremetal cases because when it's called, hypervisor page is not
setup.
Fix is simple, just move the crash init call after the hypercall
page setup.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Fix a compiler warning that status is defined by not used.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512301641.FC6OAbGM-lkp@intel.com/
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Use scoped for-each loop when iterating over device nodes to make code a
bit simpler.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Link: https://patch.msgid.link/20260109-of-for-each-compatible-scoped-v3-5-c22fa2c0749a@oss.qualcomm.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
Use scoped for-each loop when iterating over device nodes to make code a
bit simpler.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Link: https://patch.msgid.link/20260109-of-for-each-compatible-scoped-v3-4-c22fa2c0749a@oss.qualcomm.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
Use scoped for-each loop when iterating over device nodes to make code a
bit simpler.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260109-of-for-each-compatible-scoped-v3-3-c22fa2c0749a@oss.qualcomm.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
Use scoped for-each loop when iterating over device nodes to make code a
bit simpler.
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Link: https://patch.msgid.link/20260109-of-for-each-compatible-scoped-v3-2-c22fa2c0749a@oss.qualcomm.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
|
|
When none of the allowed CPUs of a task are online, it gets migrated
to the fallback cpumask which is all the non nohz_full CPUs.
However just like nohz_full CPUs, domain isolated CPUs don't want to be
disturbed by tasks that have lost their CPU affinities.
And since nohz_full rely on domain isolation to work correctly, the
housekeeping mask of domain isolated CPUs should always be a subset of
the housekeeping mask of nohz_full CPUs (there can be CPUs that are
domain isolated but not nohz_full, OTOH there shouldn't be nohz_full
CPUs that are not domain isolated):
HK_TYPE_DOMAIN & HK_TYPE_KERNEL_NOISE == HK_TYPE_DOMAIN
Therefore use HK_TYPE_DOMAIN as the appropriate fallback target for
tasks. Note that cpuset isolated partitions are not supported on those
systems and may result in undefined behaviour.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
|
|
Currently, if the command line passed to kexec_file_load() exceeds
the supported limit of the kernel being kexec'd, -EINVAL is returned
to userspace, which is consistent across architectures. Since
-EINVAL is not specific to this case, the kexec tool cannot provide
a specific reason for the failure. Many architectures emit an error
message in this case. Add a similar error message, including the
effective limit, since the command line length is configurable.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Enable BLK_DEV_NULL_BLK as module in defconfig and debug_defconfig, so the
Null Test Block Device Driver can be easily used for testing purposes.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
Read-mostly sched_clock_irqtime may share the same cacheline with
frequently updated nohz struct. Make it as static_key to avoid
false sharing issue.
The only user of disable_sched_clock_irqtime()
is tsc_.*mark_unstable() which may be invoked under atomic context
and require a workqueue to disable static_key. But both of them
calls clear_sched_clock_stable() just before doing
disable_sched_clock_irqtime(). We can reuse
"sched_clock_work" to also disable sched_clock_irqtime().
One additional case need to handle is if the tsc is marked unstable
before late_initcall() phase, sched_clock_work will not be invoked
and sched_clock_irqtime will stay enabled although clock is unstable:
tsc_init()
enable_sched_clock_irqtime() # irqtime accounting is enabled here
...
if (unsynchronized_tsc()) # true
mark_tsc_unstable()
clear_sched_clock_stable()
__sched_clock_stable_early = 0;
...
if (static_key_count(&sched_clock_running.key) == 2)
# Only happens at sched_clock_init_late()
__clear_sched_clock_stable(); # Never executed
...
# late_initcall() phase
sched_clock_init_late()
if (__sched_clock_stable_early) # Already false
__set_sched_clock_stable(); # sched_clock is never marked stable
# TSC unstable, but sched_clock_work won't run to disable irqtime
So we need to disable_sched_clock_irqtime() in sched_clock_init_late()
if clock is unstable.
Reported-by: Benjamin Lei <benjamin.lei@intel.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260127072509.2627346-1-wangyang.guo@intel.com
|
|
Update to avoid conflicts with /urgent patches.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
|
|
alloc_gcs() returns an error-encoded pointer on failure, which comes
from do_mmap(), not NULL.
The current NULL check fails to detect errors, which could lead to using
an invalid GCS address.
Use IS_ERR_VALUE() to properly detect errors, consistent with the
check in gcs_alloc_thread_stack().
Fixes: b57180c75c7e ("arm64/gcs: Implement shadow stack prctl() interface")
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
R_MIPS_PC32 is a GNU extension, its definition is available in glibc
only since 2.39 (released in 2024), and not available in musl libc yet.
Provide our own definition for R_MIPS_PC32 and use it if necessary to
fix relocs tool building on musl and older glibc systems.
Fixes: ff79d31eb536 ("mips: Add support for PC32 relocations in vmlinux")
Signed-off-by: Yao Zi <me@ziyao.cc>
Link: https://patch.msgid.link/20260202041610.61389-1-me@ziyao.cc
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
|
|
The original patch inverted the PTE unconditionally to avoid
L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level
paging.
Adjust the logic to use the flip_protnone_guard() helper, which is a nop
on 2-level paging but inverts the address bits in all other paging modes.
This doesn't matter for the Xen aspect of the original change. Linux no
longer supports running 32bit PV under Xen, and Xen doesn't support
running any 32bit PV guests without using PAE paging.
Link: https://lkml.kernel.org/r/20260126211046.2096622-1-andrew.cooper3@citrix.com
Fixes: b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs")
Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Closes: https://lore.kernel.org/lkml/CAKFNMokwjw68ubYQM9WkzOuH51wLznHpEOMSqtMoV1Rn9JV_gw@mail.gmail.com/
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jann Horn <jannh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The comment for dc_l2tlb_miss incorrectly describes it as a "hit".
This contradicts the field name and the actual bit definition.
Fix the comment to correctly describe it as a "miss".
Signed-off-by: Xiang-Bin Shi <eric91102091@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260123075719.160734-1-eric91102091@gmail.com
|
|
The NV stage-2 manipulation functions kvm_nested_s2_unmap(),
kvm_nested_s2_wp(), and others, are being called for any stage-2
manipulation regardless of whether nested virtualization is supported or
enabled for the VM.
For protected KVM (pKVM), `struct kvm_pgtable` uses the
`pkvm_mappings` member of the union. This member aliases `ia_bits`,
which is used by the non-protected NV code paths. Attempting to
read `pgt->ia_bits` in these functions results in treating
protected mapping pointers or state values as bit-shift amounts.
This triggers a UBSAN shift-out-of-bounds error:
UBSAN: shift-out-of-bounds in arch/arm64/kvm/nested.c:1127:34
shift exponent 174565952 is too large for 64-bit type 'unsigned long'
Call trace:
__ubsan_handle_shift_out_of_bounds+0x28c/0x2c0
kvm_nested_s2_unmap+0x228/0x248
kvm_arch_flush_shadow_memslot+0x98/0xc0
kvm_set_memslot+0x248/0xce0
Since pKVM and NV are mutually exclusive, prevent entry into these
NV handling functions if the VM has not allocated any nested MMUs
(i.e., `kvm->arch.nested_mmus_size` is 0).
Fixes: 7270cc9157f47 ("KVM: arm64: nv: Handle VNCR_EL2 invalidation from MMU notifiers")
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202152310.113467-1-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The implementation of __READ_ONCE() under CONFIG_LTO=y incorrectly
qualified the fallback "once" access for types larger than 8 bytes,
which are not atomic but should still happen "once" and suppress common
compiler optimizations.
The cast `volatile typeof(__x)` applied the volatile qualifier to the
pointer type itself rather than the pointee. This created a volatile
pointer to a non-volatile type, which violated __READ_ONCE() semantics.
Fix this by casting to `volatile typeof(*__x) *`.
With a defconfig + LTO + debug options build, we see the following
functions to be affected:
xen_manage_runstate_time (884 -> 944 bytes)
xen_steal_clock (248 -> 340 bytes)
^-- use __READ_ONCE() to load vcpu_runstate_info structs
Fixes: e35123d83ee3 ("arm64: lto: Strengthen READ_ONCE() to acquire when CONFIG_LTO=y")
Cc: stable@vger.kernel.org
Reviewed-by: Boqun Feng <boqun@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Tested-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The current implementation uses `vgic_state_iter` in `struct
vgic_dist` to track the sequence position. This effectively makes the
iterator shared across all open file descriptors for the VM.
This approach has significant drawbacks:
- It enforces mutual exclusion, preventing concurrent reads of the
debugfs file (returning -EBUSY).
- It relies on storing transient iterator state in the long-lived
VM structure (`vgic_dist`).
Refactor the implementation to use the standard `seq_file` iterator.
Instead of storing state in `kvm_arch`, rely on the `pos` argument
passed to the `start` and `next` callbacks, which tracks the logical
index specific to the file descriptor.
This change enables concurrent access and eliminates the
`vgic_state_iter` field from `struct vgic_dist`.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202085721.3954942-4-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The vgic-debug interface implementation uses XArray marks
(`LPI_XA_MARK_DEBUG_ITER`) to "snapshot" LPIs at the start of iteration.
This modifies global state for a read-only operation and complicates
reference counting, leading to leaks if iteration is aborted or fails.
Reimplement the iterator to use dynamic iteration logic:
- Remove `lpi_idx` from `struct vgic_state_iter`.
- Replace the XArray marking mechanism with dynamic iteration using
`xa_find_after(..., XA_PRESENT)`.
- Wrap XArray traversals in `rcu_read_lock()`/`rcu_read_unlock()` to
ensure safety against concurrent modifications (e.g., LPI unmapping).
- Handle potential races where an LPI is removed during iteration by
gracefully skipping it in `show()`, rather than warning.
- Remove the unused `LPI_XA_MARK_DEBUG_ITER` definition.
This simplifies the lifecycle management of the iterator and prevents
resource leaks associated with the marking mechanism, and paves the way
for using a standard seq_file iterator.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202085721.3954942-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The current implementation uses `idreg_debugfs_iter` in `struct
kvm_arch` to track the sequence position. This effectively makes the
iterator shared across all open file descriptors for the VM.
This approach has significant drawbacks:
- It enforces mutual exclusion, preventing concurrent reads of the
debugfs file (returning -EBUSY).
- It relies on storing transient iterator state in the long-lived VM
structure (`kvm_arch`).
- The use of `u8` for the iterator index imposes an implicit limit of
255 registers. While not currently exceeded, this is fragile against
future architectural growth. Switching to `loff_t` eliminates this
overflow risk.
Refactor the implementation to use the standard `seq_file` iterator.
Instead of storing state in `kvm_arch`, rely on the `pos` argument
passed to the `start` and `next` callbacks, which tracks the logical
index specific to the file descriptor.
This change enables concurrent access and eliminates the
`idreg_debugfs_iter` field from `struct kvm_arch`.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202085721.3954942-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
This partially reverts commit 87af633689ce16ddb166c80f32b120e50b1295de so
the current memory target for PV guests is still fetched from
start_info->nr_pages, which matches exactly what the toolstack sets the
initial memory target to.
Using get_num_physpages() is possible on PV also, but needs adjusting to
take into account the ISA hole and the PFN at 0 not considered usable
memory despite being populated, and hence would need extra adjustments.
Instead of carrying those extra adjustments switch back to the previous
code. That leaves Linux with a difference in how current memory target is
obtained for HVM vs PV, but that's better than adding extra logic just for
PV.
However if switching to start_info->nr_pages for PV domains we need to
differentiate between released pages (freed back to the hypervisor) as
opposed to pages in the physmap which are not populated to start with.
Introduce a new xen_unpopulated_pages to account for papges that have
never been populated, and hence in the PV case don't need subtracting.
Fixes: 87af633689ce ("x86/xen: fix balloon target initialization for PVH dom0")
Reported-by: James Dingwall <james@dingwall.me.uk>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260128110510.46425-2-roger.pau@citrix.com>
|
|
Passing IRQF_ONESHOT ensures that the interrupt source is masked until
the secondary (threaded) handler is done. If only a primary handler is
used then the flag makes no sense because the interrupt can not fire
(again) while its handler is running.
Revert adding IRQF_ONESHOT to irqflags.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Linus Walleij <linusw@kernel.org>
Link: https://patch.msgid.link/20260128095540.863589-9-bigeasy@linutronix.de
|
|
In preparation for decoupling linux/instruction_pointer.h and
linux/kernel.h, include instruction_pointer.h explicitly where needed.
Link: https://lkml.kernel.org/r/20260116042510.241009-5-ynorov@nvidia.com
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
required to merge "kho: use unsigned long for nr_pages".
|
|
The current swap entry allocation/freeing workflow has never had a clear
definition. This makes it hard to debug or add new optimizations.
This commit introduces a proper definition of how swap entries would be
allocated and freed. Now, most operations are folio based, so they will
never exceed one swap cluster, and we now have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with new added sanity checks. Also making more optimization
possible.
Swap entry will be mostly freed and free with a folio bound. The folio
lock will be useful for resolving many swap related races.
Now swap allocation (except hibernation) always starts with a folio in the
swap cache, and gets duped/freed protected by the folio lock:
- folio_alloc_swap() - The only allocation entry point now.
Context: The folio must be locked.
This allocates one or a set of continuous swap slots for a folio and
binds them to the folio by adding the folio to the swap cache. The
swap slots' swap count start with zero value.
- folio_dup_swap() - Increase the swap count of one or more entries.
Context: The folio must be locked and in the swap cache. For now, the
caller still has to lock the new swap entry owner (e.g., PTL).
This increases the ref count of swap entries allocated to a folio.
Newly allocated swap slots' count has to be increased by this helper
as the folio got unmapped (and swap entries got installed).
- folio_put_swap() - Decrease the swap count of one or more entries.
Context: The folio must be locked and in the swap cache. For now, the
caller still has to lock the new swap entry owner (e.g., PTL).
This decreases the ref count of swap entries allocated to a folio.
Typically, swapin will decrease the swap count as the folio got
installed back and the swap entry got uninstalled
This won't remove the folio from the swap cache and free the
slot. Lazy freeing of swap cache is helpful for reducing IO.
There is already a folio_free_swap() for immediate cache reclaim.
This part could be further optimized later.
The above locking constraints could be further relaxed when the swap table
is fully implemented. Currently dup still needs the caller to lock the
swap entry container (e.g. PTL), or a concurrent zap may underflow the
swap count.
Some swap users need to interact with swap count without involving folio
(e.g. forking/zapping the page table or mapping truncate without swapin).
In such cases, the caller has to ensure there is no race condition on
whatever owns the swap count and use the below helpers:
- swap_put_entries_direct() - Decrease the swap count directly.
Context: The caller must lock whatever is referencing the slots to
avoid a race.
Typically the page table zapping or shmem mapping truncate will need
to free swap slots directly. If a slot is cached (has a folio bound),
this will also try to release the swap cache.
- swap_dup_entry_direct() - Increase the swap count directly.
Context: The caller must lock whatever is referencing the entries to
avoid race, and the entries must already have a swap count > 1.
Typically, forking will need to copy the page table and hence needs to
increase the swap count of the entries in the table. The page table is
locked while referencing the swap entries, so the entries all have a
swap count > 1 and can't be freed.
Hibernation subsystem is a bit different, so two special wrappers are here:
- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
helper.
All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.
By separating the workflows, it will be possible to bind folio more
tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary
pin.
This commit should not introduce any behavior change
[kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi]
Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
[ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris]
Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Kairui Song <ryncsn@gmail.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Chris Mason <clm@meta.com>
Cc: Lai Yi <yi1.lai@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The lazy_mmu_mode KUnit tests call lazy_mmu_mode_{enable,disable}. These
tests may be built as a module, and because of inlining this means that
arch_{enter,flush,leave}_lazy_mmu_mode need to be exported.
[akpm@linux-foundation.org: remove mm/tests/lazy_mmu_mode_kunit.c comment, per Kevin]
Link: https://lkml.kernel.org/r/20251218100541.2667405-1-kevin.brodsky@arm.com
Fixes: ee628d9cc8d5 ("mm: add basic tests for lazy_mmu")
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's make it consistent with the naming of the files but also with the
naming of CONFIG_BALLOON_MIGRATION.
While at it, add a "/* CONFIG_BALLOON */".
Link: https://lkml.kernel.org/r/20260119230133.3551867-24-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While compaction depends on migration, the other direction is not the
case. So let's make it clearer that this is all about migration of
balloon pages.
Adjust all comments/docs in the core to talk about "migration" instead of
"compaction".
While at it add some "/* CONFIG_BALLOON_MIGRATION */".
Link: https://lkml.kernel.org/r/20260119230133.3551867-23-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Even without CONFIG_BALLOON_COMPACTION this infrastructure implements
basic list and page management for a memory balloon.
Link: https://lkml.kernel.org/r/20260119230133.3551867-21-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's centralize it, by allowing for the driver to enable this handling
through a new flag (bool for now) in the balloon device info.
Note that we now adjust the counter when adding/removing a page into the
balloon list: when removing a page to deflate it, it will now happen
before the driver communicated with hypervisor, not afterwards.
This shouldn't make a difference in practice.
Link: https://lkml.kernel.org/r/20260119230133.3551867-7-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's update the balloon page references, the balloon page list, the
BALLOON_MIGRATE counter and the isolated-pages counter in
balloon_page_migrate(), after letting the balloon->migratepage() callback
deal with the actual inflation+deflation.
Note that we now perform the balloon list modifications outside of any
implementation-specific locks: which is fine, there is nothing special
about these page actions that the lock would be protecting.
The old page is already no longer in the list (isolated) and the new page
is not yet in the list.
Let's use -ENOENT to communicate the special "inflation of new page failed
after already deflating the old page" to balloon_page_migrate() so it can
handle it accordingly.
While at it, rename balloon->b_dev_info to make it match the other
functions. Also, drop the comment above balloon_page_migrate(), which
seems unnecessary.
Link: https://lkml.kernel.org/r/20260119230133.3551867-6-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now that there is not a lot of logic left, let's just inline setting up
the migration function.
To avoid #ifdef in the caller we can instead use IS_ENABLED() and make the
compiler happy by only providing the function declaration.
Now that the function is gone, drop the "out_balloon_compaction" label.
Note that before commit 68f2736a8583 ("mm: Convert all PageMovable users
to movable_operations") we actually had to undo something, now not
anymore.
Link: https://lkml.kernel.org/r/20260119230133.3551867-4-david@kernel.org
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
swap: fix race of truncate and swap entry split", needed for merging "mm,
swap: cleanup swap entry management workflow".
|
|
Implement fsession support in the arm64 BPF JIT trampoline.
Extend the trampoline stack layout to store function metadata and
session cookies, and pass the appropriate metadata to fentry and
fexit programs. This mirrors the existing x86 behavior and enables
session cookies on arm64.
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Tested-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260131144950.16294-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|