summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2026-03-19KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activatedSean Christopherson
[ Upstream commit 87d0f901a9bd8ae6be57249c737f20ac0cace93d ] Explicitly set/clear CR8 write interception when AVIC is (de)activated to fix a bug where KVM leaves the interception enabled after AVIC is activated. E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8 will remain intercepted in perpetuity. On its own, the dangling CR8 intercept is "just" a performance issue, but combined with the TPR sync bug fixed by commit d02e48830e3f ("KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the danging intercept is fatal to Windows guests as the TPR seen by hardware gets wildly out of sync with reality. Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in KVM's world. I.e. there's no need to trigger update_cr8_intercept(), this is firmly an SVM implementation flaw/detail. WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should never enter the guest with AVIC enabled and CR8 writes intercepted. Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Cc: Naveen N Rao (AMD) <naveen@kernel.org> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260203190711.458413-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> [Squash fix to avic_deactivate_vmcb. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19KVM: SVM: Add a helper to look up the max physical ID for AVICNaveen N Rao
[ Upstream commit f2f6e67a56dc88fea7e9b10c4e79bb01d97386b7 ] To help with a future change, add a helper to look up the maximum physical ID depending on the vCPU AVIC mode. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/0ab9bf5e20a3463a4aa3a5ea9bbbac66beedf1d1.1757009416.git.naveen@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Stable-dep-of: 87d0f901a9bd ("KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19KVM: SVM: Limit AVIC physical max index based on configured max_vcpu_idsNaveen N Rao
[ Upstream commit 574ef752d4aea04134bc121294d717f4422c2755 ] KVM allows VMMs to specify the maximum possible APIC ID for a virtual machine through KVM_CAP_MAX_VCPU_ID capability so as to limit data structures related to APIC/x2APIC. Utilize the same to set the AVIC physical max index in the VMCB, similar to VMX. This helps hardware limit the number of entries to be scanned in the physical APIC ID table speeding up IPI broadcasts for virtual machines with smaller number of vCPUs. Unlike VMX, SVM AVIC requires a single page to be allocated for the Physical APIC ID table and the Logical APIC ID table, so retain the existing approach of allocating those during VM init. Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/adb07ccdb3394cd79cb372ba6bcc69a4e4d4ef54.1757009416.git.naveen@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Stable-dep-of: 87d0f901a9bd ("KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19x86/apic: Disable x2apic on resume if the kernel expects soShashank Balaji
commit 8cc7dd77a1466f0ec58c03478b2e735a5b289b96 upstream. When resuming from s2ram, firmware may re-enable x2apic mode, which may have been disabled by the kernel during boot either because it doesn't support IRQ remapping or for other reasons. This causes the kernel to continue using the xapic interface, while the hardware is in x2apic mode, which causes hangs. This happens on defconfig + bare metal + s2ram. Fix this in lapic_resume() by disabling x2apic if the kernel expects it to be disabled, i.e. when x2apic_mode = 0. The ACPI v6.6 spec, Section 16.3 [1] says firmware restores either the pre-sleep configuration or initial boot configuration for each CPU, including MSR state: When executing from the power-on reset vector as a result of waking from an S2 or S3 sleep state, the platform firmware performs only the hardware initialization required to restore the system to either the state the platform was in prior to the initial operating system boot, or to the pre-sleep configuration state. In multiprocessor systems, non-boot processors should be placed in the same state as prior to the initial operating system boot. (further ahead) If this is an S2 or S3 wake, then the platform runtime firmware restores minimum context of the system before jumping to the waking vector. This includes: CPU configuration. Platform runtime firmware restores the pre-sleep configuration or initial boot configuration of each CPU (MSR, MTRR, firmware update, SMBase, and so on). Interrupts must be disabled (for IA-32 processors, disabled by CLI instruction). (and other things) So at least as per the spec, re-enablement of x2apic by the firmware is allowed if "x2apic on" is a part of the initial boot configuration. [1] https://uefi.org/specs/ACPI/6.6/16_Waking_and_Sleeping.html#initialization [ bp: Massage. ] Fixes: 6e1cb38a2aef ("x64, x2apic/intr-remap: add x2apic support, including enabling interrupt-remapping") Co-developed-by: Rahul Bukte <rahul.bukte@sony.com> Signed-off-by: Rahul Bukte <rahul.bukte@sony.com> Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260306-x2apic-fix-v2-1-bee99c12efa3@sony.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19KVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APICSean Christopherson
commit 3989a6d036c8ec82c0de3614bed23a1dacd45de5 upstream. Initialize all per-vCPU AVIC control fields in the VMCB if AVIC is enabled in KVM and the VM has an in-kernel local APIC, i.e. if it's _possible_ the vCPU could activate AVIC at any point in its lifecycle. Configuring the VMCB if and only if AVIC is active "works" purely because of optimizations in kvm_create_lapic() to speculatively set apicv_active if AVIC is enabled *and* to defer updates until the first KVM_RUN. In quotes because KVM likely won't do the right thing if kvm_apicv_activated() is false, i.e. if a vCPU is created while APICv is inhibited at the VM level for whatever reason. E.g. if the inhibit is *removed* before KVM_REQ_APICV_UPDATE is handled in KVM_RUN, then __kvm_vcpu_update_apicv() will elide calls to vendor code due to seeing "apicv_active == activate". Cleaning up the initialization code will also allow fixing a bug where KVM incorrectly leaves CR8 interception enabled when AVIC is activated without creating a mess with respect to whether AVIC is activated or not. Cc: stable@vger.kernel.org Fixes: 67034bb9dd5e ("KVM: SVM: Add irqchip_split() checks before enabling AVIC") Fixes: 6c3e4422dd20 ("svm: Add support for dynamic APICv") Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260203190711.458413-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMMJim Mattson
commit e2ffe85b6d2bb7780174b87aa4468a39be17eb81 upstream. Add KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM to allow L1 to set FREEZE_IN_SMM in vmcs12's GUEST_IA32_DEBUGCTL field, as permitted prior to commit 6b1dd26544d0 ("KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest"). Enable the quirk by default for backwards compatibility (like all quirks); userspace can disable it via KVM_CAP_DISABLE_QUIRKS2 for consistency with the constraints on WRMSR(IA32_DEBUGCTL). Note that the quirk only bypasses the consistency check. The vmcs02 bit is still owned by the host, and PMCs are not frozen during virtualized SMM. In particular, if a host administrator decides that PMCs should not be frozen during physical SMM, then L1 has no say in the matter. Fixes: 095686e6fcb4 ("KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter") Cc: stable@vger.kernel.org Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260205231537.1278753-1-jmattson@google.com [sean: tag for stable@, clean-up and fix goofs in the comment and docs] Signed-off-by: Sean Christopherson <seanjc@google.com> [Rename quirk. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12kbuild: Split .modinfo out from ELF_DETAILSNathan Chancellor
commit 8678591b47469fe16357234efef9b260317b8be4 upstream. Commit 3e86e4d74c04 ("kbuild: keep .modinfo section in vmlinux.unstripped") added .modinfo to ELF_DETAILS while removing it from COMMON_DISCARDS, as it was needed in vmlinux.unstripped and ELF_DETAILS was present in all architecture specific vmlinux linker scripts. While this shuffle is fine for vmlinux, ELF_DETAILS and COMMON_DISCARDS may be used by other linker scripts, such as the s390 and x86 compressed boot images, which may not expect to have a .modinfo section. In certain circumstances, this could result in a bootloader failing to load the compressed kernel [1]. Commit ddc6cbef3ef1 ("s390/boot/vmlinux.lds.S: Ensure bzImage ends with SecureBoot trailer") recently addressed this for the s390 bzImage but the same bug remains for arm, parisc, and x86. The presence of .modinfo in the x86 bzImage was the root cause of the issue worked around with commit d50f21091358 ("kbuild: align modinfo section for Secureboot Authenticode EDK2 compat"). misc.c in arch/x86/boot/compressed includes lib/decompress_unzstd.c, which in turn includes lib/xxhash.c and its MODULE_LICENSE / MODULE_DESCRIPTION macros due to the STATIC definition. Split .modinfo out from ELF_DETAILS into its own macro and handle it in all vmlinux linker scripts. Discard .modinfo in the places where it was previously being discarded from being in COMMON_DISCARDS, as it has never been necessary in those uses. Cc: stable@vger.kernel.org Fixes: 3e86e4d74c04 ("kbuild: keep .modinfo section in vmlinux.unstripped") Reported-by: Ed W <lists@wildgooses.com> Closes: https://lore.kernel.org/587f25e0-a80e-46a5-9f01-87cb40cfa377@wildgooses.com/ [1] Tested-by: Ed W <lists@wildgooses.com> # x86_64 Link: https://patch.msgid.link/20260225-separate-modinfo-from-elf-details-v1-1-387ced6baf4b@kernel.org Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12x86/boot/sev: Move SEV decompressor variables into the .data sectionTom Lendacky
commit 4ca191cec17a997d0e3b2cd312f3a884288acc27 upstream. As part of the work to remove the dependency on calling into the decompressor code (startup_64()) for a UEFI boot, a call to rmpadjust() was removed from sev_enable() in favor of checking the value of the snp_vmpl variable. When booting through a non-UEFI path and calling startup_64(), the call to sev_enable() is performed before the BSS section is zeroed. With the removal of the rmpadjust() call and the corresponding check of the return code, the snp_vmpl variable is checked. Since the kernel is running at VMPL0, the snp_vmpl variable will not have been set and should be the default value of 0. However, since the call occurs before the BSS is zeroed, the snp_vmpl variable may not actually be zero, which will cause the guest boot to fail. Since the decompressor relocates itself, the BSS would need to be cleared both before and after the relocation, but this would, in effect, cause all of the changes to BSS variables before relocation to be lost after relocation. Instead, move the snp_vmpl variable into the .data section so that it is initialized and the value made safe during relocation. As a pre-caution against future changes, move other SEV-related decompressor variables into the .data section, too. Fixes: 68a501d7fd82 ("x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Changyuan Lyu <changyuanl@google.com> Tested-by: Kevin Hui <kevinhui@meta.com> Tested-by: Changyuan Lyu <changyuanl@google.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/5648b7de5b0a5d0dfef3785f9582b718678c6448.1770217260.git.thomas.lendacky@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12x86/sev: Allow IBPB-on-Entry feature for SNP guestsKim Phillips
commit 9073428bb204d921ae15326bb7d4558d9d269aab upstream. The SEV-SNP IBPB-on-Entry feature does not require a guest-side implementation. It was added in Zen5 h/w, after the first SNP Zen implementation, and thus was not accounted for when the initial set of SNP features were added to the kernel. In its abundant precaution, commit 8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support") included SEV_STATUS' IBPB-on-Entry bit as a reserved bit, thereby masking guests from using the feature. Allow guests to make use of IBPB-on-Entry when supported by the hypervisor, as the bit is now architecturally defined and safe to expose. Fixes: 8c29f0165405 ("x86/sev: Add SEV-SNP guest feature negotiation support") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@kernel.org Link: https://patch.msgid.link/20260203222405.4065706-2-kim.phillips@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12x86/boot: Handle relative CONFIG_EFI_SBAT_FILE file pathsJan Stancek
commit 3d1973a0c76a78a4728cff13648a188ed486cf44 upstream. CONFIG_EFI_SBAT_FILE can be a relative path. When compiling using a different output directory (O=) the build currently fails because it can't find the filename set in CONFIG_EFI_SBAT_FILE: arch/x86/boot/compressed/sbat.S: Assembler messages: arch/x86/boot/compressed/sbat.S:6: Error: file not found: kernel.sbat Add $(srctree) as include dir for sbat.o. [ bp: Massage commit message. ] Fixes: 61b57d35396a ("x86/efi: Implement support for embedding SBAT data for x86") Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/f4eda155b0cef91d4d316b4e92f5771cb0aa7187.1772047658.git.jstancek@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12perf/x86/intel/uncore: Add per-scheduler IMC CAS count eventsZide Chen
commit 6a8a48644c4b804123e59dbfc5d6cd29a0194046 upstream. IMC on SPR and EMR does not support sub-channels. In contrast, CPUs that use gnr_uncores[] (e.g. Granite Rapids and Sierra Forest) implement two command schedulers (SCH0/SCH1) per memory channel, providing logically independent command and data paths. Do not reuse the spr_uncore_imc[] configuration for these CPUs. Instead, introduce a dedicated gnr_uncore_imc[] with per-scheduler events, so userspace can monitor SCH0 and SCH1 independently. On these CPUs, replace cas_count_{read,write} with cas_count_{read,write}_sch{0,1}. This may break existing userspace that relies on cas_count_{read,write}, prompting it to switch to the per-scheduler events, as the legacy event reports only partial traffic (SCH0). Fixes: 632c4bf6d007 ("perf/x86/intel/uncore: Support Granite Rapids") Fixes: cb4a6ccf3583 ("perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260210005225.20311-1-zide.chen@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12x86/efi: defer freeing of boot services memoryMike Rapoport (Microsoft)
commit a4b0bf6a40f3c107c67a24fbc614510ef5719980 upstream. efi_free_boot_services() frees memory occupied by EFI_BOOT_SERVICES_CODE and EFI_BOOT_SERVICES_DATA using memblock_free_late(). There are two issue with that: memblock_free_late() should be used for memory allocated with memblock_alloc() while the memory reserved with memblock_reserve() should be freed with free_reserved_area(). More acutely, with CONFIG_DEFERRED_STRUCT_PAGE_INIT=y efi_free_boot_services() is called before deferred initialization of the memory map is complete. Benjamin Herrenschmidt reports that this causes a leak of ~140MB of RAM on EC2 t3a.nano instances which only have 512MB or RAM. If the freed memory resides in the areas that memory map for them is still uninitialized, they won't be actually freed because memblock_free_late() calls memblock_free_pages() and the latter skips uninitialized pages. Using free_reserved_area() at this point is also problematic because __free_page() accesses the buddy of the freed page and that again might end up in uninitialized part of the memory map. Delaying the entire efi_free_boot_services() could be problematic because in addition to freeing boot services memory it updates efi.memmap without any synchronization and that's undesirable late in boot when there is concurrency. More robust approach is to only defer freeing of the EFI boot services memory. Split efi_free_boot_services() in two. First efi_unmap_boot_services() collects ranges that should be freed into an array then efi_free_boot_services() later frees them after deferred init is complete. Link: https://lore.kernel.org/all/ec2aaef14783869b3be6e3c253b2dcbf67dbc12a.camel@kernel.crashing.org Fixes: 916f676f8dc0 ("x86, efi: Retain boot service code until after switching to virtual mode") Cc: <stable@vger.kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-12KVM: x86: Add x2APIC "features" to control EOI broadcast suppressionKhushit Shah
[ Upstream commit 6517dfbcc918f970a928d9dc17586904bac06893 ] Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated by userspace), which KVM completely mishandles. When x2APIC support was first added, KVM incorrectly advertised and "enabled" Suppress EOI Broadcast, without fully supporting the I/O APIC side of the equation, i.e. without adding directed EOI to KVM's in-kernel I/O APIC. That flaw was carried over to split IRQCHIP support, i.e. KVM advertised support for Suppress EOI Broadcasts irrespective of whether or not the userspace I/O APIC implementation supported directed EOIs. Even worse, KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without support for directed EOI came to rely on the "spurious" broadcasts. KVM "fixed" the in-kernel I/O APIC implementation by completely disabling support for Suppress EOI Broadcasts in commit 0bcc3fb95b97 ("KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but didn't do anything to remedy userspace I/O APIC implementations. KVM's bogus handling of Suppress EOI Broadcast is problematic when the guest relies on interrupts being masked in the I/O APIC until well after the initial local APIC EOI. E.g. Windows with Credential Guard enabled handles interrupts in the following order: 1. Interrupt for L2 arrives. 2. L1 APIC EOIs the interrupt. 3. L1 resumes L2 and injects the interrupt. 4. L2 EOIs after servicing. 5. L1 performs the I/O APIC EOI. Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt storm, e.g. if the IRQ line is still asserted and userspace reacts to the EOI by re-injecting the IRQ, because the guest doesn't de-assert the line until step #4, and doesn't expect the interrupt to be re-enabled until step #5. Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way of knowing if the userspace I/O APIC supports directed EOIs, i.e. suppressing EOI broadcasts would result in interrupts being stuck masked in the userspace I/O APIC due to step #5 being ignored by userspace. And fully disabling support for Suppress EOI Broadcast is also undesirable, as picking up the fix would require a guest reboot, *and* more importantly would change the virtual CPU model exposed to the guest without any buy-in from userspace. Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to explicitly enable or disable support for Suppress EOI Broadcasts. This gives userspace control over the virtual CPU model exposed to the guest, as KVM should never have enabled support for Suppress EOI Broadcast without userspace opt-in. Not setting either flag will result in legacy quirky behavior for backward compatibility. Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel I/O APIC, as KVM's history/support is just as tragic. E.g. it's not clear that commit c806a6ad35bf ("KVM: x86: call irq notifiers with directed EOI") was entirely correct, i.e. it may have simply papered over the lack of Directed EOI emulation in the I/O APIC. Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI is to support Directed EOI (KVM's name) irrespective of guest CPU vendor. Fixes: 7543a635aa09 ("KVM: x86: Add KVM exit for IOAPIC EOIs") Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com Cc: stable@vger.kernel.org Suggested-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Khushit Shah <khushit.shah@nutanix.com> Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com [sean: clean up minor formatting goofs and fix a comment typo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12x86/uprobes: Fix XOL allocation failure for 32-bit tasksOleg Nesterov
[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ] This script #!/usr/bin/bash echo 0 > /proc/sys/kernel/randomize_va_space echo 'void main(void) {}' > TEST.c # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated gcc -m32 -fcf-protection=branch TEST.c -o test bpftrace -e 'uprobe:./test:main {}' -c ./test "hangs", the probed ./test task enters an endless loop. The problem is that with randomize_va_space == 0 get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used by the stack vma. arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE. vm_unmapped_area() happily returns the high address > TASK_SIZE and then get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)" check. handle_swbp() doesn't report this failure (probably it should) and silently restarts the probed insn. Endless loop. I think that the right fix should change the x86 get_unmapped_area() paths to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case because ->orig_ax = -1. But we need a simple fix for -stable, so this patch just sets TS_COMPAT if the probed task is 32-bit to make in_ia32_syscall() true. Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Reported-by: Paulo Andrade <pandrade@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/ Cc: stable@vger.kernel.org Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12unwind_user/x86: Teach FP unwind about start of functionPeter Zijlstra
[ Upstream commit ae25884ad749e7f6e0c3565513bdc8aa2554a425 ] When userspace is interrupted at the start of a function, before we get a chance to complete the frame, unwind will miss one caller. X86 has a uprobe specific fixup for this, add bits to the generic unwinder to support this. Suggested-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251024145156.GM4068168@noisy.programming.kicks-ass.net Stable-dep-of: d55c571e4333 ("x86/uprobes: Fix XOL allocation failure for 32-bit tasks") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12unwind_user/x86: Enable frame pointer unwinding on x86Josh Poimboeuf
[ Upstream commit 49cf34c0815f93fb2ea3ab5cfbac1124bd9b45d0 ] Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the unwind_user interfaces can be used. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20250827193828.347397433@kernel.org Stable-dep-of: d55c571e4333 ("x86/uprobes: Fix XOL allocation failure for 32-bit tasks") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12KVM: x86: Ignore -EBUSY when checking nested events from vcpu_block()Sean Christopherson
[ Upstream commit ead63640d4e72e6f6d464f4e31f7fecb79af8869 ] Ignore -EBUSY when checking nested events after exiting a blocking state while L2 is active, as exiting to userspace will generate a spurious userspace exit, usually with KVM_EXIT_UNKNOWN, and likely lead to the VM's demise. Continuing with the wakeup isn't perfect either, as *something* has gone sideways if a vCPU is awakened in L2 with an injected event (or worse, a nested run pending), but continuing on gives the VM a decent chance of surviving without any major side effects. As explained in the Fixes commits, it _should_ be impossible for a vCPU to be put into a blocking state with an already-injected event (exception, IRQ, or NMI). Unfortunately, userspace can stuff MP_STATE and/or injected events, and thus put the vCPU into what should be an impossible state. Don't bother trying to preserve the WARN, e.g. with an anti-syzkaller Kconfig, as WARNs can (hopefully) be added in paths where _KVM_ would be violating x86 architecture, e.g. by WARNing if KVM attempts to inject an exception or interrupt while the vCPU isn't running. Cc: Alessandro Ratti <alessandro@0x65c.net> Cc: stable@vger.kernel.org Fixes: 26844fee6ade ("KVM: x86: never write to memory from kvm_vcpu_check_block()") Fixes: 45405155d876 ("KVM: x86: WARN if a vCPU gets a valid wakeup that KVM can't yet inject") Link: https://syzkaller.appspot.com/text?tag=ReproC&x=10d4261a580000 Reported-by: syzbot+1522459a74d26b0ac33a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/671bc7a7.050a0220.455e8.022a.GAE@google.com Link: https://patch.msgid.link/20260109030657.994759-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12x86/acpi/boot: Correct acpi_is_processor_usable() check againYazen Ghannam
[ Upstream commit adbf61cc47cb72b102682e690ad323e1eda652c2 ] ACPI v6.3 defined a new "Online Capable" MADT LAPIC flag. This bit is used in conjunction with the "Enabled" MADT LAPIC flag to determine if a CPU can be enabled/hotplugged by the OS after boot. Before the new bit was defined, the "Enabled" bit was explicitly described like this (ACPI v6.0 wording provided): "If zero, this processor is unusable, and the operating system support will not attempt to use it" This means that CPU hotplug (based on MADT) is not possible. Many BIOS implementations follow this guidance. They may include LAPIC entries in MADT for unavailable CPUs, but since these entries are marked with "Enabled=0" it is expected that the OS will completely ignore these entries. However, QEMU will do the same (include entries with "Enabled=0") for the purpose of allowing CPU hotplug within the guest. Comment from QEMU function pc_madt_cpu_entry(): /* ACPI spec says that LAPIC entry for non present * CPU may be omitted from MADT or it must be marked * as disabled. However omitting non present CPU from * MADT breaks hotplug on linux. So possible CPUs * should be put in MADT but kept disabled. */ Recent Linux topology changes broke the QEMU use case. A following fix for the QEMU use case broke bare metal topology enumeration. Rework the Linux MADT LAPIC flags check to allow the QEMU use case only for guests and to maintain the ACPI spec behavior for bare metal. Remove an unnecessary check added to fix a bare metal case introduced by the QEMU "fix". [ bp: Change logic as Michal suggested. ] [ mingo: Removed misapplied -stable tag. ] Fixes: fed8d8773b8e ("x86/acpi/boot: Correct acpi_is_processor_usable() check") Fixes: f0551af02130 ("x86/topology: Ignore non-present APIC IDs in a present package") Closes: https://lore.kernel.org/r/20251024204658.3da9bf3f.michal.pecio@gmail.com Reported-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Michal Pecio <michal.pecio@gmail.com> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Link: https://lore.kernel.org/20251111145357.4031846-1-yazen.ghannam@amd.com Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12x86/cfi: Fix CFI rewrite for odd alignmentsPeter Zijlstra
[ Upstream commit 24c8147abb39618d74fcc36e325765e8fe7bdd7a ] Rustam reported his clang builds did not boot properly; turns out his .config has: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y set. Fix up the FineIBT code to deal with this unusual alignment. Fixes: 931ab63664f0 ("x86/ibt: Implement FineIBT") Reported-by: Rustam Kovhaev <rkovhaev@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Rustam Kovhaev <rkovhaev@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-12x86/fred: Correct speculative safety in fred_extint()Andrew Cooper
[ Upstream commit aa280a08e7d8fae58557acc345b36b3dc329d595 ] array_index_nospec() is no use if the result gets spilled to the stack, as it makes the believed safe-under-speculation value subject to memory predictions. For all practical purposes, this means array_index_nospec() must be used in the expression that accesses the array. As the code currently stands, it's the wrong side of irqentry_enter(), and 'index' is put into %ebp across the function call. Remove the index variable and reposition array_index_nospec(), so it's calculated immediately before the array access. Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260106131504.679932-1-andrew.cooper3@citrix.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04x86/kexec: Copy ACPI root pointer address from config tableArd Biesheuvel
[ Upstream commit e00ac9e5afb5d80c0168ec88d8e8662a54af8249 ] Dave reports that kexec may fail when the first kernel boots via the EFI stub but without EFI runtime services, as in that case, the RSDP address field in struct bootparams is never assigned. Kexec copies this value into the version of struct bootparams that it provides to the incoming kernel, which may have no other means to locate the ACPI root pointer. So take the value from the EFI config tables if no root pointer has been set in the first kernel's struct bootparams. Fixes: a1b87d54f4e4 ("x86/efistub: Avoid legacy decompressor when doing EFI boot") Cc: <stable@vger.kernel.org> # v6.1 Reported-by: Dave Young <dyoung@redhat.com> Tested-by: Dave Young <dyoung@redhat.com> Link: https://lore.kernel.org/linux-efi/aZQg_tRQmdKNadCg@darkstar.users.ipa.redhat.com/ Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04x86/kexec: add a sanity check on previous kernel's ima kexec bufferHarshit Mogalapalli
[ Upstream commit c5489d04337b47e93c0623e8145fcba3f5739efd ] When the second-stage kernel is booted via kexec with a limiting command line such as "mem=<size>", the physical range that contains the carried over IMA measurement list may fall outside the truncated RAM leading to a kernel panic. BUG: unable to handle page fault for address: ffff97793ff47000 RIP: ima_restore_measurement_list+0xdc/0x45a #PF: error_code(0x0000) – not-present page Other architectures already validate the range with page_is_ram(), as done in commit cbf9c4b9617b ("of: check previous kernel's ima-kexec-buffer against memory bounds") do a similar check on x86. Without carrying the measurement list across kexec, the attestation would fail. Link: https://lkml.kernel.org/r/20251231061609.907170-4-harshit.m.mogalapalli@oracle.com Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Fixes: b69a2afd5afc ("x86/kexec: Carry forward IMA measurement log on kexec") Reported-by: Paul Webb <paul.x.webb@oracle.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Cc: Alexander Graf <graf@amazon.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Betkov <bp@alien8.de> Cc: guoweikang <guoweikang.kernel@gmail.com> Cc: Henry Willard <henry.willard@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Joel Granados <joel.granados@kernel.org> Cc: Jonathan McDowell <noodles@fb.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Yifei Liu <yifei.l.liu@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04KVM: x86: Add SRCU protection for reading PDPTRs in __get_sregs2()Vasiliy Kovalev
[ Upstream commit 95d848dc7e639988dbb385a8cba9b484607cf98c ] Add SRCU read-side protection when reading PDPTR registers in __get_sregs2(). Reading PDPTRs may trigger access to guest memory: kvm_pdptr_read() -> svm_cache_reg() -> load_pdptrs() -> kvm_vcpu_read_guest_page() -> kvm_vcpu_gfn_to_memslot() kvm_vcpu_gfn_to_memslot() dereferences memslots via __kvm_memslots(), which uses srcu_dereference_check() and requires either kvm->srcu or kvm->slots_lock to be held. Currently only vcpu->mutex is held, triggering lockdep warning: ============================= WARNING: suspicious RCU usage in kvm_vcpu_gfn_to_memslot 6.12.59+ #3 Not tainted include/linux/kvm_host.h:1062 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by syz.5.1717/15100: #0: ff1100002f4b00b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x1d5/0x1590 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0xf0/0x120 lib/dump_stack.c:120 lockdep_rcu_suspicious+0x1e3/0x270 kernel/locking/lockdep.c:6824 __kvm_memslots include/linux/kvm_host.h:1062 [inline] __kvm_memslots include/linux/kvm_host.h:1059 [inline] kvm_vcpu_memslots include/linux/kvm_host.h:1076 [inline] kvm_vcpu_gfn_to_memslot+0x518/0x5e0 virt/kvm/kvm_main.c:2617 kvm_vcpu_read_guest_page+0x27/0x50 virt/kvm/kvm_main.c:3302 load_pdptrs+0xff/0x4b0 arch/x86/kvm/x86.c:1065 svm_cache_reg+0x1c9/0x230 arch/x86/kvm/svm/svm.c:1688 kvm_pdptr_read arch/x86/kvm/kvm_cache_regs.h:141 [inline] __get_sregs2 arch/x86/kvm/x86.c:11784 [inline] kvm_arch_vcpu_ioctl+0x3e20/0x4aa0 arch/x86/kvm/x86.c:6279 kvm_vcpu_ioctl+0x856/0x1590 virt/kvm/kvm_main.c:4663 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl fs/ioctl.c:893 [inline] __x64_sys_ioctl+0x18b/0x210 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xbd/0x1d0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Fixes: 6dba94035203 ("KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2") Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org> Link: https://patch.msgid.link/20260123222801.646123-1-kovalev@altlinux.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04KVM: nSVM: Always use vmcb01 in VMLOAD/VMSAVE emulationYosry Ahmed
[ Upstream commit 127ccae2c185f62e6ecb4bf24f9cb307e9b9c619 ] Commit cc3ed80ae69f ("KVM: nSVM: always use vmcb01 to for vmsave/vmload of guest state") made KVM always use vmcb01 for the fields controlled by VMSAVE/VMLOAD, but it missed updating the VMLOAD/VMSAVE emulation code to always use vmcb01. As a result, if VMSAVE/VMLOAD is executed by an L2 guest and is not intercepted by L1, KVM will mistakenly use vmcb02. Always use vmcb01 instead of the current VMCB. Fixes: cc3ed80ae69f ("KVM: nSVM: always use vmcb01 to for vmsave/vmload of guest state") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260110004821.3411245-2-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04KVM: nSVM: Remove a user-triggerable WARN on nested_svm_load_cr3() succeedingSean Christopherson
[ Upstream commit fc3ba56385d03501eb582e4b86691ba378e556f9 ] Drop the WARN in svm_set_nested_state() on nested_svm_load_cr3() failing as it is trivially easy to trigger from userspace by modifying CPUID after loading CR3. E.g. modifying the state restoration selftest like so: --- tools/testing/selftests/kvm/x86/state_test.c +++ tools/testing/selftests/kvm/x86/state_test.c @@ -280,7 +280,16 @@ int main(int argc, char *argv[]) /* Restore state in a new VM. */ vcpu = vm_recreate_with_one_vcpu(vm); - vcpu_load_state(vcpu, state); + + if (stage == 4) { + state->sregs.cr3 = BIT(44); + vcpu_load_state(vcpu, state); + + vcpu_set_cpuid_property(vcpu, X86_PROPERTY_MAX_PHY_ADDR, 36); + __vcpu_nested_state_set(vcpu, &state->nested); + } else { + vcpu_load_state(vcpu, state); + } /* * Restore XSAVE state in a dummy vCPU, first without doing generates: WARNING: CPU: 30 PID: 938 at arch/x86/kvm/svm/nested.c:1877 svm_set_nested_state+0x34a/0x360 [kvm_amd] Modules linked in: kvm_amd kvm irqbypass [last unloaded: kvm] CPU: 30 UID: 1000 PID: 938 Comm: state_test Tainted: G W 6.18.0-rc7-58e10b63777d-next-vm Tainted: [W]=WARN Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:svm_set_nested_state+0x34a/0x360 [kvm_amd] Call Trace: <TASK> kvm_arch_vcpu_ioctl+0xf33/0x1700 [kvm] kvm_vcpu_ioctl+0x4e6/0x8f0 [kvm] __x64_sys_ioctl+0x8f/0xd0 do_syscall_64+0x61/0xad0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Simply delete the WARN instead of trying to prevent userspace from shoving "illegal" state into CR3. For better or worse, KVM's ABI allows userspace to set CPUID after SREGS, and vice versa, and KVM is very permissive when it comes to guest CPUID. I.e. attempting to enforce the virtual CPU model when setting CPUID could break userspace. Given that the WARN doesn't provide any meaningful protection for KVM or benefit for userspace, simply drop it even though the odds of breaking userspace are minuscule. Opportunistically delete a spurious newline. Fixes: b222b0b88162 ("KVM: nSVM: refactor the CR3 reload on migration") Cc: stable@vger.kernel.org Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251216161755.1775409-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04KVM: x86: Return "unsupported" instead of "invalid" on access to unsupported ↵Sean Christopherson
PV MSR [ Upstream commit 5bb9ac1865123356337a389af935d3913ee917ed ] Return KVM_MSR_RET_UNSUPPORTED instead of '1' (which for all intents and purposes means "invalid") when rejecting accesses to KVM PV MSRs to adhere to KVM's ABI of allowing host reads and writes of '0' to MSRs that are advertised to userspace via KVM_GET_MSR_INDEX_LIST, even if the vCPU model doesn't support the MSR. E.g. running a QEMU VM with -cpu host,-kvmclock,kvm-pv-enforce-cpuid yields: qemu: error: failed to set MSR 0x12 to 0x0 qemu: target/i386/kvm/kvm.c:3301: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID") Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20251230205948.4094097-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04x86/sev: Use kfree_sensitive() when freeing a SNP message descriptorBorislav Petkov (AMD)
[ Upstream commit af05e558988ed004a20fc4de7d0f80cfbba663f0 ] Use the proper helper instead of an open-coded variant. Closes: https://lore.kernel.org/r/202512202235.WHPQkLZu-lkp@intel.com Reported-by: kernel test robot <lkp@intel.com> Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/20260112114147.GBaWTd-8HSy_Xp4S3X@fat_crate.local Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04perf/x86/intel: Add Airmont NPMartin Schiller
[ Upstream commit a08340fd291671c54d379d285b2325490ce90ddd ] The Intel / MaxLinear Airmont NP (aka Lightning Mountain) supports the same architectual and non-architecural events as Airmont. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20251124074846.9653-3-ms@dev.tdt.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04perf/x86/cstate: Add Airmont NPMartin Schiller
[ Upstream commit 3006911f284d769b0f66c12b39da130325ef1440 ] From the perspective of Intel cstate residency counters, the Airmont NP (aka Lightning Mountain) is identical to the Airmont. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20251124074846.9653-4-ms@dev.tdt.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04perf/x86/msr: Add Airmont NPMartin Schiller
[ Upstream commit 63dbadcafc1f4d1da796a8e2c0aea1e561f79ece ] Like Airmont, the Airmont NP (aka Intel / MaxLinear Lightning Mountain) supports SMI_COUNT MSR. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20251124074846.9653-2-ms@dev.tdt.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04x86/xen/pvh: Enable PAE mode for 32-bit guest only when CONFIG_X86_PAE is setHou Wenlong
[ Upstream commit db9aded979b491a24871e1621cd4e8822dbca859 ] The PVH entry is available for 32-bit KVM guests, and 32-bit KVM guests do not depend on CONFIG_X86_PAE. However, mk_early_pgtbl_32() builds different pagetables depending on whether CONFIG_X86_PAE is set. Therefore, enabling PAE mode for 32-bit KVM guests without CONFIG_X86_PAE being set would result in a boot failure during CR3 loading. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <d09ce9a134eb9cbc16928a5b316969f8ba606b81.1768017442.git.houwenlong.hwl@antgroup.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-26x86/hyperv: Fix error pointer dereferenceEthan Tidmore
[ Upstream commit 705d01c8d78121ee1634bfc602ac4b0ad1438fab ] The function idle_thread_get() can return an error pointer and is not checked for it. Add check for error pointer. Detected by Smatch: arch/x86/hyperv/hv_vtl.c:126 hv_vtl_bringup_vcpu() error: 'idle' dereferencing possible ERR_PTR() Fixes: 2b4b90e053a29 ("x86/hyperv: Use per cpu initial stack for vtl context") Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-26x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs pathJiri Olsa
[ Upstream commit aea251799998aa1b78eacdfb308f18ea114ea5b3 ] Mahe reported missing function from stack trace on top of kprobe multi program. The missing function is the very first one in the stacktrace, the one that the bpf program is attached to. # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}' Attaching 1 probe... do_syscall_64+134 entry_SYSCALL_64_after_hwframe+118 ('*' is used for kprobe_multi attachment) The reason is that the previous change (the Fixes commit) fixed stack unwind for tracepoint, but removed attached function address from the stack trace on top of kprobe multi programs, which I also overlooked in the related test (check following patch). The tracepoint and kprobe_multi have different stack setup, but use same unwind path. I think it's better to keep the previous change, which fixed tracepoint unwind and instead change the kprobe multi unwind as explained below. The bpf program stack unwind calls perf_callchain_kernel for kernel portion and it follows two unwind paths based on X86_EFLAGS_FIXED bit in pt_regs.flags. When the bit set we unwind from stack represented by pt_regs argument, otherwise we unwind currently executed stack up to 'first_frame' boundary. The 'first_frame' value is taken from regs.rsp value, but ftrace_caller and ftrace_regs_caller (ftrace trampoline) functions set the regs.rsp to the previous stack frame, so we skip the attached function entry. If we switch kprobe_multi unwind to use the X86_EFLAGS_FIXED bit, we set the start of the unwind to the attached function address. As another benefit we also cut extra unwind cycles needed to reach the 'first_frame' boundary. The speedup can be measured with trigger bench for kprobe_multi program and stacktrace support. - trigger bench with stacktrace on current code: kprobe-multi : 0.810 ± 0.001M/s kretprobe-multi: 0.808 ± 0.001M/s - and with the fix: kprobe-multi : 1.264 ± 0.001M/s kretprobe-multi: 1.401 ± 0.002M/s With the fix, the entry probe stacktrace: # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}' Attaching 1 probe... __x64_sys_newuname+9 do_syscall_64+134 entry_SYSCALL_64_after_hwframe+118 The return probe skips the attached function, because it's no longer on the stack at the point of the unwind and this way is the same how standard kretprobe works. # bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}' Attaching 1 probe... do_syscall_64+134 entry_SYSCALL_64_after_hwframe+118 Fixes: 6d08340d1e35 ("Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"") Reported-by: Mahe Tardy <mahe.tardy@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20260126211837.472802-3-jolsa@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-26x86/fgraph: Fix return_to_handler regs.rsp valueJiri Olsa
[ Upstream commit 8bc11700e0d23d4fdb7d8d5a73b2e95de427cabc ] The previous change (Fixes commit) messed up the rsp register value, which is wrong because it's already adjusted with FRAME_SIZE, we need the original rsp value. This change does not affect fprobe current kernel unwind, the !perf_hw_regs path perf_callchain_kernel: if (perf_hw_regs(regs)) { if (perf_callchain_store(entry, regs->ip)) return; unwind_start(&state, current, regs, NULL); } else { unwind_start(&state, current, NULL, (void *)regs->sp); } which uses pt_regs.sp as first_frame boundary (FRAME_SIZE shift makes no difference, unwind stil stops at the right frame). This change fixes the other path when we want to unwind directly from pt_regs sp/fp/ip state, which is coming in following change. Fixes: 20a0bc10272f ("x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20260126211837.472802-2-jolsa@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-26perf/x86/core: Do not set bit width for unavailable countersSandipan Das
[ Upstream commit b456a6ba5756b6fb7e651775343e713bd08418e7 ] Not all x86 processors have fixed counters. It may also be the case that a processor has only fixed counters and no general-purpose counters. Set the bit widths corresponding to each counter type only if such counters are available. Fixes: b3d9468a8bd2 ("perf, x86: Expose perf capability to other modules") Signed-off-by: Sandipan Das <sandipan.das@amd.com> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-11-seanjc@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-26x86/cpu/amd: Correct the microcode table for ZenbleedAndrew Cooper
[ Upstream commit fb7bfa31b8e8569f154f2fe0ea6c2f03c0f087aa ] The good revisions are tied to exact steppings, meaning it's not valid to match on model number alone, let alone a range. This is probably only a latent issue. From public microcode archives, the following CPUs exist 17-30-00, 17-60-00, 17-70-00 and would be captured by the model ranges. They're likely pre-production steppings, and likely didn't get Zenbleed microcode, but it's still incorrect to compare them to a different steppings revision. Either way, convert the logic to use x86_match_min_microcode_rev(), which is the preferred mechanism. Fixes: 522b1d69219d ("x86/cpu/amd: Add a Zenbleed fix") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Mario Limonciello <mario.limonciello@amd.com> Cc: x86@kernel.org Link: https://patch.msgid.link/20251126130352.880424-1-andrew.cooper3@citrix.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-26Partial revert "x86/xen: fix balloon target initialization for PVH dom0"Roger Pau Monne
[ Upstream commit 0949c646d64697428ff6257d52efa5093566868d ] This partially reverts commit 87af633689ce16ddb166c80f32b120e50b1295de so the current memory target for PV guests is still fetched from start_info->nr_pages, which matches exactly what the toolstack sets the initial memory target to. Using get_num_physpages() is possible on PV also, but needs adjusting to take into account the ISA hole and the PFN at 0 not considered usable memory despite being populated, and hence would need extra adjustments. Instead of carrying those extra adjustments switch back to the previous code. That leaves Linux with a difference in how current memory target is obtained for HVM vs PV, but that's better than adding extra logic just for PV. However if switching to start_info->nr_pages for PV domains we need to differentiate between released pages (freed back to the hypervisor) as opposed to pages in the physmap which are not populated to start with. Introduce a new xen_unpopulated_pages to account for papges that have never been populated, and hence in the PV case don't need subtracting. Fixes: 87af633689ce ("x86/xen: fix balloon target initialization for PVH dom0") Reported-by: James Dingwall <james@dingwall.me.uk> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260128110510.46425-2-roger.pau@citrix.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11x86/sev: Disable GCOV on noinstr objectBrendan Jackman
[ Upstream commit 9efb74f84ba82a9de81fc921baf3c5e2decf8256 ] With Debian clang version 19.1.7 (3+build5) there are calls to kasan_check_write() from __sev_es_nmi_complete(), which violates noinstr. Fix it by disabling GCOV for the noinstr object, as has been done for previous such instrumentation issues. Note that this file already disables __SANITIZE_ADDRESS__ and __SANITIZE_THREAD__, thus calls like kasan_check_write() ought to be nops regardless of GCOV. This has been fixed in other patches. However, to avoid any other accidental instrumentation showing up, (and since, in principle GCOV is instrumentation and hence should be disabled for noinstr code anyway), disable GCOV overall as well. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Marco Elver <elver@google.com> Link: https://patch.msgid.link/20251216-gcov-inline-noinstr-v3-3-10244d154451@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11KVM: x86: Explicitly configure supported XSS from {svm,vmx}_set_cpu_caps()Sean Christopherson
commit f8ade833b733ae0b72e87ac6d2202a1afbe3eb4a upstream. Explicitly configure KVM's supported XSS as part of each vendor's setup flow to fix a bug where clearing SHSTK and IBT in kvm_cpu_caps, e.g. due to lack of CET XFEATURE support, makes kvm-intel.ko unloadable when nested VMX is enabled, i.e. when nested=1. The late clearing results in nested_vmx_setup_{entry,exit}_ctls() clearing VM_{ENTRY,EXIT}_LOAD_CET_STATE when nested_vmx_setup_ctls_msrs() runs during the CPU compatibility checks, ultimately leading to a mismatched VMCS config due to the reference config having the CET bits set, but every CPU's "local" config having the bits cleared. Note, kvm_caps.supported_{xcr0,xss} are unconditionally initialized by kvm_x86_vendor_init(), before calling into vendor code, and not referenced between ops->hardware_setup() and their current/old location. Fixes: 69cc3e886582 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER") Cc: stable@vger.kernel.org Cc: Mathias Krause <minipli@grsecurity.net> Cc: John Allen <john.allen@amd.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Chao Gao <chao.gao@intel.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-11x86/kfence: fix booting on 32bit non-PAE systemsAndrew Cooper
commit 16459fe7e0ca6520a6e8f603de4ccd52b90fd765 upstream. The original patch inverted the PTE unconditionally to avoid L1TF-vulnerable PTEs, but Linux doesn't make this adjustment in 2-level paging. Adjust the logic to use the flip_protnone_guard() helper, which is a nop on 2-level paging but inverts the address bits in all other paging modes. This doesn't matter for the Xen aspect of the original change. Linux no longer supports running 32bit PV under Xen, and Xen doesn't support running any 32bit PV guests without using PAE paging. Link: https://lkml.kernel.org/r/20260126211046.2096622-1-andrew.cooper3@citrix.com Fixes: b505f1944535 ("x86/kfence: avoid writing L1TF-vulnerable PTEs") Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Closes: https://lore.kernel.org/lkml/CAKFNMokwjw68ubYQM9WkzOuH51wLznHpEOMSqtMoV1Rn9JV_gw@mail.gmail.com/ Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-11x86/vmware: Fix hypercall clobbersJosh Poimboeuf
commit 2687c848e57820651b9f69d30c4710f4219f7dbf upstream. Fedora QA reported the following panic: BUG: unable to handle page fault for address: 0000000040003e54 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20251119-3.fc43 11/19/2025 RIP: 0010:vmware_hypercall4.constprop.0+0x52/0x90 .. Call Trace: vmmouse_report_events+0x13e/0x1b0 psmouse_handle_byte+0x15/0x60 ps2_interrupt+0x8a/0xd0 ... because the QEMU VMware mouse emulation is buggy, and clears the top 32 bits of %rdi that the kernel kept a pointer in. The QEMU vmmouse driver saves and restores the register state in a "uint32_t data[6];" and as a result restores the state with the high bits all cleared. RDI originally contained the value of a valid kernel stack address (0xff5eeb3240003e54). After the vmware hypercall it now contains 0x40003e54, and we get a page fault as a result when it is dereferenced. The proper fix would be in QEMU, but this works around the issue in the kernel to keep old setups working, when old kernels had not happened to keep any state in %rdi over the hypercall. In theory this same issue exists for all the hypercalls in the vmmouse driver; in practice it has only been seen with vmware_hypercall3() and vmware_hypercall4(). For now, just mark RDI/RSI as clobbered for those two calls. This should have a minimal effect on code generation overall as it should be rare for the compiler to want to make RDI/RSI live across hypercalls. Reported-by: Justin Forbes <jforbes@fedoraproject.org> Link: https://lore.kernel.org/all/99a9c69a-fc1a-43b7-8d1e-c42d6493b41f@broadcom.com/ Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-30perf/x86/intel: Do not enable BTS for guestsFernand Sieber
commit 91dcfae0ff2b9b9ab03c1ec95babaceefbffb9f4 upstream. By default when users program perf to sample branch instructions (PERF_COUNT_HW_BRANCH_INSTRUCTIONS) with a sample period of 1, perf interprets this as a special case and enables BTS (Branch Trace Store) as an optimization to avoid taking an interrupt on every branch. Since BTS doesn't virtualize, this optimization doesn't make sense when the request originates from a guest. Add an additional check that prevents this optimization for virtualized events (exclude_host). Reported-by: Jan H. Schönherr <jschoenh@amazon.de> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20251211183604.868641-1-sieberf@amazon.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-30x86: make page fault handling disable interrupts properlyCedric Xing
[ Upstream commit 614da1d3d4cdbd6e41aea06bc97ec15aacff6daf ] There's a big comment in the x86 do_page_fault() about our interrupt disabling code: * User address page fault handling might have reenabled * interrupts. Fixing up all potential exit points of * do_user_addr_fault() and its leaf functions is just not * doable w/o creating an unholy mess or turning the code * upside down. but it turns out that comment is subtly wrong, and the code as a result is also wrong. Because it's certainly true that we may have re-enabled interrupts when handling user page faults. And it's most certainly true that we don't want to bother fixing up all the cases. But what isn't true is that it's limited to user address page faults. The confusion stems from the fact that we have logic here that depends on the address range of the access, but other code then depends on the _context_ the access was done in. The two are not related, even though both of them are about user-vs-kernel. In other words, both user and kernel addresses can cause interrupts to have been enabled (eg when __bad_area_nosemaphore() gets called for user accesses to kernel addresses). As a result we should make sure to disable interrupts again regardless of the address range before returning to the low-level fault handling code. The __bad_area_nosemaphore() code actually did disable interrupts again after enabling them, just not consistently. Ironically, as noted in the original comment, fixing up all the cases is just not worth it, when the simple solution is to just do it unconditionally in one single place. So remove the incomplete case that unsuccessfully tried to do what the comment said was "not doable" in commit ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code"), and just make it do the simple and straightforward thing. Signed-off-by: Cedric Xing <cedric.xing@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Fixes: ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code") Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-30x86/kfence: avoid writing L1TF-vulnerable PTEsAndrew Cooper
commit b505f1944535f83d369ae68813e7634d11b990d3 upstream. For native, the choice of PTE is fine. There's real memory backing the non-present PTE. However, for XenPV, Xen complains: (XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing To explain, some background on XenPV pagetables: Xen PV guests are control their own pagetables; they choose the new PTE value, and use hypercalls to make changes so Xen can audit for safety. In addition to a regular reference count, Xen also maintains a type reference count. e.g. SegDesc (referenced by vGDT/vLDT), Writable (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a lower pagetable level). This is in order to prevent e.g. a page being inserted into the pagetables for which the guest has a writable mapping. For non-present mappings, all other bits become software accessible, and typically contain metadata rather a real frame address. There is nothing that a reference count could sensibly be tied to. As such, even if Xen could recognise the address as currently safe, nothing would prevent that frame from changing owner to another VM in the future. When Xen detects a PV guest writing a L1TF-PTE, it responds by activating shadow paging. This is normally only used for the live phase of migration, and comes with a reasonable overhead. KFENCE only cares about getting #PF to catch wild accesses; it doesn't care about the value for non-present mappings. Use a fully inverted PTE, to avoid hitting the slow path when running under Xen. While adjusting the logic, take the opportunity to skip all actions if the PTE is already in the right state, half the number PVOps callouts, and skip TLB maintenance on a !P -> P transition which benefits non-Xen cases too. Link: https://lkml.kernel.org/r/20260106180426.710013-1-andrew.cooper3@citrix.com Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23iommu/sva: invalidate stale IOTLB entries for kernel address spaceLu Baolu
commit e37d5a2d60a338c5917c45296bac65da1382eda5 upstream. Introduce a new IOMMU interface to flush IOTLB paging cache entries for the CPU kernel address space. This interface is invoked from the x86 architecture code that manages combined user and kernel page tables, specifically before any kernel page table page is freed and reused. This addresses the main issue with vfree() which is a common occurrence and can be triggered by unprivileged users. While this resolves the primary problem, it doesn't address some extremely rare case related to memory unplug of memory that was present as reserved memory at boot, which cannot be triggered by unprivileged users. The discussion can be found at the link below. Enable SVA on x86 architecture since the IOMMU can now receive notification to flush the paging cache before freeing the CPU kernel page table pages. Link: https://lkml.kernel.org/r/20251022082635.2462433-9-baolu.lu@linux.intel.com Link: https://lore.kernel.org/linux-iommu/04983c62-3b1d-40d4-93ae-34ca04b827e5@intel.com/ Co-developed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Suggested-by: Jann Horn <jannh@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23x86/mm: use pagetable_free()Lu Baolu
commit bf9e4e30f3538391745a99bc2268ec4f5e4a401e upstream. The kernel's memory management subsystem provides a dedicated interface, pagetable_free(), for freeing page table pages. Updates two call sites to use pagetable_free() instead of the lower-level __free_page() or free_pages(). This improves code consistency and clarity, and ensures the correct freeing mechanism is used. Link: https://lkml.kernel.org/r/20251022082635.2462433-7-baolu.lu@linux.intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23x86/mm: use 'ptdesc' when freeing PMD pagesDave Hansen
commit 412d000346ea38ac4b9bb715a86c73ef89d90dea upstream. There are a billion ways to refer to a physical memory address. One of the x86 PMD freeing code location chooses to use a 'pte_t *' to point to a PMD page and then call a PTE-specific freeing function for it. That's a bit wonky. Just use a 'struct ptdesc *' instead. Its entire purpose is to refer to page table pages. It also means being able to remove an explicit cast. Right now, pte_free_kernel() is a one-liner that calls pagetable_dtor_free(). Effectively, all this patch does is remove one superfluous __pa(__va(paddr)) conversion and then call pagetable_dtor_free() directly instead of through a helper. Link: https://lkml.kernel.org/r/20251022082635.2462433-5-baolu.lu@linux.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23x86/resctrl: Fix memory bandwidth counter width for HygonXiaochen Shen
commit 7517e899e1b87b4c22a92c7e40d8733c48e4ec3c upstream. The memory bandwidth calculation relies on reading the hardware counter and measuring the delta between samples. To ensure accurate measurement, the software reads the counter frequently enough to prevent it from rolling over twice between reads. The default Memory Bandwidth Monitoring (MBM) counter width is 24 bits. Hygon CPUs provide a 32-bit width counter, but they do not support the MBM capability CPUID leaf (0xF.[ECX=1]:EAX) to report the width offset (from 24 bits). Consequently, the kernel falls back to the 24-bit default counter width, which causes incorrect overflow handling on Hygon CPUs. Fix this by explicitly setting the counter width offset to 8 bits (resulting in a 32-bit total counter width) for Hygon CPUs. Fixes: d8df126349da ("x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper") Signed-off-by: Xiaochen Shen <shenxiaochen@open-hieco.net> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251209062650.1536952-3-shenxiaochen@open-hieco.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23x86/resctrl: Add missing resctrl initialization for HygonXiaochen Shen
commit 6ee98aabdc700b5705e4f1833e2edc82a826b53b upstream. Hygon CPUs supporting Platform QoS features currently undergo partial resctrl initialization through resctrl_cpu_detect() in the Hygon BSP init helper and AMD/Hygon common initialization code. However, several critical data structures remain uninitialized for Hygon CPUs in the following paths: - get_mem_config()-> __rdt_get_mem_config_amd(): rdt_resource::membw,alloc_capable hw_res::num_closid - rdt_init_res_defs()->rdt_init_res_defs_amd(): rdt_resource::cache hw_res::msr_base,msr_update Add the missing AMD/Hygon common initialization to ensure proper Platform QoS functionality on Hygon CPUs. Fixes: d8df126349da ("x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper") Signed-off-by: Xiaochen Shen <shenxiaochen@open-hieco.net> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251209062650.1536952-2-shenxiaochen@open-hieco.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23x86/kaslr: Recognize all ZONE_DEVICE users as physaddr consumersDan Williams
commit 269031b15c1433ff39e30fa7ea3ab8f0be9d6ae2 upstream. Commit 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems") is too narrow. The effect being mitigated in that commit is caused by ZONE_DEVICE which PCI_P2PDMA has a dependency. ZONE_DEVICE, in general, lets any physical address be added to the direct-map. I.e. not only ACPI hotplug ranges, CXL Memory Windows, or EFI Specific Purpose Memory, but also any PCI MMIO range for the DEVICE_PRIVATE and PCI_P2PDMA cases. Update the mitigation, limit KASLR entropy, to apply in all ZONE_DEVICE=y cases. Distro kernels typically have PCI_P2PDMA=y, so the practical exposure of this problem is limited to the PCI_P2PDMA=n case. A potential path to recover entropy would be to walk ACPI and determine the limits for hotplug and PCI MMIO before kernel_randomize_memory(). On smaller systems that could yield some KASLR address bits. This needs additional investigation to determine if some limited ACPI table scanning can happen this early without an open coded solution like arch/x86/boot/compressed/acpi.c needs to deploy. Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems") Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Tested-by: Yasunori Goto <y-goto@fujitsu.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: http://patch.msgid.link/692e08b2516d4_261c1100a3@dwillia2-mobl4.notmuch Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>