kernel/arch/x86/kvm/vmx/vmx.h, branch linux-rolling-lts

KVM: nVMX: Prepare for enabling CET support for nested guest

2025-09-23T16:24:30Z

Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting to enable CET for nested VM. vmcs12 and vmcs02 needs to be synced when L2 exits to L1 or when L1 wants to resume L2, that way correct CET states can be observed by one another. Please note that consistency checks regarding CET state during VM-Entry will be added later to prevent this patch from becoming too large. Advertising the new CET VM_ENTRY/EXIT control bits are also be deferred until after the consistency checks are added. Signed-off-by: Yang Weijiang Tested-by: Mathias Krause Tested-by: John Allen Tested-by: Rick Edgecombe Signed-off-by: Chao Gao Reviewed-by: Xin Li (Intel) Tested-by: Xin Li (Intel) Link: https://lore.kernel.org/r/20250919223258.1604852-32-seanjc@google.com Signed-off-by: Sean Christopherson

KVM: x86: Enable CET virtualization for VMX and advertise to userspace

2025-09-23T16:22:32Z

Add support for the LOAD_CET_STATE VM-Enter and VM-Exit controls, the CET XFEATURE bits in XSS, and advertise support for IBT and SHSTK to userspace. Explicitly clear IBT and SHSTK onn SVM, as additional work is needed to enable CET on SVM, e.g. to context switch S_CET and other state. Disable KVM CET feature if unrestricted_guest is unsupported/disabled as KVM does not support emulating CET, as running without Unrestricted Guest can result in KVM emulating large swaths of guest code. While it's highly unlikely any guest will trigger emulation while also utilizing IBT or SHSTK, there's zero reason to allow CET without Unrestricted Guest as that combination should only be possible when explicitly disabling unrestricted_guest for testing purposes. Disable CET if VMX_BASIC[bit56] == 0, i.e. if hardware strictly enforces the presence of an Error Code based on exception vector, as attempting to inject a #CP with an Error Code (#CP architecturally has an Error Code) will fail due to the #CP vector historically not having an Error Code. Clear S_CET and SSP-related VMCS on "reset" to emulate the architectural of CET MSRs and SSP being reset to 0 after RESET, power-up and INIT. Note, KVM already clears guest CET state that is managed via XSTATE in kvm_xstate_reset(). Signed-off-by: Yang Weijiang Signed-off-by: Mathias Krause Tested-by: Mathias Krause Tested-by: John Allen Tested-by: Rick Edgecombe Signed-off-by: Chao Gao [sean: move some bits to separate patches, massage changelog] Reviewed-by: Binbin Wu Reviewed-by: Xiaoyao Li Link: https://lore.kernel.org/r/20250919223258.1604852-29-seanjc@google.com Signed-off-by: Sean Christopherson

KVM: VMX: Add helpers to toggle/change a bit in VMCS execution controls

2025-09-18T19:57:19Z

Expand the VMCS controls builder macros to generate helpers to change a bit to the desired value, and use the new helpers when toggling APICv related controls. No functional change intended. Suggested-by: Sean Christopherson Signed-off-by: Dapeng Mi Signed-off-by: Mingwei Zhang [sean: rewrite changelog] Tested-by: Xudong Hao Link: https://lore.kernel.org/r/20250806195706.1650976-27-seanjc@google.com Signed-off-by: Sean Christopherson

KVM: x86: Add support for RDMSR/WRMSRNS w/ immediate on Intel

2025-08-19T18:59:46Z

Add support for the immediate forms of RDMSR and WRMSRNS (currently Intel-only). The immediate variants are only valid in 64-bit mode, and use a single general purpose register for the data (the register is also encoded in the instruction, i.e. not implicit like regular RDMSR/WRMSR). The immediate variants are primarily motivated by performance, not code size: by having the MSR index in an immediate, it is available *much* earlier in the CPU pipeline, which allows hardware much more leeway about how a particular MSR is handled. Intel VMX support for the immediate forms of MSR accesses communicates exit information to the host as follows: 1) The immediate form of RDMSR uses VM-Exit Reason 84. 2) The immediate form of WRMSRNS uses VM-Exit Reason 85. 3) For both VM-Exit reasons 84 and 85, the Exit Qualification field is set to the MSR index that triggered the VM-Exit. 4) Bits 3 ~ 6 of the VM-Exit Instruction Information field are set to the register encoding used by the immediate form of the instruction, i.e. the destination register for RDMSR, and the source for WRMSRNS. 5) The VM-Exit Instruction Length field records the size of the immediate form of the MSR instruction. To deal with userspace RDMSR exits, stash the destination register in a new kvm_vcpu_arch field, similar to cui_linear_rip, pio, etc. Alternatively, the register could be saved in kvm_run.msr or re-retrieved from the VMCS, but the former would require sanitizing the value to ensure userspace doesn't clobber the value to an out-of-bounds index, and the latter would require a new one-off kvm_x86_ops hook. Don't bother adding support for the instructions in KVM's emulator, as the only way for RDMSR/WRMSR to be encountered is if KVM is emulating large swaths of code due to invalid guest state, and a vCPU cannot have invalid guest state while in 64-bit mode. Signed-off-by: Xin Li (Intel) [sean: minor tweaks, massage and expand changelog] Link: https://lore.kernel.org/r/20250805202224.1475590-5-seanjc@google.com Signed-off-by: Sean Christopherson

KVM: VMX: Add a macro to track which DEBUGCTL bits are host-owned

2025-07-09T16:30:52Z

Add VMX_HOST_OWNED_DEBUGCTL_BITS to track which bits are host-owned, i.e. need to be preserved when running the guest, to dedup the logic without having to incur a memory load to get at kvm_x86_ops.HOST_OWNED_DEBUGCTL. No functional change intended. Suggested-by: Maxim Levitsky Reviewed-by: Maxim Levitsky Link: https://lore.kernel.org/all/aF1yni8U6XNkyfRf@google.com Signed-off-by: Sean Christopherson

KVM: x86: Deduplicate MSR interception enabling and disabling

2025-06-24T22:42:12Z

Extract a common function from MSR interception disabling logic and create disabling and enabling functions based on it. This removes most of the duplicated code for MSR interception disabling/enabling. No functional change intended. Signed-off-by: Chao Gao Reviewed-by: Dapeng Mi Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com [sean: s/enable/set, inline the wrappers] Signed-off-by: Sean Christopherson

KVM: VMX: Manually recalc all MSR intercepts on userspace MSR filter change

2025-06-20T20:07:28Z

On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden. Given that KVM *must* be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps. Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate. Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local Cc: Borislav Petkov Reviewed-by: Chao Gao Reviewed-by: Xin Li (Intel) Reviewed-by: Dapeng Mi Reviewed-by: Binbin Wu Link: https://lore.kernel.org/r/20250610225737.156318-19-seanjc@google.com Signed-off-by: Sean Christopherson

KVM: x86: Move definition of X2APIC_MSR() to lapic.h

2025-06-20T20:07:28Z

Dedup the definition of X2APIC_MSR and put it in the local APIC code where it belongs. No functional change intended. Reviewed-by: Dapeng Mi Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com Signed-off-by: Sean Christopherson

KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest

2025-06-20T20:05:24Z

Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting while running the guest. When running with the "default treatment of SMIs" in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that is visible to host (non-SMM) software, and instead transitions directly from VMX non-root to SMM. And critically, DEBUGCTL isn't context switched by hardware on SMI or RSM, i.e. SMM will run with whatever value was resident in hardware at the time of the SMI. Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting events while the CPU is executing in SMM, which can pollute profiling and potentially leak information into the guest. Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner run loop, as the bit can be toggled in IRQ context via IPI callback (SMP function call), by way of /sys/devices/cpu/freeze_on_smi. Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs, i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and at worst could lead to undesirable behavior in the future if AMD CPUs ever happened to pick up a collision with the bit. Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module owns and controls GUEST_IA32_DEBUGCTL. WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the lack of handling isn't a KVM bug (TDX already WARNs on any run_flag). Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state(). Doing so avoids the need to track host_debugctl on a per-VMCS basis, as GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and load_vmcs12_host_state(). For the VM-Fail case, even though KVM won't have actually entered the guest, vcpu_enter_guest() will have run with vmcs02 active and thus could result in vmcs01 being run with a stale value. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky Co-developed-by: Sean Christopherson Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com Signed-off-by: Sean Christopherson

KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs

2025-06-20T20:05:24Z

Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state into the guest, and without needing to copy+paste the FREEZE_IN_SMM logic into every patch that accesses GUEST_IA32_DEBUGCTL. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky [sean: massage changelog, make inline, use in all prepare_vmcs02() cases] Reviewed-by: Dapeng Mi Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com Signed-off-by: Sean Christopherson