path: root/arch/x86/kvm/svm/avic.c
Age | Commit message | Author
2026-03-19 | KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated | Sean Christopherson
[ Upstream commit 87d0f901a9bd8ae6be57249c737f20ac0cace93d ] Explicitly set/clear CR8 write interception when AVIC is (de)activated to fix a bug where KVM leaves the interception enabled after AVIC is activated. E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8 will remain intercepted in perpetuity. On its own, the dangling CR8 intercept is "just" a performance issue, but combined with the TPR sync bug fixed by commit d02e48830e3f ("KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the dangling intercept is fatal to Windows guests as the TPR seen by hardware gets wildly out of sync with reality. Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in KVM's world. I.e. there's no need to trigger update_cr8_intercept(), this is firmly an SVM implementation flaw/detail. WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should never enter the guest with AVIC enabled and CR8 writes intercepted. Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Cc: Naveen N Rao (AMD) <naveen@kernel.org> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260203190711.458413-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> [Squash fix to avic_deactivate_vmcb. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
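The invariant described in the commit above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct sketch_vmcb`, `SKETCH_INTERCEPT_CR8_WRITE` and the three helpers are hypothetical stand-ins, not the kernel's VMCB layout or its actual functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one bit of the VMCB intercept bitmap. */
#define SKETCH_INTERCEPT_CR8_WRITE (1u << 0)

/* Hypothetical, heavily reduced model of the per-vCPU control state. */
struct sketch_vmcb {
	uint32_t intercepts;
	bool avic_enabled;
};

/* Activating AVIC: explicitly clear the CR8 write intercept. */
static void sketch_avic_activate(struct sketch_vmcb *vmcb)
{
	vmcb->avic_enabled = true;
	vmcb->intercepts &= ~SKETCH_INTERCEPT_CR8_WRITE;
}

/* Deactivating AVIC: explicitly set the CR8 write intercept again. */
static void sketch_avic_deactivate(struct sketch_vmcb *vmcb)
{
	vmcb->avic_enabled = false;
	vmcb->intercepts |= SKETCH_INTERCEPT_CR8_WRITE;
}

/* Models the condition the new WARN guards: AVIC on plus CR8 intercepted. */
static bool sketch_state_is_sane(const struct sketch_vmcb *vmcb)
{
	return !(vmcb->avic_enabled &&
		 (vmcb->intercepts & SKETCH_INTERCEPT_CR8_WRITE));
}
```

Tying both directions of the toggle to the same pair of helpers is what keeps a stale intercept from surviving a deactivate/activate cycle in this model.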
2026-03-19 | KVM: SVM: Add a helper to look up the max physical ID for AVIC | Naveen N Rao
[ Upstream commit f2f6e67a56dc88fea7e9b10c4e79bb01d97386b7 ] To help with a future change, add a helper to look up the maximum physical ID depending on the vCPU AVIC mode. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/0ab9bf5e20a3463a4aa3a5ea9bbbac66beedf1d1.1757009416.git.naveen@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Stable-dep-of: 87d0f901a9bd ("KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-19 | KVM: SVM: Limit AVIC physical max index based on configured max_vcpu_ids | Naveen N Rao
[ Upstream commit 574ef752d4aea04134bc121294d717f4422c2755 ] KVM allows VMMs to specify the maximum possible APIC ID for a virtual machine through KVM_CAP_MAX_VCPU_ID capability so as to limit data structures related to APIC/x2APIC. Utilize the same to set the AVIC physical max index in the VMCB, similar to VMX. This helps hardware limit the number of entries to be scanned in the physical APIC ID table speeding up IPI broadcasts for virtual machines with smaller number of vCPUs. Unlike VMX, SVM AVIC requires a single page to be allocated for the Physical APIC ID table and the Logical APIC ID table, so retain the existing approach of allocating those during VM init. Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/adb07ccdb3394cd79cb372ba6bcc69a4e4d4ef54.1757009416.git.naveen@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Stable-dep-of: 87d0f901a9bd ("KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
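As a rough illustration of the clamping idea in the commit above, the sketch below picks the smaller of the VMM-configured bound and the hardware limit. The helper name and both parameters are hypothetical; the sketch assumes valid vCPU IDs are 0..max_vcpu_ids-1 and takes the per-mode hardware limit as an input rather than asserting a specific value.

```c
#include <stdint.h>

/*
 * Hypothetical helper: choose the highest Physical APIC ID table index
 * that hardware needs to scan.
 *
 * max_vcpu_ids: bound configured by the VMM (assumed: IDs are < this value)
 * hw_max_index: largest index supported by the active AVIC mode (assumed input)
 */
static uint32_t sketch_avic_max_index(uint32_t max_vcpu_ids, uint32_t hw_max_index)
{
	/* Guard against a zero bound so the subtraction cannot wrap. */
	uint32_t wanted = max_vcpu_ids ? max_vcpu_ids - 1 : 0;

	return wanted < hw_max_index ? wanted : hw_max_index;
}
```

For example, with four possible vCPU IDs the index is 3 no matter how large the hardware limit is, which is the case where fewer table entries need to be scanned.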
2026-03-19 | KVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APIC | Sean Christopherson
commit 3989a6d036c8ec82c0de3614bed23a1dacd45de5 upstream. Initialize all per-vCPU AVIC control fields in the VMCB if AVIC is enabled in KVM and the VM has an in-kernel local APIC, i.e. if it's _possible_ the vCPU could activate AVIC at any point in its lifecycle. Configuring the VMCB if and only if AVIC is active "works" purely because of optimizations in kvm_create_lapic() to speculatively set apicv_active if AVIC is enabled *and* to defer updates until the first KVM_RUN. In quotes because KVM likely won't do the right thing if kvm_apicv_activated() is false, i.e. if a vCPU is created while APICv is inhibited at the VM level for whatever reason. E.g. if the inhibit is *removed* before KVM_REQ_APICV_UPDATE is handled in KVM_RUN, then __kvm_vcpu_update_apicv() will elide calls to vendor code due to seeing "apicv_active == activate". Cleaning up the initialization code will also allow fixing a bug where KVM incorrectly leaves CR8 interception enabled when AVIC is activated without creating a mess with respect to whether AVIC is activated or not. Cc: stable@vger.kernel.org Fixes: 67034bb9dd5e ("KVM: SVM: Add irqchip_split() checks before enabling AVIC") Fixes: 6c3e4422dd20 ("svm: Add support for dynamic APICv") Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20260203190711.458413-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-04 | KVM: SVM: switch to raw spinlock for svm->ir_list_lock | Maxim Levitsky
Use a raw spinlock for vcpu_svm.ir_list_lock as the lock can be taken during schedule() via kvm_sched_out() => __avic_vcpu_put(), and "normal" spinlocks are sleepable locks when PREEMPT_RT=y.

This fixes the following lockdep warning:

  =============================
  [ BUG: Invalid wait context ]
  6.12.0-146.1640_2124176644.el10.x86_64+debug #1 Not tainted
  -----------------------------
  qemu-kvm/38299 is trying to lock:
  ff11000239725600 (&svm->ir_list_lock){....}-{3:3}, at: __avic_vcpu_put+0xfd/0x300 [kvm_amd]
  other info that might help us debug this:
  context-{5:5}
  2 locks held by qemu-kvm/38299:
   #0: ff11000239723ba8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x240/0xe00 [kvm]
   #1: ff11000b906056d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2e/0x130
  stack backtrace:
  CPU: 1 UID: 0 PID: 38299 Comm: qemu-kvm Kdump: loaded Not tainted 6.12.0-146.1640_2124176644.el10.x86_64+debug #1 PREEMPT(voluntary)
  Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ100AB 09/14/2023
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6f/0xb0
   __lock_acquire+0x921/0xb80
   lock_acquire.part.0+0xbe/0x270
   _raw_spin_lock_irqsave+0x46/0x90
   __avic_vcpu_put+0xfd/0x300 [kvm_amd]
   svm_vcpu_put+0xfa/0x130 [kvm_amd]
   kvm_arch_vcpu_put+0x48c/0x790 [kvm]
   kvm_sched_out+0x161/0x1c0 [kvm]
   prepare_task_switch+0x36b/0xf60
   __schedule+0x4f7/0x1890
   schedule+0xd4/0x260
   xfer_to_guest_mode_handle_work+0x54/0xc0
   vcpu_run+0x69a/0xa70 [kvm]
   kvm_arch_vcpu_ioctl_run+0xdc0/0x17e0 [kvm]
   kvm_vcpu_ioctl+0x39f/0xe00 [kvm]

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://patch.msgid.link/20251030194130.307900-1-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 | KVM: SVM: Make avic_ga_log_notifier() local to avic.c | Sean Christopherson
Make avic_ga_log_notifier() a local symbol now that it's defined and used purely within avic.c. No functional change intended. Fixes: 4bdec12aa8d6 ("KVM: SVM: Detect X2APIC virtualization (x2AVIC) support") Link: https://patch.msgid.link/20251016190643.80529-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 | KVM: SVM: Unregister KVM's GALog notifier on kvm-amd.ko exit | Sean Christopherson
Unregister the GALog notifier (used to get notified of wake events for blocking vCPUs) on kvm-amd.ko exit so that a KVM or IOMMU driver bug that results in a spurious GALog event "only" results in a spurious IRQ, and doesn't trigger a use-after-free due to executing unloaded module code. Fixes: 5881f73757cc ("svm: Introduce AMD IOMMU avic_ga_log_notifier") Reported-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Closes: https://lore.kernel.org/all/20250918130320.GA119526@k08j02272.eu95sqa Link: https://patch.msgid.link/20251016190643.80529-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 | KVM: SVM: Enable AVIC by default for Zen4+ if x2AVIC is supported | Naveen N Rao
AVIC and x2AVIC are fully functional since Zen 4, with no known hardware errata. Enable AVIC and x2AVIC by default on Zen4+ so long as x2AVIC is supported (to avoid enabling partial support for APIC virtualization by default). Internally, convert "avic" to an integer so that KVM can identify if the user has asked to explicitly enable or disable AVIC, i.e. so that KVM doesn't override an explicit 'y' from the user. Arbitrarily use -1 to denote auto-mode, and accept the string "auto" for the module param in addition to standard boolean values, i.e. continue to allow the user to configure the "avic" module parameter to explicitly enable/disable AVIC. To again maintain backward compatibility with a standard boolean param, set KERNEL_PARAM_OPS_FL_NOARG, which tells the params infrastructure to allow empty values for %true, i.e. to interpret a bare "avic" as "avic=y". Take care to check for a NULL @val when looking for "auto"! Lastly, always print "avic" as a boolean, since auto-mode is resolved during module initialization, i.e. the user should never see "auto" in sysfs. Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250919215934.1590410-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
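The tri-state parsing described in the commit above might look roughly like the following userspace sketch. Everything here is an assumption for illustration: `sketch_parse_avic()` and `SKETCH_AVIC_AUTO` are hypothetical names, and the boolean handling is a simplified subset rather than the kernel's own parsing helpers.

```c
#include <stddef.h>
#include <string.h>

/* Arbitrary marker for auto-mode, matching the -1 convention in the text. */
#define SKETCH_AVIC_AUTO (-1)

/*
 * Hypothetical parser. On success returns 0 and stores -1 (auto), 0 (off)
 * or 1 (on) in *out. Returns -1 for input it does not recognize.
 */
static int sketch_parse_avic(const char *val, int *out)
{
	/* A bare "avic" reaches the parser as NULL and means "avic=y". */
	if (!val) {
		*out = 1;
		return 0;
	}

	/* Only compare against "auto" after the NULL check above. */
	if (strcmp(val, "auto") == 0) {
		*out = SKETCH_AVIC_AUTO;
		return 0;
	}

	/* Simplified boolean spellings; real parsers accept more variants. */
	if (strcmp(val, "y") == 0 || strcmp(val, "1") == 0) {
		*out = 1;
		return 0;
	}
	if (strcmp(val, "n") == 0 || strcmp(val, "0") == 0) {
		*out = 0;
		return 0;
	}

	return -1;
}
```

Keeping auto-mode as a distinct value lets later setup code tell an explicit user choice apart from the default, which is the property the commit relies on.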
2025-09-23 | KVM: SVM: Move global "avic" variable to avic.c | Sean Christopherson
Move "avic" to avic.c so that it's colocated with the other AVIC specific globals and module params, and so that avic_hardware_setup() is a bit more self-contained, e.g. similar to sev_hardware_setup(). Deliberately set enable_apicv in svm.c as it's already globally visible (defined by kvm.ko, not by kvm-amd.ko), and to clearly capture the dependency on enable_apicv being initialized (svm_hardware_setup() clears several AVIC-specific hooks when enable_apicv is disabled). Alternatively, clearing of the hooks (and enable_ipiv) could be moved to avic_hardware_setup(), but that's not obviously better, e.g. it's helpful to isolate the setting of enable_apicv when reading code from the generic x86 side of the world. No functional change intended. Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 | KVM: SVM: Don't advise the user to do force_avic=y (when x2AVIC is detected) | Sean Christopherson
Don't advise the end user to try to force enable AVIC when x2AVIC is reported as supported in CPUID, as forcefully enabling AVIC isn't something that should be done lightly. E.g. some Zen4 client systems hide AVIC but leave x2AVIC behind, and while such a configuration is indeed due to buggy firmware in the sense that reporting x2AVIC without AVIC is nonsensical, KVM has no idea _why_ firmware disabled AVIC in the first place. Suggesting that the user try to run with force_avic=y is sketchy even if the user explicitly tries to enable AVIC, and will be downright irresponsible once KVM starts enabling AVIC by default. Alternatively, KVM could print the message only when the user explicitly asks for AVIC, but running with force_avic=y isn't something that should be encouraged for random users. force_avic is a useful knob for developers and perhaps even advanced users, but isn't something that KVM should advertise broadly. Opportunistically append a newline to the pr_warn() so that it prints out immediately, and tweak the message to say that AVIC is unsupported instead of disabled (disabled suggests that the kernel/KVM is somehow responsible). Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 | KVM: SVM: Always print "AVIC enabled" separately, even when force enabled | Sean Christopherson
Print the customary "AVIC enabled" informational message even when AVIC is force enabled on a system that doesn't advertise support for AVIC in CPUID, as not printing the standard message can confuse users and tools. Opportunistically clean up the scary message when AVIC is force enabled, but keep it as a separate message so that it is printed at level "warn", versus the standard message only being printed for level "info". Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 | KVM: SVM: Update "APICv in x2APIC without x2AVIC" in avic.c, not svm.c | Sean Christopherson
Set the "allow_apicv_in_x2apic_without_x2apic_virtualization" flag as part of avic_hardware_setup() instead of handling in svm_hardware_setup(), and make x2avic_enabled local to avic.c (setting the flag was the only use in svm.c). Tag avic_hardware_setup() with __init as necessary (it should have been tagged __init long ago). No functional change intended (aside from the side effects of tagging avic_hardware_setup() with __init). Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 | KVM: SVM: Move x2AVIC MSR interception helper to avic.c | Sean Christopherson
Move svm_set_x2apic_msr_interception() to avic.c as it's only relevant when x2AVIC is enabled/supported and only called by AVIC code. In addition to scoping AVIC code to avic.c, this will allow burying the global x2avic_enabled variable in avic. Opportunistically rename the helper to explicitly scope it to "avic". No functional change intended. Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Generate GA log IRQs only if the associated vCPU is blocking | Sean Christopherson
Configure IRTEs to GA log interrupts for device posted IRQs that hit non-running vCPUs if and only if the target vCPU is blocking, i.e. actually needs a wake event. If the vCPU has exited to userspace or was preempted, generating GA log entries and interrupts is wasteful and unnecessary, as the vCPU will be re-loaded and/or scheduled back in irrespective of the GA log notification (avic_ga_log_notifier() is just a fancy wrapper for kvm_vcpu_wake_up()). Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to track whether or not the vCPU's associated IRTEs are configured to generate GA logs, but only set the synthetic bit in KVM's "cache", i.e. never set the should-be-zero bit in tables that are used by hardware. Use a synthetic bit instead of a dedicated boolean to minimize the odds of messing up the locking, i.e. so that all the existing rules that apply to avic_physical_id_entry for IS_RUNNING are reused verbatim for GA_LOG_INTR. Note, because KVM (by design) "puts" AVIC state in a "pre-blocking" phase, using kvm_vcpu_is_blocking() to track the need for notifications isn't a viable option. Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
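The "synthetic bit in KVM's cache" approach from the commit above can be sketched as follows. This is a simplified userspace model, not the kernel code: the struct, the helper names and the bit positions are all hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; not the architectural entry layout. */
#define SKETCH_IS_RUNNING  (1ull << 62)
#define SKETCH_GA_LOG_INTR (1ull << 61) /* synthetic, software-only bit */

struct sketch_vcpu {
	uint64_t cached_entry; /* private copy, may carry the synthetic bit */
	uint64_t *hw_entry;    /* slot in the table that hardware reads */
};

/* Update both copies, stripping the synthetic bit from the hardware view. */
static void sketch_write_entry(struct sketch_vcpu *v, uint64_t entry)
{
	v->cached_entry = entry;
	*v->hw_entry = entry & ~SKETCH_GA_LOG_INTR;
	assert(!(*v->hw_entry & SKETCH_GA_LOG_INTR));
}

/* "Put" the vCPU: clear IsRunning, request GA log IRQs only when blocking. */
static void sketch_vcpu_put(struct sketch_vcpu *v, bool is_blocking)
{
	uint64_t entry = v->cached_entry &
			 ~(SKETCH_IS_RUNNING | SKETCH_GA_LOG_INTR);

	if (is_blocking)
		entry |= SKETCH_GA_LOG_INTR;

	sketch_write_entry(v, entry);
}

/* Would IRTEs for this vCPU be configured to generate GA log interrupts? */
static bool sketch_wants_ga_log(const struct sketch_vcpu *v)
{
	return (v->cached_entry & SKETCH_GA_LOG_INTR) != 0;
}
```

Because the flag lives in the same word as the running state, any locking rule that already protects the cached entry covers the flag too, which is the design motivation the commit gives.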
2025-06-23 | iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts | Sean Christopherson
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or not an IRTE is configured to generate GA log interrupts. KVM only needs a notification if the target vCPU is blocking, so the vCPU can be awakened. If a vCPU is preempted or exits to userspace, KVM clears is_run, but will set the vCPU back to running when userspace does KVM_RUN and/or the vCPU task is scheduled back in, i.e. KVM doesn't need a notification. Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes from the KVM changes insofar as possible. Opportunistically swap the ordering of parameters for amd_iommu_update_ga() so that they match amd_iommu_activate_guest_mode(). Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as a non-cached field, but per AMD hardware architects, it's not cached and can be safely updated without an invalidation. Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Joao Martins <joao.m.martins@oracle.com> Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Consolidate IRTE update when toggling AVIC on/off | Sean Christopherson
Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into __avic_vcpu_{load,put}(), and add a param to the helpers to communicate whether or not AVIC is being toggled, i.e. if IRTE needs a "full" update, or just a quick update to set the CPU and IsRun. Link: https://lore.kernel.org/r/20250611224604.313496-61-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off | Sean Christopherson
Don't query a vCPU's blocking status when toggling AVIC on/off; barring KVM bugs, the vCPU can't be blocking when refreshing AVIC controls. And if there are KVM bugs, ensuring the vCPU and its associated IRTEs are in the correct state is desirable, i.e. well worth any overhead in a buggy scenario. Isolating the "real" load/put flows will allow moving the IOMMU IRTE (de)activation logic from avic_refresh_apicv_exec_ctrl() to avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's physical ID entry and its IRTEs in a common path, under a single critical section of ir_list_lock. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-60-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller | Sean Christopherson
Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in anticipation of moving the __avic_vcpu_{load,put}() calls into the critical section, and because having a one-off helper with a name that's easily confused with avic_pi_update_irte() is unnecessary. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-59-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata | Sean Christopherson
Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to find and kick vCPUs when a device posted interrupt serves as a wake event. Lookups on a vCPU index are O(fast) (not sure what xa_load() actually provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match its index. Unlike the Physical APIC Table, which is accessed by hardware when virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_ use APIC IDs to fill the Physical APIC Table, but KVM has free rein over the format/meaning of the GA tag. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-57-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
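Since hardware never interprets the GA tag, its layout is a software choice. The sketch below shows one possible way to pack a VM identifier and a vCPU index into a 32-bit tag and unpack them again. The field width and all names are hypothetical and are not taken from the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical split of the tag: low bits hold the vCPU index. */
#define SKETCH_VCPU_IDX_BITS 9u
#define SKETCH_VCPU_IDX_MASK ((1u << SKETCH_VCPU_IDX_BITS) - 1u)

/* Pack a VM identifier and a vCPU index into one opaque 32-bit tag. */
static uint32_t sketch_ga_tag(uint32_t vm_id, uint32_t vcpu_idx)
{
	return (vm_id << SKETCH_VCPU_IDX_BITS) |
	       (vcpu_idx & SKETCH_VCPU_IDX_MASK);
}

/* Recover the VM identifier from a tag. */
static uint32_t sketch_ga_tag_to_vm_id(uint32_t tag)
{
	return tag >> SKETCH_VCPU_IDX_BITS;
}

/* Recover the vCPU index, usable for a direct indexed lookup. */
static uint32_t sketch_ga_tag_to_vcpu_idx(uint32_t tag)
{
	return tag & SKETCH_VCPU_IDX_MASK;
}
```

Storing the index rather than the ID means the notifier side can go straight to an indexed lookup instead of searching for a matching ID.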
2025-06-23 | KVM: SVM: WARN if ir_list is non-empty at vCPU free | Sean Christopherson
Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is freed with ir_list entries, i.e. if KVM leaves a dangling IRTE. Initialize the per-vCPU interrupt remapping list and its lock even if AVIC is disabled so that the WARN doesn't hit false positives (and so that KVM doesn't need to call into AVIC code for a simple sanity check). Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails | Sean Christopherson
WARN if updating GA information for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata, and because returning an error that is always ignored is pointless. Link: https://lore.kernel.org/r/20250611224604.313496-50-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Process all IRTEs on affinity change even if one update fails | Sean Christopherson
When updating IRTE GA fields, keep processing all other IRTEs if an update fails, as not updating later entries risks making a bad situation worse. Link: https://lore.kernel.org/r/20250611224604.313496-49-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: WARN if (de)activating guest mode in IOMMU fails | Sean Christopherson
WARN if (de)activating "guest mode" for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata. Link: https://lore.kernel.org/r/20250611224604.313496-48-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Don't check for assigned device(s) when activating AVIC | Sean Christopherson
Don't short-circuit IRTE updating when (de)activating AVIC based on the VM having assigned devices, as nothing prevents AVIC (de)activation from racing with device (de)assignment. And from a performance perspective, bailing early when there is no assigned device doesn't add much, as ir_list_lock will never be contended if there's no assigned device. Link: https://lore.kernel.org/r/20250611224604.313496-47-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Don't check for assigned device(s) when updating affinity | Sean Christopherson
Don't bother checking if a VM has an assigned device when updating AVIC vCPU affinity, querying ir_list is just as cheap and nothing prevents racing with changes in device assignment. Link: https://lore.kernel.org/r/20250611224604.313496-46-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited | Sean Christopherson
If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave the actual IRTE in remapped mode. KVM already handles the case where AVIC is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()), but doesn't handle the scenario where a postable IRQ comes along while AVIC is inhibited. Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity | Sean Christopherson
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that avic_physical_id_entry can be safely accessed, set the pCPU info straight-away when setting vCPU affinity. Putting the IRTE into posted mode, and then immediately updating the IRTE a second time if the target vCPU is running is wasteful and confusing. This also fixes a flaw where a posted IRQ that arrives between putting the IRTE into guest_mode and setting the correct destination could cause the IOMMU to ring the doorbell on the wrong pCPU. Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination | Sean Christopherson
Infer whether or not a vCPU should be marked running from the validity of the pCPU on which it is running. amd_iommu_update_ga() already skips the IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an invalid pCPU would be a blatant and egregious KVM bug. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
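A minimal model of "infer IsRun from the pCPU" is shown below, assuming a negative CPU number stands for "no valid pCPU". The struct and function are hypothetical and only illustrate why a separate is_run argument becomes redundant.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of the guest-mode fields of an IRTE. */
struct sketch_irte_ga {
	int dest_cpu; /* pCPU whose doorbell would be rung */
	bool is_run;  /* whether the target vCPU is considered running */
};

/*
 * Hypothetical update helper. In this sketch cpu < 0 means "invalid pCPU",
 * so the running state follows directly from the destination.
 */
static void sketch_update_ga(struct sketch_irte_ga *irte, int cpu)
{
	if (cpu >= 0) {
		irte->dest_cpu = cpu;
		irte->is_run = true;
	} else {
		/* No valid destination: leave dest_cpu alone, mark not running. */
		irte->is_run = false;
	}
}
```

With this shape a caller cannot express the contradictory combination of "running" and "no valid pCPU" at all.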
2025-06-23 | KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU | Sean Christopherson
Now that svm_ir_list_add() isn't overloaded with all manner of weird things, fold it into avic_pi_update_irte(), and more importantly take ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info that's shoved into the IRTE is fresh. While preemption (and IRQs) is disabled on the task performing the IRTE update, thanks to irqfds.lock, that task doesn't hold the vCPU's mutex, i.e. preemption being disabled is irrelevant. Link: https://lore.kernel.org/r/20250611224604.313496-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata | Sean Christopherson
Revert the IRTE back to remapping mode if the AMD IOMMU driver mucks up and doesn't provide the necessary metadata. Returning an error up the stack without actually handling the error is useless and confusing. Link: https://lore.kernel.org/r/20250611224604.313496-39-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: x86: Track irq_bypass_vcpu in common x86 code | Sean Christopherson
Track the vCPU that is being targeted for IRQ bypass, a.k.a. for a posted IRQ, in common x86 code. This will allow for additional consolidation of the SVM and VMX code. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-36-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs | Sean Christopherson
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU drivers into vendor specific structures. Overloading a single structure makes the code hard to read and maintain, is *very* misleading as it suggests that mixing vendors is actually supported, and bastardizing Intel's posted interrupt descriptor address when AMD's IOMMU already has its own structure is quite unnecessary. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Clean up return handling in avic_pi_update_irte() | Sean Christopherson
Clean up the return paths for avic_pi_update_irte() now that the refactoring dust has settled. Opportunistically drop the pr_err() on IRTE update failures. Logging that a failure occurred without _any_ context is quite useless. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-32-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: x86: Move posted interrupt tracepoint to common code | Sean Christopherson
Move the pi_irte_update tracepoint to common x86, and call it whenever the IRTE is modified. Tracing only the modifications that result in an IRQ being posted to a vCPU makes the tracepoint useless for debugging. Drop the vendor specific address; plumbing that into common code isn't worth the trouble, as the address is meaningless without a whole pile of other information that isn't provided in any tracepoint. Link: https://lore.kernel.org/r/20250611224604.313496-31-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU | Sean Christopherson
Hoist the logic for identifying the target vCPU for a posted interrupt into common x86. The code is functionally identical between Intel and AMD. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info() | Sean Christopherson
Genericize SVM's get_pi_vcpu_info() so that it can be shared with VMX. The only SVM specific information it provides is the AVIC back page, and that can be trivially retrieved by its sole caller. No functional change intended. Cc: Francesco Lavra <francescolavra.fl@gmail.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Stop walking list of routing table entries when updating IRTE | Sean Christopherson
Now that KVM explicitly passes the new/current GSI routing to pi_update_irte(), simply use the provided routing entry and stop walking the routing table to find that entry. KVM, via setup_routing_entry() and sanity checked by kvm_get_msi_route(), disallows having a GSI configured to trigger multiple MSIs. I.e. this is subtly a glorified nop, as KVM allows at most one MSI per GSI, the for-loop can only ever process one entry, and that entry is the new/current entry (see the WARN_ON_ONCE() added by "KVM: x86: Pass new routing entries and irqfd when updating IRTEs" to ensure @new matches the entry found in the routing table). Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode" | Sean Christopherson
Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a vCPU" now that there's no need to communicate information back to KVM about the previous vCPU (KVM does its own tracking). Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr | Sean Christopherson
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the GA root pointer. KVM is the only source of amd_iommu_pi_data.base, and KVM's one and only path for writing amd_iommu_pi_data.base computes the exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base, and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is valid, i.e. amd_iommu_pi_data.base is fully redundant. Cc: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking | Sean Christopherson
Add a comment to explain why KVM clears IsRunning when putting a vCPU, even though leaving IsRunning=1 would be ok from a functional perspective. Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so fast as to induce a 50%+ loss in performance. Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com Cc: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 | KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235 | Maxim Levitsky
Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR write emulation, and might fail to VM-Exit on the sending vCPU, if IsRunning was recently cleared. The absence of the VM-Exit leads to KVM not waking (or triggering nested VM-Exit of) the target vCPU(s) of the IPI, which can lead to hung vCPUs, unbounded delays in L2 execution, etc. To work around the erratum, simply disable IPI virtualization, which prevents KVM from setting IsRunning and thus eliminates the race where hardware sees a stale IsRunning=1. As a result, all ICR writes (except when "Self" shorthand is used) will VM-Exit and therefore be correctly emulated by KVM. Disabling IPI virtualization does carry a performance penalty, but benchmarking shows that enabling AVIC without IPI virtualization is still much better than not using AVIC at all, because AVIC still accelerates posted interrupts and the receiving end of the IPIs. Note, when virtualizing Self-IPIs, the CPU skips reading the physical ID table and updates the vIRR directly (because the vCPU is by definition actively running), i.e. Self-IPI isn't susceptible to the erratum *and* is still accelerated by hardware. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: rebase, massage changelog, disallow user override] Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Add enable_ipiv param, never set IsRunning if disabledMaxim Levitsky
Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv module param, by never setting IsRunning. SVM doesn't provide a way to disable IPI virtualization in hardware, but by ensuring CPUs never see IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a VM-Exit. To avoid setting the real IsRunning bit, while still allowing KVM to use each vCPU's entry to update GA log entries, simply maintain a shadow of the entry, without propagating IsRunning updates to the real table when IPI virtualization is disabled. Providing a way to effectively disable IPI virtualization will allow KVM to safely enable AVIC on hardware that is susceptible to erratum #1235, which causes hardware to sometimes fail to detect that the IsRunning bit has been cleared by software. Note, the table _must_ be fully populated, as broadcast IPIs skip invalid entries, i.e. won't generate VM-Exit if every entry is invalid, and so simply pointing the VMCB at a common dummy table won't work. Alternatively, KVM could allocate a shadow of the entire table, but that'd be a waste of 4KiB since the per-vCPU entry doesn't actually consume an additional 8 bytes of memory (vCPU structures are large enough that they are backed by order-N pages). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: keep "entry" variables, reuse enable_ipiv, split from erratum] Link: https://lore.kernel.org/r/20250611224604.313496-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
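The shadow-entry scheme described in the commit above can be illustrated with a minimal userspace sketch. This is not the kernel code: the `sketch_*` names, the `TABLE_ENTRIES` size, and the struct layout are hypothetical, and the bit positions (IsRunning at bit 62, Valid at bit 63) are assumed from the architectural entry format. The idea shown is that the real table entry is always populated and valid, while IsRunning is only propagated from the per-vCPU shadow when IPI virtualization is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed entry layout, for illustration only. */
#define ENTRY_IS_RUNNING (1ULL << 62)
#define ENTRY_VALID      (1ULL << 63)
#define TABLE_ENTRIES    512

/* 512 entries of 8 bytes each make up the 4KiB table the commit mentions. */
static_assert(TABLE_ENTRIES * sizeof(uint64_t) == 4096, "table is one 4KiB page");

/* Hypothetical per-vCPU state: only the 8-byte shadow is added. */
struct sketch_vcpu {
    unsigned int id;
    uint64_t shadow_entry;
};

/* Stand-in for the "real" table that hardware would read. */
static uint64_t physical_id_table[TABLE_ENTRIES];

/* Populate both the shadow and the real entry; the real entry must be valid. */
static void sketch_init_entry(struct sketch_vcpu *vcpu, unsigned int id,
                              uint64_t backing_page_pa)
{
    vcpu->id = id;
    vcpu->shadow_entry = backing_page_pa | ENTRY_VALID;
    physical_id_table[id] = vcpu->shadow_entry;
}

/*
 * Always track IsRunning in the shadow; write it through to the real
 * table only when IPI virtualization is enabled.
 */
static void sketch_set_running(struct sketch_vcpu *vcpu, bool running,
                               bool enable_ipiv)
{
    if (running)
        vcpu->shadow_entry |= ENTRY_IS_RUNNING;
    else
        vcpu->shadow_entry &= ~ENTRY_IS_RUNNING;

    if (enable_ipiv)
        physical_id_table[vcpu->id] = vcpu->shadow_entry;
}
```

With `enable_ipiv` false, the real entry stays valid but never shows IsRunning=1, which matches the commit's requirement that the table be fully populated.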
2025-06-20KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointerSean Christopherson
Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index the table directly. Caching a pointer address is completely unnecessary for performance, and while the field technically caches the result of the pointer calculation, it's all too easy to misinterpret the name and think that the field somehow caches the _data_ in the table. No functional change intended. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages"Sean Christopherson
Allocate and track AVIC's logical and physical tables as u32 and u64 pointers respectively, as managing the pages as "struct page" pointers adds an almost absurd amount of boilerplate and complexity. E.g. with page_address() out of the way, svm->avic_physical_id_cache becomes completely superfluous, and will be removed in a future cleanup. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creationSean Christopherson
Drop avic_get_physical_id_entry()'s compatibility check on the incoming ID, as its sole caller, avic_init_backing_page(), performs the exact same check. Drop avic_get_physical_id_entry() entirely as the only remaining functionality is getting the address of the Physical ID table, and accessing the array without an immediate bounds check is kludgy. Opportunistically add a compile-time assertion to ensure the vcpu_id can't result in a bounds overflow, e.g. if KVM (really) messed up a maximum physical ID #define, as well as run-time assertions so that a NULL pointer dereference is morphed into a safer WARN(). No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creationSean Christopherson
Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with an ID that is too big, but otherwise allow vCPU creation to succeed. Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger than 254 (AVIC) or 511 (x2AVIC). Alternatively, KVM could advertise an accurate value depending on which AVIC mode is in use, but that wouldn't really solve the underlying problem, e.g. would be a breaking change if KVM were to ever try and enable AVIC or x2AVIC by default. Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
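As a rough illustration of the behavior change described above, the following userspace sketch lets vCPU creation succeed for any ID within the advertised maximum and records an inhibit when the ID exceeds the AVIC or x2AVIC limit. All `sketch_*` and `SKETCH_*` names are hypothetical; only the limits (254, 511, 4095) come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Limits quoted in the commit message. */
#define SKETCH_AVIC_MAX_ID      254u
#define SKETCH_X2AVIC_MAX_ID    511u
#define SKETCH_KVM_MAX_VCPU_ID  4095u

static_assert(SKETCH_X2AVIC_MAX_ID <= SKETCH_KVM_MAX_VCPU_ID,
              "AVIC limits must not exceed the advertised max vCPU ID");

/* Hypothetical per-VM state for the sketch. */
struct sketch_vm {
    bool x2avic;              /* which AVIC mode is in use */
    bool inhibit_id_too_big;  /* models the "ID too big" inhibit */
};

/*
 * Returns 0 on success. IDs beyond the advertised maximum are still
 * rejected; IDs that are merely too big for AVIC only set the inhibit.
 */
static int sketch_create_vcpu(struct sketch_vm *vm, uint32_t vcpu_id)
{
    uint32_t max_id = vm->x2avic ? SKETCH_X2AVIC_MAX_ID : SKETCH_AVIC_MAX_ID;

    if (vcpu_id > SKETCH_KVM_MAX_VCPU_ID)
        return -EINVAL;

    if (vcpu_id > max_id)
        vm->inhibit_id_too_big = true;

    return 0;
}
```

The design point is that the advertised limit stays truthful: a too-large ID costs the VM its AVIC acceleration rather than failing creation.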
2025-06-20KVM: SVM: Drop vcpu_svm's pointless avic_backing_page fieldSean Christopherson
Drop vcpu_svm's avic_backing_page pointer and instead grab the physical address of KVM's vAPIC page directly from the source. Getting a physical address from a kernel virtual address is not an expensive operation, and getting the physical address from a struct page is *more* expensive for CONFIG_SPARSEMEM=y kernels. Regardless, none of the paths that consume the address are hot paths, i.e. shaving cycles is not a priority. Eliminating the "cache" means KVM doesn't have to worry about the cache being invalid, which will simplify a future fix when dealing with vCPU IDs that are too big. WARN if KVM attempts to allocate a vCPU's AVIC backing page without an in-kernel local APIC. avic_init_vcpu() bails early if the APIC is not in-kernel, and KVM disallows enabling an in-kernel APIC after vCPUs have been created, i.e. it should be impossible to reach avic_init_backing_page() without the vAPIC being allocated. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Add helper to deduplicate code for getting AVIC backing pageSean Christopherson
Add a helper to get the physical address of the AVIC backing page, both to deduplicate code and to prepare for getting the address directly from apic->regs, at which point it won't be all that obvious that the address in question is what SVM calls the AVIC backing page. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masksSean Christopherson
Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned maximum theoretical physical address for x86-64 CPUs, as x86-64 is currently defined (going beyond PA52 would require an entirely new paging mode, which would arguably create a new, different architecture). All usage in KVM masks the result of page_to_phys(), which on x86-64 is guaranteed to be 4KiB aligned and a legal physical address; if either of those requirements doesn't hold true, KVM has far bigger problems. Drop masking the avic_backing_page with AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but keep the macro even though it's unused in functional code. It's a distinct architectural define, and having the definition in software helps visualize the layout of an entry. And to be hyper-paranoid about MAXPA going beyond 52, add a compile-time assert to ensure the kernel's maximum supported physical address stays in bounds. The unnecessary masking in avic_init_vmcb() also incorrectly assumes that SME's C-bit resides between bits 51:11; that holds true for current CPUs, but isn't required by AMD's architecture: "In some implementations, the bit used may be a physical address bit". Key word being "may". Opportunistically use the GENMASK_ULL() version for AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable than a set of repeating Fs. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
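To make the masking argument concrete, here is a small userspace sketch. `SKETCH_GENMASK_ULL` is a stand-in written for this example, not the kernel's macro, and the 51:12 bit range for the backing page field is an assumption based on the description above. It shows that the GENMASK-style definition equals the "repeating Fs" form, and that masking a 4KiB-aligned address below bit 52 changes nothing.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for a GENMASK_ULL(h, l)-style helper: bits h..l set. */
#define SKETCH_GENMASK_ULL(h, l) \
    (((~0ULL) >> (63 - (h))) & ((~0ULL) << (l)))

/* Assumed field position of the backing page pointer within an entry. */
#define SKETCH_BACKING_PAGE_MASK  SKETCH_GENMASK_ULL(51, 12)

/* Hypothetical stand-in for the kernel's maximum physical address width. */
#define SKETCH_MAX_PHYSADDR_BITS  52

/* The readable form matches the "repeating Fs" form bit for bit. */
static_assert(SKETCH_BACKING_PAGE_MASK == (0xFFFFFFFFFFULL << 12),
              "GENMASK form must equal the hex form");

/* Paranoid check in the spirit of the commit: MAXPA must stay in bounds. */
static_assert(SKETCH_MAX_PHYSADDR_BITS <= 52,
              "physical addresses must fit below bit 52");

/* Apply the mask; a no-op for 4KiB-aligned, legal physical addresses. */
static uint64_t sketch_mask_backing_page(uint64_t pa)
{
    return pa & SKETCH_BACKING_PAGE_MASK;
}
```

Because a page's physical address already has its low 12 bits clear and no bits at or above 52, the mask only matters for inputs that would indicate a much larger bug elsewhere.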
2025-06-20KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BARSean Christopherson
Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum theoretical 4KiB-aligned physical address, i.e. is not novel in any way, and its only usage is to mask the default APIC base, which is 4KiB aligned and (obviously) a legal physical address. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>