| Age | Commit message (Collapse) | Author |
|
commit 87d0f901a9bd8ae6be57249c737f20ac0cace93d upstream.
Explicitly set/clear CR8 write interception when AVIC is (de)activated to
fix a bug where KVM leaves the interception enabled after AVIC is
activated. E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8
will remain intercepted in perpetuity.
On its own, the dangling CR8 intercept is "just" a performance issue, but
combined with the TPR sync bug fixed by commit d02e48830e3f ("KVM: SVM:
Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the danging
intercept is fatal to Windows guests as the TPR seen by hardware gets
wildly out of sync with reality.
Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored
when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in
KVM's world. I.e. there's no need to trigger update_cr8_intercept(), this
is firmly an SVM implementation flaw/detail.
WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should
never enter the guest with AVIC enabled and CR8 writes intercepted.
Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Cc: Naveen N Rao (AMD) <naveen@kernel.org>
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Squash fix to avic_deactivate_vmcb. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3989a6d036c8ec82c0de3614bed23a1dacd45de5 upstream.
Initialize all per-vCPU AVIC control fields in the VMCB if AVIC is enabled
in KVM and the VM has an in-kernel local APIC, i.e. if it's _possible_ the
vCPU could activate AVIC at any point in its lifecycle. Configuring the
VMCB if and only if AVIC is active "works" purely because of optimizations
in kvm_create_lapic() to speculatively set apicv_active if AVIC is enabled
*and* to defer updates until the first KVM_RUN. In quotes because KVM
likely won't do the right thing if kvm_apicv_activated() is false, i.e. if
a vCPU is created while APICv is inhibited at the VM level for whatever
reason. E.g. if the inhibit is *removed* before KVM_REQ_APICV_UPDATE is
handled in KVM_RUN, then __kvm_vcpu_update_apicv() will elide calls to
vendor code due to seeing "apicv_active == activate".
Cleaning up the initialization code will also allow fixing a bug where KVM
incorrectly leaves CR8 interception enabled when AVIC is activated without
creating a mess with respect to whether AVIC is activated or not.
Cc: stable@vger.kernel.org
Fixes: 67034bb9dd5e ("KVM: SVM: Add irqchip_split() checks before enabling AVIC")
Fixes: 6c3e4422dd20 ("svm: Add support for dynamic APICv")
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When allocating the AVIC backing page, only check one of the max AVIC vs.
x2AVIC ID based on whether or not x2AVIC is enabled. Doing so fixes a bug
where KVM incorrectly inhibits AVIC if x2AVIC is _disabled_ and any vCPU
with a non-zero APIC ID is created, as x2avic_max_physical_id is left '0'
when x2AVIC is disabled.
Fixes: 940fc47cfb0d ("KVM: SVM: Add AVIC support for 4k vCPUs in x2AVIC mode")
Cc: stable@vger.kernel.org
Cc: Naveen N Rao (AMD) <naveen@kernel.org>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://patch.msgid.link/20260112232805.1512361-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM SVM changes for 6.19:
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts
during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
VMRUN and #VMEXIT.
- Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
interrupt if the guest patched the underlying code after the VM-Exit, e.g.
when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
userspace, and extend KVM "support" to all policy bits that don't require
any actual support from KVM.
|
|
Use a raw spinlock for vcpu_svm.ir_list_lock as the lock can be taken
during schedule() via kvm_sched_out() => __avic_vcpu_put(), and "normal"
spinlocks are sleepable locks when PREEMPT_RT=y.
This fixes the following lockdep warning:
=============================
[ BUG: Invalid wait context ]
6.12.0-146.1640_2124176644.el10.x86_64+debug #1 Not tainted
-----------------------------
qemu-kvm/38299 is trying to lock:
ff11000239725600 (&svm->ir_list_lock){....}-{3:3}, at: __avic_vcpu_put+0xfd/0x300 [kvm_amd]
other info that might help us debug this:
context-{5:5}
2 locks held by qemu-kvm/38299:
#0: ff11000239723ba8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x240/0xe00 [kvm]
#1: ff11000b906056d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2e/0x130
stack backtrace:
CPU: 1 UID: 0 PID: 38299 Comm: qemu-kvm Kdump: loaded Not tainted 6.12.0-146.1640_2124176644.el10.x86_64+debug #1 PREEMPT(voluntary)
Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ100AB 09/14/2023
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
__lock_acquire+0x921/0xb80
lock_acquire.part.0+0xbe/0x270
_raw_spin_lock_irqsave+0x46/0x90
__avic_vcpu_put+0xfd/0x300 [kvm_amd]
svm_vcpu_put+0xfa/0x130 [kvm_amd]
kvm_arch_vcpu_put+0x48c/0x790 [kvm]
kvm_sched_out+0x161/0x1c0 [kvm]
prepare_task_switch+0x36b/0xf60
__schedule+0x4f7/0x1890
schedule+0xd4/0x260
xfer_to_guest_mode_handle_work+0x54/0xc0
vcpu_run+0x69a/0xa70 [kvm]
kvm_arch_vcpu_ioctl_run+0xdc0/0x17e0 [kvm]
kvm_vcpu_ioctl+0x39f/0xe00 [kvm]
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://patch.msgid.link/20251030194130.307900-1-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Make amd_iommu_register_ga_log_notifier() a local symbol now that it's
defined and used purely within avic.c.
No functional change intended.
Fixes: 4bdec12aa8d6 ("KVM: SVM: Detect X2APIC virtualization (x2AVIC) support")
Link: https://patch.msgid.link/20251016190643.80529-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Unregister the GALog notifier (used to get notified of wake events for
blocking vCPUs) on kvm-amd.ko exit so that a KVM or IOMMU driver bug that
results in a spurious GALog event "only" results in a spurious IRQ, and
doesn't trigger a use-after-free due to executing unloaded module code.
Fixes: 5881f73757cc ("svm: Introduce AMD IOMMU avic_ga_log_notifier")
Reported-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Closes: https://lore.kernel.org/all/20250918130320.GA119526@k08j02272.eu95sqa
Link: https://patch.msgid.link/20251016190643.80529-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
With AVIC support for 4k vCPUs, the maximum supported physical ID in
x2AVIC mode is 4095. Since this is no longer fixed, introduce a variable
(x2avic_max_physical_id) to capture the maximum supported physical ID on
the current platform and use that in place of the existing macro
(X2AVIC_MAX_PHYSICAL_ID).
With AVIC support for 4k vCPUs, the AVIC Physical ID table is no
longer a single page and can occupy up to 8 contiguous 4k pages. Since
AVIC hardware accesses of the physical ID table are limited by the
physical max index programmed in the VMCB, it is sufficient to allocate
only as many pages as are required to have a physical table entry for
the max guest APIC ID. Since the guest APIC mode is not available at
this point, provision for the maximum possible x2AVIC ID. For this
purpose, add a variant of avic_get_max_physical_id() that works with a
NULL vCPU pointer and returns the max x2AVIC ID. Wrap this in a new
helper for obtaining the allocation order.
To make it easy to identify support for 4k vCPUs in x2AVIC mode, update
the message printed to the kernel log to print the maximum number of
vCPUs supported. Do this on all platforms supporting x2AVIC since it is
useful to know what is supported on a specific platform.
Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/7fc5962f6da028f7dd3c79dbbd5c574fa02c99dd.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
With support for 4k vCPUs in x2AVIC, the size of the AVIC Physical ID
table is expanded from a single 4k page to a maximum of 8 contiguous 4k
pages. The actual number of pages allocated depends on the maximum
possible APIC ID in the guest, which is only known by the time the first
vCPU is created. In preparation for supporting a dynamic AVIC Physical
ID table size, move its allocation to vcpu_precreate().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/7dc764e0af7f01440bbac3d9215ed174027c2384.1757009416.git.naveen@kernel.org
[sean: drop enable_apicv check from svm_vcpu_precreate()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The lower 9-bit field in EXITINFO2 represents an index into the AVIC
Physical/Logical APIC ID table for a AVIC_INCOMPLETE_IPI #VMEXIT. Since
the index into the Logical APIC ID table is just 8 bits, this field is
actually bound by the bit-width of the index into the AVIC Physical ID
table which is represented by AVIC_PHYSICAL_MAX_INDEX_MASK. So, use that
macro to mask EXITINFO2.Index instead of hard coding 0x1FF in
avic_incomplete_ipi_interception().
Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/95795f449c68bffcb3e1789ee2b0b7393711d37d.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
To help with a future change, add a helper to look up the maximum
physical ID depending on the vCPU AVIC mode. No functional change
intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/0ab9bf5e20a3463a4aa3a5ea9bbbac66beedf1d1.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM allows VMMs to specify the maximum possible APIC ID for a virtual
machine through KVM_CAP_MAX_VCPU_ID capability so as to limit data
structures related to APIC/x2APIC. Utilize the same to set the AVIC
physical max index in the VMCB, similar to VMX. This helps hardware
limit the number of entries to be scanned in the physical APIC ID table
speeding up IPI broadcasts for virtual machines with smaller number of
vCPUs.
Unlike VMX, SVM AVIC requires a single page to be allocated for the
Physical APIC ID table and the Logical APIC ID table, so retain the
existing approach of allocating those during VM init.
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/adb07ccdb3394cd79cb372ba6bcc69a4e4d4ef54.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
AVIC and x2AVIC are fully functional since Zen 4, with no known hardware
errata. Enable AVIC and x2AVIC by default on Zen4+ so long as x2AVIC is
supported (to avoid enabling partial support for APIC virtualization by
default).
Internally, convert "avic" to an integer so that KVM can identify if the
user has asked to explicitly enable or disable AVIC, i.e. so that KVM
doesn't override an explicit 'y' from the user. Arbitrarily use -1 to
denote auto-mode, and accept the string "auto" for the module param in
addition to standard boolean values, i.e. continue to allow the user to
configure the "avic" module parameter to explicitly enable/disable AVIC.
To again maintain backward compatibility with a standard boolean param,
set KERNEL_PARAM_OPS_FL_NOARG, which tells the params infrastructure to
allow empty values for %true, i.e. to interpret a bare "avic" as "avic=y".
Take care to check for a NULL @val when looking for "auto"!
Lastly, always print "avic" as a boolean, since auto-mode is resolved
during module initialization, i.e. the user should never see "auto" in
sysfs.
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250919215934.1590410-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move "avic" to avic.c so that it's colocated with the other AVIC specific
globals and module params, and so that avic_hardware_setup() is a bit more
self-contained, e.g. similar to sev_hardware_setup().
Deliberately set enable_apicv in svm.c as it's already globally visible
(defined by kvm.ko, not by kvm-amd.ko), and to clearly capture the
dependency on enable_apicv being initialized (svm_hardware_setup() clears
several AVIC-specific hooks when enable_apicv is disabled).
Alternatively, clearing of the hooks (and enable_ipiv) could be moved to
avic_hardware_setup(), but that's not obviously better, e.g. it's helpful
to isolate the setting of enable_apicv when reading code from the generic
x86 side of the world.
No functional change intended.
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't advise the end user to try to force enable AVIC when x2AVIC is
reported as supported in CPUID, as forcefully enabling AVIC isn't something
that should be done lightly. E.g. some Zen4 client systems hide AVIC but
leave x2AVIC behind, and while such a configuration is indeed due to buggy
firmware in the sense the reporting x2AVIC without AVIC is nonsensical,
KVM has no idea _why_ firmware disabled AVIC in the first place.
Suggesting that the user try to run with force_avic=y is sketchy even if
the user explicitly tries to enable AVIC, and will be downright
irresponsible once KVM starts enabling AVIC by default. Alternatively,
KVM could print the message only when the user explicitly asks for AVIC,
but running with force_avic=y isn't something that should be encouraged
for random users. force_avic is a useful knob for developers and perhaps
even advanced users, but isn't something that KVM should advertise broadly.
Opportunistically append a newline to the pr_warn() so that it prints out
immediately, and tweak the message to say that AVIC is unsupported instead
of disabled (disabled suggests that the kernel/KVM is somehow responsible).
Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Print the customary "AVIC enabled" informational message even when AVIC is
force enabled on a system that doesn't advertise supported for AVIC in
CPUID, as not printing the standard message can confuse users and tools.
Opportunistically clean up the scary message when AVIC is force enabled,
but keep it as separate message so that it is printed at level "warn",
versus the standard message only being printed for level "info".
Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set the "allow_apicv_in_x2apic_without_x2apic_virtualization" flag as part
of avic_hardware_setup() instead of handling in svm_hardware_setup(), and
make x2avic_enabled local to avic.c (setting the flag was the only use in
svm.c).
Tag avic_hardware_setup() with __init as necessary (it should have been
tagged __init long ago).
No functional change intended (aside from the side effects of tagging
avic_hardware_setup() with __init).
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move svm_set_x2apic_msr_interception() to avic.c as it's only relevant
when x2AVIC is enabled/supported and only called by AVIC code. In
addition to scoping AVIC code to avic.c, this will allow burying the
global x2avic_enabled variable in avic.
Opportunistically rename the helper to explicitly scope it to "avic".
No functional change intended.
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Configure IRTEs to GA log interrupts for device posted IRQs that hit
non-running vCPUs if and only if the target vCPU is blocking, i.e.
actually needs a wake event. If the vCPU has exited to userspace or was
preempted, generating GA log entries and interrupts is wasteful and
unnecessary, as the vCPU will be re-loaded and/or scheduled back in
irrespective of the GA log notification (avic_ga_log_notifier() is just a
fancy wrapper for kvm_vcpu_wake_up()).
Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to
track whether or not the vCPU's associated IRTEs are configured to
generate GA logs, but only set the synthetic bit in KVM's "cache", i.e.
never set the should-be-zero bit in tables that are used by hardware.
Use a synthetic bit instead of a dedicated boolean to minimize the odds
of messing up the locking, i.e. so that all the existing rules that apply
to avic_physical_id_entry for IS_RUNNING are reused verbatim for
GA_LOG_INTR.
Note, because KVM (by design) "puts" AVIC state in a "pre-blocking"
phase, using kvm_vcpu_is_blocking() to track the need for notifications
isn't a viable option.
Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts. KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.
Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.
Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that the match amd_iommu_activate_guest_mode().
Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as
a non-cached field, but per AMD hardware architects, it's not cached and
can be safely updated without an invalidation.
Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into
__avic_vcpu_{load,put}(), and add a param to the helpers to communicate
whether or not AVIC is being toggled, i.e. if IRTE needs a "full" update,
or just a quick update to set the CPU and IsRun.
Link: https://lore.kernel.org/r/20250611224604.313496-61-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't query a vCPU's blocking status when toggling AVIC on/off; barring
KVM bugs, the vCPU can't be blocking when refreshing AVIC controls. And
if there are KVM bugs, ensuring the vCPU and its associated IRTEs are in
the correct state is desirable, i.e. well worth any overhead in a buggy
scenario.
Isolating the "real" load/put flows will allow moving the IOMMU IRTE
(de)activation logic from avic_refresh_apicv_exec_ctrl() to
avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's
physical ID entry and its IRTEs in a common path, under a single critical
section of ir_list_lock.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-60-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in
anticipation of moving the __avic_vcpu_{load,put}() calls into the
critical section, and because having a one-off helper with a name that's
easily confused with avic_pi_update_irte() is unnecessary.
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-59-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to
find and kick vCPUs when a device posted interrupt serves as a wake event.
Lookups on a vCPU index are O(fast) (not sure what xa_load() actually
provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match
its index.
Unlike the Physical APIC Table, which is accessed by hardware when
virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_
use APIC IDs to fill the Physical APIC Table, but KVM has free rein over
the format/meaning of the GA tag.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-57-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is
freed with ir_list entries, i.e. if KVM leaves a dangling IRTE.
Initialize the per-vCPU interrupt remapping list and its lock even if AVIC
is disabled so that the WARN doesn't hit false positives (and so that KVM
doesn't need to call into AVIC code for a simple sanity check).
Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN if updating GA information for an IRTE entry fails as modifying an
IRTE should only fail if KVM is buggy, e.g. has stale metadata, and
because returning an error that is always ignored is pointless.
Link: https://lore.kernel.org/r/20250611224604.313496-50-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When updating IRTE GA fields, keep processing all other IRTEs if an update
fails, as not updating later entries risks making a bad situation worse.
Link: https://lore.kernel.org/r/20250611224604.313496-49-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN if (de)activating "guest mode" for an IRTE entry fails as modifying
an IRTE should only fail if KVM is buggy, e.g. has stale metadata.
Link: https://lore.kernel.org/r/20250611224604.313496-48-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't short-circuit IRTE updating when (de)activating AVIC based on the
VM having assigned devices, as nothing prevents AVIC (de)activation from
racing with device (de)assignment. And from a performance perspective,
bailing early when there is no assigned device doesn't add much, as
ir_list_lock will never be contended if there's no assigned device.
Link: https://lore.kernel.org/r/20250611224604.313496-47-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't bother checking if a VM has an assigned device when updating AVIC
vCPU affinity, querying ir_list is just as cheap and nothing prevents
racing with changes in device assignment.
Link: https://lore.kernel.org/r/20250611224604.313496-46-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
inhibited
If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the
vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave
the actual IRTE in remapped mode. KVM already handles the case where AVIC
is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()),
but doesn't handle the scenario where a postable IRQ comes along while AVIC
is inhibited.
Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity. Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.
This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.
Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Infer whether or not a vCPU should be marked running from the validity of
the pCPU on which it is running. amd_iommu_update_ga() already skips the
IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an
invalid pCPU would be a blatant and egregrious KVM bug.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that svm_ir_list_add() isn't overloaded with all manner of weird
things, fold it into avic_pi_update_irte(), and more importantly take
ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info
that's shoved into the IRTE is fresh. While preemption (and IRQs) is
disabled on the task performing the IRTE update, thanks to irqfds.lock,
that task doesn't hold the vCPU's mutex, i.e. preemption being disabled
is irrelevant.
Link: https://lore.kernel.org/r/20250611224604.313496-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Revert the IRTE back to remapping mode if the AMD IOMMU driver mucks up
and doesn't provide the necessary metadata. Returning an error up the
stack without actually handling the error is useless and confusing.
Link: https://lore.kernel.org/r/20250611224604.313496-39-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track the vCPU that is being targeted for IRQ bypass, a.k.a. for a posted
IRQ, in common x86 code. This will allow for additional consolidation of
the SVM and VMX code.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures. Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Clean up the return paths for avic_pi_update_irte() now that the
refactoring dust has settled.
Opportunistically drop the pr_err() on IRTE update failures. Logging that
a failure occurred without _any_ context is quite useless.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the pi_irte_update tracepoint to common x86, and call it whenever the
IRTE is modified. Tracing only the modifications that result in an IRQ
being posted to a vCPU makes the tracepoint useless for debugging.
Drop the vendor specific address; plumbing that into common code isn't
worth the trouble, as the address is meaningless without a whole pile of
other information that isn't provided in any tracepoint.
Link: https://lore.kernel.org/r/20250611224604.313496-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Hoist the logic for identifying the target vCPU for a posted interrupt
into common x86. The code is functionally identical between Intel and
AMD.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Genericize SVM's get_pi_vcpu_info() so that it can be shared with VMX.
The only SVM specific information it provides is the AVIC back page, and
that can be trivially retrieved by its sole caller.
No functional change intended.
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that KVM explicitly passes the new/current GSI routing to
pi_update_irte(), simply use the provided routing entry and stop walking
the routing table to find that entry. KVM, via setup_routing_entry() and
sanity checked by kvm_get_msi_route(), disallows having a GSI configured
to trigger multiple MSIs.
I.e. this is subtly a glorified nop, as KVM allows at most one MSI per
GSI, the for-loop can only ever process one entry, and that entry is the
new/current entry (see the WARN_ON_ONCE() added by "KVM: x86: Pass new
routing entries and irqfd when updating IRTEs" to ensure @new matches the
entry found in the routing table).
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a
vCPU" now that there's no need to communicate information back to KVM
about the previous vCPU (KVM does its own tracking).
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
GA root pointer. KVM is the only source of amd_iommu_pi_data.base, and
KVM's one and only path for writing amd_iommu_pi_data.base computes the
exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
valid, i.e. amd_iommu_pi_data.base is fully redundant.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a comment to explain why KVM clears IsRunning when putting a vCPU,
even though leaving IsRunning=1 would be ok from a functional perspective.
Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so
fast as to induce a 50%+ loss in performance.
Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as
hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR
write emulation, and might fail to VM-Exit on the sending vCPU, if
IsRunning was recently cleared.
The absence of the VM-Exit leads to KVM not waking (or triggering nested
VM-Exit of) the target vCPU(s) of the IPI, which can lead to hung vCPUs,
unbounded delays in L2 execution, etc.
To workaround the erratum, simply disable IPI virtualization, which
prevents KVM from setting IsRunning and thus eliminates the race where
hardware sees a stale IsRunning=1. As a result, all ICR writes (except
when "Self" shorthand is used) will VM-Exit and therefore be correctly
emulated by KVM.
Disabling IPI virtualization does carry a performance penalty, but
benchmarkng shows that enabling AVIC without IPI virtualization is still
much better than not using AVIC at all, because AVIC still accelerates
posted interrupts and the receiving end of the IPIs.
Note, when virtualizing Self-IPIs, the CPU skips reading the physical ID
table and updates the vIRR directly (because the vCPU is by definition
actively running), i.e. Self-IPI isn't susceptible to the erratum *and*
is still accelerated by hardware.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: rebase, massage changelog, disallow user override]
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv
module param, by never setting IsRunning. SVM doesn't provide a way to
disable IPI virtualization in hardware, but by ensuring CPUs never see
IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a
VM-Exit.
To avoid setting the real IsRunning bit, while still allowing KVM to use
each vCPU's entry to update GA log entries, simply maintain a shadow of
the entry, without propagating IsRunning updates to the real table when
IPI virtualization is disabled.
Providing a way to effectively disable IPI virtualization will allow KVM
to safely enable AVIC on hardware that is susceptible to erratum #1235,
which causes hardware to sometimes fail to detect that the IsRunning bit
has been cleared by software.
Note, the table _must_ be fully populated, as broadcast IPIs skip invalid
entries, i.e. won't generate VM-Exit if every entry is invalid, and so
simply pointing the VMCB at a common dummy table won't work.
Alternatively, KVM could allocate a shadow of the entire table, but that'd
be a waste of 4KiB since the per-vCPU entry doesn't actually consume an
additional 8 bytes of memory (vCPU structures are large enough that they
are backed by order-N pages).
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: keep "entry" variables, reuse enable_ipiv, split from erratum]
Link: https://lore.kernel.org/r/20250611224604.313496-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index
the table directly. Caching a pointer address is completely unnecessary
for performance, and while the field technically caches the result of the
pointer calculation, it's all too easy to misinterpret the name and think
that the field somehow caches the _data_ in the table.
No functional change intended.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allocate and track AVIC's logical and physical tables as u32 and u64
pointers respectively, as managing the pages as "struct page" pointers
adds an almost absurd amount of boilerplate and complexity. E.g. with
page_address() out of the way, svm->avic_physical_id_cache becomes
completely superfluous, and will be removed in a future cleanup.
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop avic_get_physical_id_entry()'s compatibility check on the incoming
ID, as its sole caller, avic_init_backing_page(), performs the exact same
check. Drop avic_get_physical_id_entry() entirely as the only remaining
functionality is getting the address of the Physical ID table, and
accessing the array without an immediate bounds check is kludgy.
Opportunistically add a compile-time assertion to ensure the vcpu_id can't
result in a bounds overflow, e.g. if KVM (really) messed up a maximum
physical ID #define, as well as run-time assertions so that a NULL pointer
dereference is morphed into a safer WARN().
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|