| Age | Commit message (Collapse) | Author |
|
[ Upstream commit 87d0f901a9bd8ae6be57249c737f20ac0cace93d ]
Explicitly set/clear CR8 write interception when AVIC is (de)activated to
fix a bug where KVM leaves the interception enabled after AVIC is
activated. E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8
will remain intercepted in perpetuity.
On its own, the dangling CR8 intercept is "just" a performance issue, but
combined with the TPR sync bug fixed by commit d02e48830e3f ("KVM: SVM:
Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the danging
intercept is fatal to Windows guests as the TPR seen by hardware gets
wildly out of sync with reality.
Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored
when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in
KVM's world. I.e. there's no need to trigger update_cr8_intercept(), this
is firmly an SVM implementation flaw/detail.
WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should
never enter the guest with AVIC enabled and CR8 writes intercepted.
Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Cc: Naveen N Rao (AMD) <naveen@kernel.org>
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Squash fix to avic_deactivate_vmcb. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f2f6e67a56dc88fea7e9b10c4e79bb01d97386b7 ]
To help with a future change, add a helper to look up the maximum
physical ID depending on the vCPU AVIC mode. No functional change
intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/0ab9bf5e20a3463a4aa3a5ea9bbbac66beedf1d1.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Stable-dep-of: 87d0f901a9bd ("KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 574ef752d4aea04134bc121294d717f4422c2755 ]
KVM allows VMMs to specify the maximum possible APIC ID for a virtual
machine through KVM_CAP_MAX_VCPU_ID capability so as to limit data
structures related to APIC/x2APIC. Utilize the same to set the AVIC
physical max index in the VMCB, similar to VMX. This helps hardware
limit the number of entries to be scanned in the physical APIC ID table
speeding up IPI broadcasts for virtual machines with smaller number of
vCPUs.
Unlike VMX, SVM AVIC requires a single page to be allocated for the
Physical APIC ID table and the Logical APIC ID table, so retain the
existing approach of allocating those during VM init.
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/adb07ccdb3394cd79cb372ba6bcc69a4e4d4ef54.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Stable-dep-of: 87d0f901a9bd ("KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3989a6d036c8ec82c0de3614bed23a1dacd45de5 upstream.
Initialize all per-vCPU AVIC control fields in the VMCB if AVIC is enabled
in KVM and the VM has an in-kernel local APIC, i.e. if it's _possible_ the
vCPU could activate AVIC at any point in its lifecycle. Configuring the
VMCB if and only if AVIC is active "works" purely because of optimizations
in kvm_create_lapic() to speculatively set apicv_active if AVIC is enabled
*and* to defer updates until the first KVM_RUN. In quotes because KVM
likely won't do the right thing if kvm_apicv_activated() is false, i.e. if
a vCPU is created while APICv is inhibited at the VM level for whatever
reason. E.g. if the inhibit is *removed* before KVM_REQ_APICV_UPDATE is
handled in KVM_RUN, then __kvm_vcpu_update_apicv() will elide calls to
vendor code due to seeing "apicv_active == activate".
Cleaning up the initialization code will also allow fixing a bug where KVM
incorrectly leaves CR8 interception enabled when AVIC is activated without
creating a mess with respect to whether AVIC is activated or not.
Cc: stable@vger.kernel.org
Fixes: 67034bb9dd5e ("KVM: SVM: Add irqchip_split() checks before enabling AVIC")
Fixes: 6c3e4422dd20 ("svm: Add support for dynamic APICv")
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Use a raw spinlock for vcpu_svm.ir_list_lock as the lock can be taken
during schedule() via kvm_sched_out() => __avic_vcpu_put(), and "normal"
spinlocks are sleepable locks when PREEMPT_RT=y.
This fixes the following lockdep warning:
=============================
[ BUG: Invalid wait context ]
6.12.0-146.1640_2124176644.el10.x86_64+debug #1 Not tainted
-----------------------------
qemu-kvm/38299 is trying to lock:
ff11000239725600 (&svm->ir_list_lock){....}-{3:3}, at: __avic_vcpu_put+0xfd/0x300 [kvm_amd]
other info that might help us debug this:
context-{5:5}
2 locks held by qemu-kvm/38299:
#0: ff11000239723ba8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x240/0xe00 [kvm]
#1: ff11000b906056d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2e/0x130
stack backtrace:
CPU: 1 UID: 0 PID: 38299 Comm: qemu-kvm Kdump: loaded Not tainted 6.12.0-146.1640_2124176644.el10.x86_64+debug #1 PREEMPT(voluntary)
Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ100AB 09/14/2023
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
__lock_acquire+0x921/0xb80
lock_acquire.part.0+0xbe/0x270
_raw_spin_lock_irqsave+0x46/0x90
__avic_vcpu_put+0xfd/0x300 [kvm_amd]
svm_vcpu_put+0xfa/0x130 [kvm_amd]
kvm_arch_vcpu_put+0x48c/0x790 [kvm]
kvm_sched_out+0x161/0x1c0 [kvm]
prepare_task_switch+0x36b/0xf60
__schedule+0x4f7/0x1890
schedule+0xd4/0x260
xfer_to_guest_mode_handle_work+0x54/0xc0
vcpu_run+0x69a/0xa70 [kvm]
kvm_arch_vcpu_ioctl_run+0xdc0/0x17e0 [kvm]
kvm_vcpu_ioctl+0x39f/0xe00 [kvm]
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://patch.msgid.link/20251030194130.307900-1-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Make amd_iommu_register_ga_log_notifier() a local symbol now that it's
defined and used purely within avic.c.
No functional change intended.
Fixes: 4bdec12aa8d6 ("KVM: SVM: Detect X2APIC virtualization (x2AVIC) support")
Link: https://patch.msgid.link/20251016190643.80529-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Unregister the GALog notifier (used to get notified of wake events for
blocking vCPUs) on kvm-amd.ko exit so that a KVM or IOMMU driver bug that
results in a spurious GALog event "only" results in a spurious IRQ, and
doesn't trigger a use-after-free due to executing unloaded module code.
Fixes: 5881f73757cc ("svm: Introduce AMD IOMMU avic_ga_log_notifier")
Reported-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Closes: https://lore.kernel.org/all/20250918130320.GA119526@k08j02272.eu95sqa
Link: https://patch.msgid.link/20251016190643.80529-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
AVIC and x2AVIC are fully functional since Zen 4, with no known hardware
errata. Enable AVIC and x2AVIC by default on Zen4+ so long as x2AVIC is
supported (to avoid enabling partial support for APIC virtualization by
default).
Internally, convert "avic" to an integer so that KVM can identify if the
user has asked to explicitly enable or disable AVIC, i.e. so that KVM
doesn't override an explicit 'y' from the user. Arbitrarily use -1 to
denote auto-mode, and accept the string "auto" for the module param in
addition to standard boolean values, i.e. continue to allow the user to
configure the "avic" module parameter to explicitly enable/disable AVIC.
To again maintain backward compatibility with a standard boolean param,
set KERNEL_PARAM_OPS_FL_NOARG, which tells the params infrastructure to
allow empty values for %true, i.e. to interpret a bare "avic" as "avic=y".
Take care to check for a NULL @val when looking for "auto"!
Lastly, always print "avic" as a boolean, since auto-mode is resolved
during module initialization, i.e. the user should never see "auto" in
sysfs.
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250919215934.1590410-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move "avic" to avic.c so that it's colocated with the other AVIC specific
globals and module params, and so that avic_hardware_setup() is a bit more
self-contained, e.g. similar to sev_hardware_setup().
Deliberately set enable_apicv in svm.c as it's already globally visible
(defined by kvm.ko, not by kvm-amd.ko), and to clearly capture the
dependency on enable_apicv being initialized (svm_hardware_setup() clears
several AVIC-specific hooks when enable_apicv is disabled).
Alternatively, clearing of the hooks (and enable_ipiv) could be moved to
avic_hardware_setup(), but that's not obviously better, e.g. it's helpful
to isolate the setting of enable_apicv when reading code from the generic
x86 side of the world.
No functional change intended.
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't advise the end user to try to force enable AVIC when x2AVIC is
reported as supported in CPUID, as forcefully enabling AVIC isn't something
that should be done lightly. E.g. some Zen4 client systems hide AVIC but
leave x2AVIC behind, and while such a configuration is indeed due to buggy
firmware in the sense the reporting x2AVIC without AVIC is nonsensical,
KVM has no idea _why_ firmware disabled AVIC in the first place.
Suggesting that the user try to run with force_avic=y is sketchy even if
the user explicitly tries to enable AVIC, and will be downright
irresponsible once KVM starts enabling AVIC by default. Alternatively,
KVM could print the message only when the user explicitly asks for AVIC,
but running with force_avic=y isn't something that should be encouraged
for random users. force_avic is a useful knob for developers and perhaps
even advanced users, but isn't something that KVM should advertise broadly.
Opportunistically append a newline to the pr_warn() so that it prints out
immediately, and tweak the message to say that AVIC is unsupported instead
of disabled (disabled suggests that the kernel/KVM is somehow responsible).
Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Print the customary "AVIC enabled" informational message even when AVIC is
force enabled on a system that doesn't advertise supported for AVIC in
CPUID, as not printing the standard message can confuse users and tools.
Opportunistically clean up the scary message when AVIC is force enabled,
but keep it as separate message so that it is printed at level "warn",
versus the standard message only being printed for level "info".
Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set the "allow_apicv_in_x2apic_without_x2apic_virtualization" flag as part
of avic_hardware_setup() instead of handling in svm_hardware_setup(), and
make x2avic_enabled local to avic.c (setting the flag was the only use in
svm.c).
Tag avic_hardware_setup() with __init as necessary (it should have been
tagged __init long ago).
No functional change intended (aside from the side effects of tagging
avic_hardware_setup() with __init).
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move svm_set_x2apic_msr_interception() to avic.c as it's only relevant
when x2AVIC is enabled/supported and only called by AVIC code. In
addition to scoping AVIC code to avic.c, this will allow burying the
global x2avic_enabled variable in avic.
Opportunistically rename the helper to explicitly scope it to "avic".
No functional change intended.
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250919215934.1590410-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Configure IRTEs to GA log interrupts for device posted IRQs that hit
non-running vCPUs if and only if the target vCPU is blocking, i.e.
actually needs a wake event. If the vCPU has exited to userspace or was
preempted, generating GA log entries and interrupts is wasteful and
unnecessary, as the vCPU will be re-loaded and/or scheduled back in
irrespective of the GA log notification (avic_ga_log_notifier() is just a
fancy wrapper for kvm_vcpu_wake_up()).
Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to
track whether or not the vCPU's associated IRTEs are configured to
generate GA logs, but only set the synthetic bit in KVM's "cache", i.e.
never set the should-be-zero bit in tables that are used by hardware.
Use a synthetic bit instead of a dedicated boolean to minimize the odds
of messing up the locking, i.e. so that all the existing rules that apply
to avic_physical_id_entry for IS_RUNNING are reused verbatim for
GA_LOG_INTR.
Note, because KVM (by design) "puts" AVIC state in a "pre-blocking"
phase, using kvm_vcpu_is_blocking() to track the need for notifications
isn't a viable option.
Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts. KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.
Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.
Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that the match amd_iommu_activate_guest_mode().
Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as
a non-cached field, but per AMD hardware architects, it's not cached and
can be safely updated without an invalidation.
Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into
__avic_vcpu_{load,put}(), and add a param to the helpers to communicate
whether or not AVIC is being toggled, i.e. if IRTE needs a "full" update,
or just a quick update to set the CPU and IsRun.
Link: https://lore.kernel.org/r/20250611224604.313496-61-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't query a vCPU's blocking status when toggling AVIC on/off; barring
KVM bugs, the vCPU can't be blocking when refreshing AVIC controls. And
if there are KVM bugs, ensuring the vCPU and its associated IRTEs are in
the correct state is desirable, i.e. well worth any overhead in a buggy
scenario.
Isolating the "real" load/put flows will allow moving the IOMMU IRTE
(de)activation logic from avic_refresh_apicv_exec_ctrl() to
avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's
physical ID entry and its IRTEs in a common path, under a single critical
section of ir_list_lock.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-60-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in
anticipation of moving the __avic_vcpu_{load,put}() calls into the
critical section, and because having a one-off helper with a name that's
easily confused with avic_pi_update_irte() is unnecessary.
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-59-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to
find and kick vCPUs when a device posted interrupt serves as a wake event.
Lookups on a vCPU index are O(fast) (not sure what xa_load() actually
provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match
its index.
Unlike the Physical APIC Table, which is accessed by hardware when
virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_
use APIC IDs to fill the Physical APIC Table, but KVM has free rein over
the format/meaning of the GA tag.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-57-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is
freed with ir_list entries, i.e. if KVM leaves a dangling IRTE.
Initialize the per-vCPU interrupt remapping list and its lock even if AVIC
is disabled so that the WARN doesn't hit false positives (and so that KVM
doesn't need to call into AVIC code for a simple sanity check).
Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN if updating GA information for an IRTE entry fails as modifying an
IRTE should only fail if KVM is buggy, e.g. has stale metadata, and
because returning an error that is always ignored is pointless.
Link: https://lore.kernel.org/r/20250611224604.313496-50-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When updating IRTE GA fields, keep processing all other IRTEs if an update
fails, as not updating later entries risks making a bad situation worse.
Link: https://lore.kernel.org/r/20250611224604.313496-49-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN if (de)activating "guest mode" for an IRTE entry fails as modifying
an IRTE should only fail if KVM is buggy, e.g. has stale metadata.
Link: https://lore.kernel.org/r/20250611224604.313496-48-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't short-circuit IRTE updating when (de)activating AVIC based on the
VM having assigned devices, as nothing prevents AVIC (de)activation from
racing with device (de)assignment. And from a performance perspective,
bailing early when there is no assigned device doesn't add much, as
ir_list_lock will never be contended if there's no assigned device.
Link: https://lore.kernel.org/r/20250611224604.313496-47-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't bother checking if a VM has an assigned device when updating AVIC
vCPU affinity, querying ir_list is just as cheap and nothing prevents
racing with changes in device assignment.
Link: https://lore.kernel.org/r/20250611224604.313496-46-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
inhibited
If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the
vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave
the actual IRTE in remapped mode. KVM already handles the case where AVIC
is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()),
but doesn't handle the scenario where a postable IRQ comes along while AVIC
is inhibited.
Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity. Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.
This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.
Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Infer whether or not a vCPU should be marked running from the validity of
the pCPU on which it is running. amd_iommu_update_ga() already skips the
IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an
invalid pCPU would be a blatant and egregrious KVM bug.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that svm_ir_list_add() isn't overloaded with all manner of weird
things, fold it into avic_pi_update_irte(), and more importantly take
ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info
that's shoved into the IRTE is fresh. While preemption (and IRQs) is
disabled on the task performing the IRTE update, thanks to irqfds.lock,
that task doesn't hold the vCPU's mutex, i.e. preemption being disabled
is irrelevant.
Link: https://lore.kernel.org/r/20250611224604.313496-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Revert the IRTE back to remapping mode if the AMD IOMMU driver mucks up
and doesn't provide the necessary metadata. Returning an error up the
stack without actually handling the error is useless and confusing.
Link: https://lore.kernel.org/r/20250611224604.313496-39-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track the vCPU that is being targeted for IRQ bypass, a.k.a. for a posted
IRQ, in common x86 code. This will allow for additional consolidation of
the SVM and VMX code.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures. Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Clean up the return paths for avic_pi_update_irte() now that the
refactoring dust has settled.
Opportunistically drop the pr_err() on IRTE update failures. Logging that
a failure occurred without _any_ context is quite useless.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the pi_irte_update tracepoint to common x86, and call it whenever the
IRTE is modified. Tracing only the modifications that result in an IRQ
being posted to a vCPU makes the tracepoint useless for debugging.
Drop the vendor specific address; plumbing that into common code isn't
worth the trouble, as the address is meaningless without a whole pile of
other information that isn't provided in any tracepoint.
Link: https://lore.kernel.org/r/20250611224604.313496-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Hoist the logic for identifying the target vCPU for a posted interrupt
into common x86. The code is functionally identical between Intel and
AMD.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Genericize SVM's get_pi_vcpu_info() so that it can be shared with VMX.
The only SVM specific information it provides is the AVIC back page, and
that can be trivially retrieved by its sole caller.
No functional change intended.
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that KVM explicitly passes the new/current GSI routing to
pi_update_irte(), simply use the provided routing entry and stop walking
the routing table to find that entry. KVM, via setup_routing_entry() and
sanity checked by kvm_get_msi_route(), disallows having a GSI configured
to trigger multiple MSIs.
I.e. this is subtly a glorified nop, as KVM allows at most one MSI per
GSI, the for-loop can only ever process one entry, and that entry is the
new/current entry (see the WARN_ON_ONCE() added by "KVM: x86: Pass new
routing entries and irqfd when updating IRTEs" to ensure @new matches the
entry found in the routing table).
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a
vCPU" now that there's no need to communicate information back to KVM
about the previous vCPU (KVM does its own tracking).
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
GA root pointer. KVM is the only source of amd_iommu_pi_data.base, and
KVM's one and only path for writing amd_iommu_pi_data.base computes the
exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
valid, i.e. amd_iommu_pi_data.base is fully redundant.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a comment to explain why KVM clears IsRunning when putting a vCPU,
even though leaving IsRunning=1 would be ok from a functional perspective.
Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so
fast as to induce a 50%+ loss in performance.
Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as
hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR
write emulation, and might fail to VM-Exit on the sending vCPU, if
IsRunning was recently cleared.
The absence of the VM-Exit leads to KVM not waking (or triggering nested
VM-Exit of) the target vCPU(s) of the IPI, which can lead to hung vCPUs,
unbounded delays in L2 execution, etc.
To workaround the erratum, simply disable IPI virtualization, which
prevents KVM from setting IsRunning and thus eliminates the race where
hardware sees a stale IsRunning=1. As a result, all ICR writes (except
when "Self" shorthand is used) will VM-Exit and therefore be correctly
emulated by KVM.
Disabling IPI virtualization does carry a performance penalty, but
benchmarkng shows that enabling AVIC without IPI virtualization is still
much better than not using AVIC at all, because AVIC still accelerates
posted interrupts and the receiving end of the IPIs.
Note, when virtualizing Self-IPIs, the CPU skips reading the physical ID
table and updates the vIRR directly (because the vCPU is by definition
actively running), i.e. Self-IPI isn't susceptible to the erratum *and*
is still accelerated by hardware.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: rebase, massage changelog, disallow user override]
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv
module param, by never setting IsRunning. SVM doesn't provide a way to
disable IPI virtualization in hardware, but by ensuring CPUs never see
IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a
VM-Exit.
To avoid setting the real IsRunning bit, while still allowing KVM to use
each vCPU's entry to update GA log entries, simply maintain a shadow of
the entry, without propagating IsRunning updates to the real table when
IPI virtualization is disabled.
Providing a way to effectively disable IPI virtualization will allow KVM
to safely enable AVIC on hardware that is susceptible to erratum #1235,
which causes hardware to sometimes fail to detect that the IsRunning bit
has been cleared by software.
Note, the table _must_ be fully populated, as broadcast IPIs skip invalid
entries, i.e. won't generate VM-Exit if every entry is invalid, and so
simply pointing the VMCB at a common dummy table won't work.
Alternatively, KVM could allocate a shadow of the entire table, but that'd
be a waste of 4KiB since the per-vCPU entry doesn't actually consume an
additional 8 bytes of memory (vCPU structures are large enough that they
are backed by order-N pages).
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: keep "entry" variables, reuse enable_ipiv, split from erratum]
Link: https://lore.kernel.org/r/20250611224604.313496-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index
the table directly. Caching a pointer address is completely unnecessary
for performance, and while the field technically caches the result of the
pointer calculation, it's all too easy to misinterpret the name and think
that the field somehow caches the _data_ in the table.
No functional change intended.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allocate and track AVIC's logical and physical tables as u32 and u64
pointers respectively, as managing the pages as "struct page" pointers
adds an almost absurd amount of boilerplate and complexity. E.g. with
page_address() out of the way, svm->avic_physical_id_cache becomes
completely superfluous, and will be removed in a future cleanup.
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop avic_get_physical_id_entry()'s compatibility check on the incoming
ID, as its sole caller, avic_init_backing_page(), performs the exact same
check. Drop avic_get_physical_id_entry() entirely as the only remaining
functionality is getting the address of the Physical ID table, and
accessing the array without an immediate bounds check is kludgy.
Opportunistically add a compile-time assertion to ensure the vcpu_id can't
result in a bounds overflow, e.g. if KVM (really) messed up a maximum
physical ID #define, as well as run-time assertions so that a NULL pointer
dereference is morphed into a safer WARN().
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with
an ID that is too big, but otherwise allow vCPU creation to succeed.
Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises
that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger
than 254 (AVIC) or 511 (x2AVIC).
Alternatively, KVM could advertise an accurate value depending on which
AVIC mode is in use, but that wouldn't really solve the underlying problem,
e.g. would be a breaking change if KVM were to ever try and enable AVIC or
x2AVIC by default.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop vcpu_svm's avic_backing_page pointer and instead grab the physical
address of KVM's vAPIC page directly from the source. Getting a physical
address from a kernel virtual address is not an expensive operation, and
getting the physical address from a struct page is *more* expensive for
CONFIG_SPARSEMEM=y kernels. Regardless, none of the paths that consume
the address are hot paths, i.e. shaving cycles is not a priority.
Eliminating the "cache" means KVM doesn't have to worry about the cache
being invalid, which will simplify a future fix when dealing with vCPU IDs
that are too big.
WARN if KVM attempts to allocate a vCPU's AVIC backing page without an
in-kernel local APIC. avic_init_vcpu() bails early if the APIC is not
in-kernel, and KVM disallows enabling an in-kernel APIC after vCPUs have
been created, i.e. it should be impossible to reach
avic_init_backing_page() without the vAPIC being allocated.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a helper to get the physical address of the AVIC backing page, both
to deduplicate code and to prepare for getting the address directly from
apic->regs, at which point it won't be all that obvious that the address
in question is what SVM calls the AVIC backing page.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned
maximum theoretical physical address for x86-64 CPUs, as x86-64 is
currently defined (going beyond PA52 would require an entirely new paging
mode, which would arguably create a new, different architecture).
All usage in KVM masks the result of page_to_phys(), which on x86-64 is
guaranteed to be 4KiB aligned and a legal physical address; if either of
those requirements doesn't hold true, KVM has far bigger problems.
Drop masking the avic_backing_page with
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but
keep the macro even though it's unused in functional code. It's a
distinct architectural define, and having the definition in software
helps visualize the layout of an entry. And to be hyper-paranoid about
MAXPA going beyond 52, add a compile-time assert to ensure the kernel's
maximum supported physical address stays in bounds.
The unnecessary masking in avic_init_vmcb() also incorrectly assumes that
SME's C-bit resides between bits 51:11; that holds true for current CPUs,
but isn't required by AMD's architecture:
In some implementations, the bit used may be a physical address bit
Key word being "may".
Opportunistically use the GENMASK_ULL() version for
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable
than a set of repeating Fs.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum
theoretical 4KiB-aligned physical address, i.e. is not novel in any way,
and its only usage is to mask the default APIC base, which is 4KiB aligned
and (obviously) a legal physical address.
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|