KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

KVM currently forces non-cacheable memory attributes (either Normal-NC or Device-nGnRE) for a region based on pfn_is_map_memory(), i.e. whether or not the kernel has a cacheable alias for it. This is necessary in situations where KVM needs to perform CMOs on the region but is unnecessarily restrictive when hardware obviates the need for CMOs. KVM doesn't need to perform any CMOs on hardware with FEAT_S2FWB and CTR_EL0.DIC. As luck would have it, there are implementations in the wild that need to map regions of a device with cacheable attributes to function properly. An example of this is Nvidia's Grace Hopper/Blackwell systems where GPU memory is interchangeable with DDR and retains properties such as cacheability, unaligned accesses, atomics and handling of executable faults. Of course, for this to work in a VM the GPU memory needs to have a cacheable mapping at stage-2. Allow cacheable stage-2 mappings to be created on supporting hardware when the VMA has cacheable memory attributes. Check these preconditions during memslot creation (in addition to fault handling) to potentially 'fail-fast' as a courtesy to userspace. CC: Oliver Upton <oliver.upton@linux.dev> CC: Sean Christopherson <seanjc@google.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Suggested-by: David Hildenbrand <david@redhat.com> Tested-by: Donald Dutile <ddutile@redhat.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20250705071717.5062-6-ankita@nvidia.com [ Oliver: refine changelog, squash kvm_supports_cacheable_pfnmap() patch ] Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
author: Ankit Agrawal <ankita@nvidia.com> 2025-07-05 07:17:15 +0000
committer: Oliver Upton <oliver.upton@linux.dev> 2025-07-07 16:54:19 -0700
commit: 0c67288e0c8bc1e39d7057fbae65bf57ca943676 (patch)
tree: d841df94a9b0b3172398232cd1d81f0ecf312993 /arch/arm64/kvm/mmu.c
parent: 2a8dfab26677ae37243a40d170704588f791cfd3 (diff)
1 files changed, 37 insertions, 22 deletions
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 708a635e38bc..3a9e2248f82d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1654,18 +1654,39 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;
 
+	/*
+	 * Check if this is non-struct page memory PFN, and cannot support
+	 * CMOs. It could potentially be unsafe to access as cachable.
+	 */
 	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) {
-		/*
-		 * If the page was identified as device early by looking at
-		 * the VMA flags, vma_pagesize is already representing the
-		 * largest quantity we can map.  If instead it was mapped
-		 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
-		 * and must not be upgraded.
-		 *
-		 * In both cases, we don't let transparent_hugepage_adjust()
-		 * change things at the last minute.
-		 */
-		s2_force_noncacheable = true;
+		if (is_vma_cacheable) {
+			/*
+			 * Whilst the VMA owner expects cacheable mapping to this
+			 * PFN, hardware also has to support the FWB and CACHE DIC
+			 * features.
+			 *
+			 * ARM64 KVM relies on kernel VA mapping to the PFN to
+			 * perform cache maintenance as the CMO instructions work on
+			 * virtual addresses. VM_PFNMAP region are not necessarily
+			 * mapped to a KVA and hence the presence of hardware features
+			 * S2FWB and CACHE DIC are mandatory to avoid the need for
+			 * cache maintenance.
+			 */
+			if (!kvm_supports_cacheable_pfnmap())
+				return -EFAULT;
+		} else {
+			/*
+			 * If the page was identified as device early by looking at
+			 * the VMA flags, vma_pagesize is already representing the
+			 * largest quantity we can map.  If instead it was mapped
+			 * via __kvm_faultin_pfn(), vma_pagesize is set to PAGE_SIZE
+			 * and must not be upgraded.
+			 *
+			 * In both cases, we don't let transparent_hugepage_adjust()
+			 * change things at the last minute.
+			 */
+			s2_force_noncacheable = true;
+		}
 	} else if (logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
@@ -1674,15 +1695,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		writable = false;
 	}
 
-	/*
-	 * Prevent non-cacheable mappings in the stage-2 if a region of memory
-	 * is cacheable in the primary MMU and the kernel lacks a cacheable
-	 * alias. KVM cannot guarantee coherency between the guest/host aliases
-	 * without the ability to perform CMOs.
-	 */
-	if (is_vma_cacheable && s2_force_noncacheable)
-		return -EINVAL;
-
 	if (exec_fault && s2_force_noncacheable)
 		return -ENOEXEC;
 
@@ -2243,8 +2255,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				break;
 			}
 
-			/* Cacheable PFNMAP is not allowed */
-			if (kvm_vma_is_cacheable(vma)) {
+			/*
+			 * Cacheable PFNMAP is allowed only if the hardware
+			 * supports it.
+			 */
+			if (kvm_vma_is_cacheable(vma) && !kvm_supports_cacheable_pfnmap()) {
 				ret = -EINVAL;
 				break;
 			}
author	Ankit Agrawal <ankita@nvidia.com>	2025-07-05 07:17:15 +0000
committer	Oliver Upton <oliver.upton@linux.dev>	2025-07-07 16:54:19 -0700
commit	0c67288e0c8bc1e39d7057fbae65bf57ca943676 (patch)
tree	d841df94a9b0b3172398232cd1d81f0ecf312993 /arch/arm64/kvm/mmu.c
parent	2a8dfab26677ae37243a40d170704588f791cfd3 (diff)