path: root/drivers/gpu
Age | Commit message | Author
2026-02-05 | drm/amdgpu/sdma5.2: enable queue resets unconditionally | Alex Deucher
There is no firmware version dependency. This also enables sdma queue resets on all SDMA 5.2.x based chips. Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-05 | drm/amdgpu/sdma5: enable queue resets unconditionally | Alex Deucher
There is no firmware version dependency. Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask") Cc: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-05 | drm/amdgpu: Fix memory leak in amdgpu_ras_init() | Zilin Guan
When amdgpu_nbio_ras_sw_init() fails in amdgpu_ras_init(), the function returns directly without freeing the allocated con structure, leading to a memory leak. Fix this by jumping to the release_con label to properly clean up the allocated memory before returning the error code. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: fdc94d3a8c88 ("drm/amdgpu: Rework pcie_bif ras sw_init") Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
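The fix uses the kernel's usual goto-based error unwinding. Below is a minimal userspace sketch of that idiom; `ras_init_sketch()` and `sub_init()` are hypothetical stand-ins for amdgpu_ras_init() and amdgpu_nbio_ras_sw_init(), not the driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the RAS context allocated by the init function. */
struct ras_ctx {
	int initialized;
};

/* Hypothetical sub-initialization step that can fail. */
static int sub_init(int should_fail)
{
	return should_fail ? -EIO : 0;
}

/*
 * Single-exit cleanup: every failure after the allocation jumps to a
 * label that frees what has been allocated so far.
 */
static int ras_init_sketch(struct ras_ctx **out, int fail_sub_init)
{
	struct ras_ctx *con;
	int r;

	con = calloc(1, sizeof(*con));
	if (!con)
		return -ENOMEM;

	r = sub_init(fail_sub_init);
	if (r)
		goto release_con; /* a bare "return r;" here would leak con */

	con->initialized = 1;
	*out = con;
	return 0;

release_con:
	free(con);
	return r;
}
```

The same shape applies to the amdgpu_acpi_enumerate_xcc() leak below: any early return after an allocation should go through the label that frees it.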
2026-02-05 | drm/amdgpu: Use kvfree instead of kfree in amdgpu_gmc_get_nps_memranges() | Zilin Guan
amdgpu_discovery_get_nps_info() internally allocates memory for ranges using kvcalloc(), which may use vmalloc() for large allocations. Using kfree() to release vmalloc memory will lead to memory corruption. Use kvfree() to safely handle both kmalloc and vmalloc allocations. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: b194d21b9bcc ("drm/amdgpu: Use NPS ranges from discovery table") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-05 | drm/amdgpu: Fix memory leak in amdgpu_acpi_enumerate_xcc() | Zilin Guan
In amdgpu_acpi_enumerate_xcc(), if amdgpu_acpi_dev_init() returns -ENOMEM, the function returns directly without releasing the allocated xcc_info, resulting in a memory leak. Fix this by ensuring that xcc_info is properly freed in the error paths. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: 4d5275ab0b18 ("drm/amdgpu: Add parsing of acpi xcc objects") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-05 | drm/amd/ras: statistic xgmi training error count | Stanley.Yang
Report the uncorrectable error count for XGMI training errors. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-05 | drm/vmwgfx: Return the correct value in vmw_translate_ptr functions | Ian Forbes
Before the commit referenced in the Fixes tag, these functions used a lookup function that returned a pointer. This was changed to another lookup function that returns an error code, with the pointer becoming an out parameter. The error path for a failed lookup was not updated to reflect this, and the code continued to return PTR_ERR() of the now-uninitialized pointer. This could cause the vmw_translate_ptr functions to return success when they had actually failed, leading to further uninitialized and out-of-bounds accesses. Reported-by: Kuzey Arda Bulut <kuzeyardabulut@gmail.com> Fixes: a309c7194e8a ("drm/vmwgfx: Remove rcu locks from user resources") Signed-off-by: Ian Forbes <ian.forbes@broadcom.com> Reviewed-by: Zack Rusin <zack.rusin@broadcom.com> Signed-off-by: Zack Rusin <zack.rusin@broadcom.com> Link: https://patch.msgid.link/20260113175357.129285-1-ian.forbes@broadcom.com
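A sketch of the pattern behind this bug, assuming a lookup that reports errors through its return value and leaves the out parameter untouched on failure. `lookup_res()`, `translate_ptr()` and the table are hypothetical; the point is that the error path must propagate the returned code, not derive one from the pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical resource table standing in for the driver's user resources. */
struct res {
	int id;
};

static struct res table[4] = { { 0 }, { 1 }, { 2 }, { 3 } };

/* New-style lookup: error code as the return value, result via an out parameter. */
static int lookup_res(int handle, struct res **out)
{
	if (handle < 0 || handle >= 4)
		return -EINVAL; /* *out is left untouched on failure */
	*out = &table[handle];
	return 0;
}

static int translate_ptr(int handle, struct res **out)
{
	struct res *res; /* uninitialized until lookup_res() succeeds */
	int ret;

	ret = lookup_res(handle, &res);
	if (ret)
		return ret; /* the bug was returning PTR_ERR(res) here */

	*out = res;
	return 0;
}
```

With the buggy form, the returned value is whatever garbage `res` held, which can be zero and therefore look like success.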
2026-02-05 | drm/vmwgfx: Set a unique ID for each submitted command buffer | Ian Forbes
These IDs are logged by the Hypervisor when debug logging is enabled. Having the IDs in the log makes it much easier to see when command buffers start and finish. They can also be used by logging/tracing in the Guest to help correlate between Guest and Hypervisor logs. Signed-off-by: Ian Forbes <ian.forbes@broadcom.com> Signed-off-by: Zack Rusin <zack.rusin@broadcom.com> Link: https://patch.msgid.link/20260109155139.3259493-1-ian.forbes@broadcom.com
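One common way to produce such IDs is a monotonically increasing, device-wide atomic counter. The sketch below assumes that approach; the counter name and `cmdbuf_assign_id()` are hypothetical, and the vmwgfx patch may store or size the ID differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical device-wide counter; the driver's actual field may differ. */
static atomic_uint_fast64_t next_cmdbuf_id = 1;

/*
 * Hand out the next ID. atomic_fetch_add() returns the previous value,
 * so concurrent submitters never receive the same ID.
 */
static uint64_t cmdbuf_assign_id(void)
{
	return (uint64_t)atomic_fetch_add(&next_cmdbuf_id, 1);
}
```

A 64-bit counter is chosen here so that wraparound is not a practical concern for log correlation.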
2026-02-05 | drm/vmwgfx: Fix invalid kref_put callback in vmw_bo_dirty_release | Brad Spengler
The kref_put() call uses (void *)kvfree as the release callback, which is incorrect. kref_put() expects a function with signature void (*release)(struct kref *), but kvfree has signature void (*)(const void *). Calling through an incompatible function pointer is undefined behavior. The code only worked by accident because ref_count is the first member of vmw_bo_dirty, making the kref pointer equal to the struct pointer. Fix this by adding a proper release callback that uses container_of() to retrieve the containing structure before freeing. Fixes: c1962742ffff ("drm/vmwgfx: Use kref in vmw_bo_dirty") Signed-off-by: Brad Spengler <brad.spengler@opensrcsec.com> Signed-off-by: Zack Rusin <zack.rusin@broadcom.com> Cc: Ian Forbes <ian.forbes@broadcom.com> Link: https://patch.msgid.link/20260107171236.3573118-1-zack.rusin@broadcom.com
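The correctly typed release callback looks roughly like this. The sketch uses simplified, non-atomic userspace stand-ins for `struct kref`, `kref_put()` and `container_of()`, and a hypothetical `struct bo_dirty` in which the kref is deliberately not the first member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified userspace stand-ins for the kernel's container_of() and kref. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct kref {
	int refcount; /* not atomic, unlike the kernel's kref */
};

/* Returns 1 if the object was released, mirroring kref_put(). */
static int kref_put(struct kref *kref, void (*release)(struct kref *kref))
{
	if (--kref->refcount == 0) {
		release(kref);
		return 1;
	}
	return 0;
}

/* Hypothetical layout: the kref is NOT the first member here. */
struct bo_dirty {
	unsigned long start, end;
	struct kref ref_count;
};

static int released; /* lets a caller observe that the callback ran */

/* Matches void (*)(struct kref *): recover the container, then free it. */
static void bo_dirty_release(struct kref *kref)
{
	struct bo_dirty *dirty = container_of(kref, struct bo_dirty, ref_count);

	released++;
	free(dirty);
}
```

With this layout, passing a cast `free` as the callback would hand `free()` a pointer into the middle of the allocation, which is exactly the accident the original code avoided only because `ref_count` happened to be the first member.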
2026-02-05 | drm/xe/pm: Disable D3Cold for BMG only on specific platforms | Karthik Poosa
Restrict D3Cold disablement for BMG to unsupported NUC platforms, instead of disabling it on all platforms. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Fixes: 3e331a6715ee ("drm/xe/pm: Temporarily disable D3Cold on BMG") Link: https://patch.msgid.link/20260123173238.1642383-1-karthik.poosa@intel.com Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 39125eaf8863ab09d70c4b493f58639b08d5a897) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-02-05 | drm/xe: Fix kerneldoc for xe_tlb_inval_job_alloc_dep | Shuicheng Lin
Correct the function name in the kerneldoc. This fixes the following warning: "Warning: drivers/gpu/drm/xe/xe_tlb_inval_job.c:210 expecting prototype for xe_tlb_inval_alloc_dep(). Prototype was for xe_tlb_inval_job_alloc_dep() instead" Fixes: 15366239e2130 ("drm/xe: Decouple TLB invalidations from GT") Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129233834.419977-8-shuicheng.lin@intel.com (cherry picked from commit 9f9c117ac566cb567dd56cc5b7564c45653f7a2a) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-02-05 | drm/xe: Fix kerneldoc for xe_gt_tlb_inval_init_early | Shuicheng Lin
Correct the function name in the kerneldoc. This fixes the following warning: "Warning: drivers/gpu/drm/xe/xe_tlb_inval.c:136 expecting prototype for xe_gt_tlb_inval_init(). Prototype was for xe_gt_tlb_inval_init_early() instead" v2: add () for the function. (Michal) Fixes: db16f9d90c1d9 ("drm/xe: Split TLB invalidation code in frontend and backend") Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129233834.419977-7-shuicheng.lin@intel.com (cherry picked from commit 0651dbb9d6a72e99569576fbec4681fd8160d161) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-02-05 | drm/xe: Fix kerneldoc for xe_migrate_exec_queue | Shuicheng Lin
Correct the function name in the kerneldoc. This fixes the following warning: "Warning: drivers/gpu/drm/xe/xe_migrate.c:1262 expecting prototype for xe_get_migrate_exec_queue(). Prototype was for xe_migrate_exec_queue() instead" Fixes: 916ee4704a865 ("drm/xe/vf: Register CCS read/write contexts with Guc") Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129233834.419977-6-shuicheng.lin@intel.com (cherry picked from commit 9fd8da717934f05125b9ba6782622c459a368dc0) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-02-05 | drm/xe/query: Fix topology query pointer advance | Shuicheng Lin
The topology query helper advanced the user pointer by the size of the pointer, not the size of the structure. This can misalign the output blob and corrupt the following mask. Fix the increment to use sizeof(*topo). There is no issue currently, as sizeof(*topo) happens to be equal to sizeof(topo) on 64-bit systems (both evaluate to 8 bytes). Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20260130043907.465128-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit c2a6859138e7f73ad904be17dd7d1da6cc7f06b3) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
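A small illustration of why `sizeof(topo)` happened to work. The sketch assumes a header shaped like the topology-mask record (two 16-bit fields and a 32-bit length, 8 bytes in total) followed by a variable-length mask; `struct topo_hdr` and `emit_mask()` are hypothetical names, not the xe code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: 2 + 2 + 4 = 8 bytes, same as a 64-bit pointer. */
struct topo_hdr {
	uint16_t gt_id;
	uint16_t type;
	uint32_t num_bytes;
};

/*
 * Append one header followed by its mask and return the new end.
 * Advancing by sizeof(topo) (the pointer) instead of sizeof(*topo)
 * (the struct) only works where both happen to be 8 bytes.
 */
static unsigned char *emit_mask(unsigned char *p, const struct topo_hdr *topo,
				const unsigned char *mask)
{
	memcpy(p, topo, sizeof(*topo));
	p += sizeof(*topo);
	memcpy(p, mask, topo->num_bytes);
	p += topo->num_bytes;
	return p;
}
```

On a 32-bit build `sizeof(topo)` is 4, so the buggy form would write the mask at offset 4, on top of the header's `num_bytes` field.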
2026-02-05 | drm/xe/guc: Fix kernel-doc warning in GuC scheduler ABI header | Chaitanya Kumar Borah
The GuC scheduler ABI header contains a file-level comment that is not intended to document a kernel-doc symbol. Using kernel-doc comment syntax (/** */) triggers kernel-doc warnings. With "-Werror", this causes the build to fail. Convert the comment to a regular block comment.
  HDRTEST drivers/gpu/drm/xe/abi/guc_scheduler_abi.h
  Warning: drivers/gpu/drm/xe/abi/guc_scheduler_abi.h:11 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
   * Generic defines required for registration with and submissions to the GuC
  1 warnings as errors
  make[6]: *** [drivers/gpu/drm/xe/Makefile:377: drivers/gpu/drm/xe/abi/guc_scheduler_abi.hdrtest] Error 3
  make[5]: *** [scripts/Makefile.build:544: drivers/gpu/drm/xe] Error 2
  make[4]: *** [scripts/Makefile.build:544: drivers/gpu/drm] Error 2
  make[3]: *** [scripts/Makefile.build:544: drivers/gpu] Error 2
  make[2]: *** [scripts/Makefile.build:544: drivers] Error 2
  make[1]: *** [/home/kbuild2/kernel/Makefile:2088: .] Error 2
  make: *** [Makefile:248: __sub-make] Error 2
v2:
- Add Fixes tag (Daniele)
Fixes: b0c5cf4f5917 ("drm/gt/guc: extract scheduler-related defines from guc_fwif.h") Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patch.msgid.link/20260130135210.2659200-1-chaitanya.kumar.borah@intel.com (cherry picked from commit f89dbe14a0c8854b7aaf960dd842c10698b3ff19) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-02-05 | drm/xe/guc: Fix CFI violation in debugfs access. | Daniele Ceraolo Spurio
xe_guc_print_info is void-returning, but the function pointer it is assigned to expects an int-returning function, leading to the following CFI error: [ 206.873690] CFI failure at guc_debugfs_show+0xa1/0xf0 [xe] (target: xe_guc_print_info+0x0/0x370 [xe]; expected type: 0xbe3bc66a) Fix this by updating xe_guc_print_info to return an integer. Fixes: e15826bb3c2c ("drm/xe/guc: Refactor GuC debugfs initialization") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: George D Sworo <george.d.sworo@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129182547.32899-2-daniele.ceraolospurio@intel.com (cherry picked from commit dd8ea2f2ab71b98887fdc426b0651dbb1d1ea760) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
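The underlying rule is that every function stored in a callback table must have exactly the pointer's type, with no cast. A userspace sketch with hypothetical names (`guc_show_fn`, `show_entry`, a char-buffer printer instead of the kernel's drm_printer):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical state printed by the debugfs entry. */
struct guc {
	int submissions;
};

/* Every entry in the table must have exactly this type. */
typedef int (*guc_show_fn)(struct guc *guc, char *buf, size_t len);

/* Fixed: returns int, so it matches guc_show_fn without any cast. */
static int guc_print_info(struct guc *guc, char *buf, size_t len)
{
	snprintf(buf, len, "submissions: %d\n", guc->submissions);
	return 0;
}

struct show_entry {
	const char *name;
	guc_show_fn show;
};

/*
 * A void-returning guc_print_info() would only fit here via a cast, and
 * an indirect call through the mismatched pointer is what CFI traps.
 */
static const struct show_entry guc_entries[] = {
	{ "guc_info", guc_print_info },
};

static int show_entry_run(const struct show_entry *e, struct guc *guc,
			  char *buf, size_t len)
{
	return e->show(guc, buf, len);
}
```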
2026-02-05 | drm/xe/guc: Fix CFI violation in debugfs access. | Daniele Ceraolo Spurio
xe_guc_print_info is void-returning, but the function pointer it is assigned to expects an int-returning function, leading to the following CFI error: [ 206.873690] CFI failure at guc_debugfs_show+0xa1/0xf0 [xe] (target: xe_guc_print_info+0x0/0x370 [xe]; expected type: 0xbe3bc66a) Fix this by updating xe_guc_print_info to return an integer. Fixes: e15826bb3c2c ("drm/xe/guc: Refactor GuC debugfs initialization") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: George D Sworo <george.d.sworo@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129182547.32899-2-daniele.ceraolospurio@intel.com (cherry picked from commit dd8ea2f2ab71b98887fdc426b0651dbb1d1ea760) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-02-05 | drm/bridge: imx8mp-hdmi-pai: enable PM runtime | Shengjiu Wang
There is an audio channel shift issue in the multichannel case: the channel order is correct for the first run, but shifted for the second run. The fix is to reset the PAI interface at the end of playback. The reset can be handled by runtime PM, so enable runtime PM. Fixes: 0205fae6327a ("drm/bridge: imx: add driver for HDMI TX Parallel Audio Interface") Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com> Reviewed-by: Liu Ying <victor.liu@nxp.com> Signed-off-by: Liu Ying <victor.liu@nxp.com> Link: https://lore.kernel.org/r/20260130080910.3532724-1-shengjiu.wang@nxp.com
2026-02-05 | nouveau/vmm: start tracking if the LPT PTE is valid. (v6) | Dave Airlie
When NVK enabled large pages, userspace tests were seeing fault reports at a valid address. There was a case where an address moving from a 64k page to 4k pages could expose a race between unmapping the 4k page, mapping the 64k page and unreferencing the 4k pages. Unreferencing the 4k pages would cause the dual-page table handling to always set the LPTE entry to SPARSE or INVALID, but if we'd mapped a valid LPTE in the meantime, it would get trashed. Keep track of when a valid LPTE has been referenced, and don't reset it in that case. This adds an LPTE valid tracker and an LPTE reference count. Whenever an LPTE is referenced, it is marked valid and the reference count increases; whenever it is unreferenced, the reference count is decremented. Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/14610 Reviewed-by: Mary Guillemard <mary@mary.zone> Tested-by: Mary Guillemard <mary@mary.zone> Tested-by: Mel Henning <mhenning@darkrefraction.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patch.msgid.link/20260204030208.2313241-4-airlied@gmail.com
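A hypothetical model of the tracking described above, packing the counters into a u32 with bitfields. The struct, field widths, enum and function names are illustrative assumptions, not nouveau's actual layout; the sketch only shows why checking an "LPTE valid" bit stops the small-page unref path from clobbering a large page mapped in the meantime.

```c
#include <assert.h>
#include <stdint.h>

enum lpte_state { LPTE_INVALID, LPTE_SPARSE, LPTE_VALID };

/* Hypothetical per-slot tracker packed into a u32; widths are illustrative. */
struct pte_tracker {
	uint32_t spte_count : 16; /* small (4k) PTEs still referenced */
	uint32_t lpte_valid : 1;  /* a large (64k) PTE is mapped here */
	uint32_t lpte_count : 15; /* references held on that large PTE */
};

static void lpte_ref(struct pte_tracker *t)
{
	t->lpte_valid = 1;
	t->lpte_count++;
}

static void lpte_unref(struct pte_tracker *t)
{
	t->lpte_count--;
	if (t->lpte_count == 0)
		t->lpte_valid = 0;
}

/*
 * Drop one small-PTE reference and return the LPTE state to program.
 * Resetting unconditionally once the small PTEs are gone would trash a
 * large page mapped in the meantime; checking lpte_valid keeps it.
 */
static enum lpte_state spte_unref(struct pte_tracker *t, enum lpte_state cur,
				  int sparse)
{
	t->spte_count--;
	if (t->spte_count == 0 && !t->lpte_valid)
		return sparse ? LPTE_SPARSE : LPTE_INVALID;
	return cur;
}
```

In the racy order from the commit message (4k unmap, 64k map, then the delayed 4k unref), `spte_unref()` sees `lpte_valid` set and leaves the large mapping alone.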
2026-02-05 | nouveau/vmm: increase size of vmm pte tracker struct to u32 (v2) | Dave Airlie
We need to track larger SPTE counts than before, because unrefs sometimes get delayed. This doesn't fix LPT tracking yet; it just creates space for it. Reviewed-by: Mary Guillemard <mary@mary.zone> Tested-by: Mary Guillemard <mary@mary.zone> Tested-by: Mel Henning <mhenning@darkrefraction.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patch.msgid.link/20260204030208.2313241-3-airlied@gmail.com
2026-02-05 | nouveau/vmm: rewrite pte tracker using a struct and bitfields. | Dave Airlie
I want to increase the counters here and start tracking LPTs as well, as there are certain situations where userspace with mixed page sizes can cause refs/unrefs to live longer, so we need better reference counting. This should introduce no functional change. Reviewed-by: Mary Guillemard <mary@mary.zone> Tested-by: Mary Guillemard <mary@mary.zone> Tested-by: Mel Henning <mhenning@darkrefraction.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patch.msgid.link/20260204030208.2313241-2-airlied@gmail.com
2026-02-04 | Merge branch 'pm-runtime' | Rafael J. Wysocki
Merge updates related to runtime PM for 6.20-rc1/7.0-rc1:
 - Make several drivers discard pm_runtime_put() return value in preparation for converting that function to a void one (Rafael Wysocki)
* pm-runtime:
  drm: Discard pm_runtime_put() return value
  genirq/chip: Change irq_chip_pm_put() return type to void
  scsi: ufs: core: Discard pm_runtime_put() return values
  platform/chrome: cros_hps_i2c: Discard pm_runtime_put() return value
  coresight: Discard pm_runtime_put() return values
  hwspinlock: omap: Discard pm_runtime_put() return value
  watchdog: rzv2h_wdt: Discard pm_runtime_put() return value
  watchdog: rz: Discard pm_runtime_put() return values
  media: ccs: Discard pm_runtime_put() return value
  drm/imagination: Discard pm_runtime_put() return value
  USB: core: Discard pm_runtime_put() return value
2026-02-04 | drm/xe/pm: Disable D3Cold for BMG only on specific platforms | Karthik Poosa
Restrict D3Cold disablement for BMG to unsupported NUC platforms, instead of disabling it on all platforms. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Fixes: 3e331a6715ee ("drm/xe/pm: Temporarily disable D3Cold on BMG") Link: https://patch.msgid.link/20260123173238.1642383-1-karthik.poosa@intel.com Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 39125eaf8863ab09d70c4b493f58639b08d5a897) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-02-04 | drm/xe: Fix kerneldoc for xe_tlb_inval_job_alloc_dep | Shuicheng Lin
Correct the function name in the kerneldoc. This fixes the following warning: "Warning: drivers/gpu/drm/xe/xe_tlb_inval_job.c:210 expecting prototype for xe_tlb_inval_alloc_dep(). Prototype was for xe_tlb_inval_job_alloc_dep() instead" Fixes: 15366239e2130 ("drm/xe: Decouple TLB invalidations from GT") Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129233834.419977-8-shuicheng.lin@intel.com (cherry picked from commit 9f9c117ac566cb567dd56cc5b7564c45653f7a2a) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-02-04 | drm/xe: Fix kerneldoc for xe_gt_tlb_inval_init_early | Shuicheng Lin
Correct the function name in the kerneldoc. This fixes the following warning: "Warning: drivers/gpu/drm/xe/xe_tlb_inval.c:136 expecting prototype for xe_gt_tlb_inval_init(). Prototype was for xe_gt_tlb_inval_init_early() instead" v2: add () for the function. (Michal) Fixes: db16f9d90c1d9 ("drm/xe: Split TLB invalidation code in frontend and backend") Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129233834.419977-7-shuicheng.lin@intel.com (cherry picked from commit 0651dbb9d6a72e99569576fbec4681fd8160d161) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-02-04 | drm/xe: Fix kerneldoc for xe_migrate_exec_queue | Shuicheng Lin
Correct the function name in the kerneldoc. This fixes the following warning: "Warning: drivers/gpu/drm/xe/xe_migrate.c:1262 expecting prototype for xe_get_migrate_exec_queue(). Prototype was for xe_migrate_exec_queue() instead" Fixes: 916ee4704a865 ("drm/xe/vf: Register CCS read/write contexts with Guc") Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129233834.419977-6-shuicheng.lin@intel.com (cherry picked from commit 9fd8da717934f05125b9ba6782622c459a368dc0) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-02-04 | drm/xe/query: Fix topology query pointer advance | Shuicheng Lin
The topology query helper advanced the user pointer by the size of the pointer, not the size of the structure. This can misalign the output blob and corrupt the following mask. Fix the increment to use sizeof(*topo). There is no issue currently, as sizeof(*topo) happens to be equal to sizeof(topo) on 64-bit systems (both evaluate to 8 bytes). Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20260130043907.465128-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit c2a6859138e7f73ad904be17dd7d1da6cc7f06b3) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2026-02-04 | drm/mgag200: fix mgag200_bmc_stop_scanout() | Jacob Keller
The mgag200_bmc_stop_scanout() function is called by the .atomic_disable() handler for the MGA G200 VGA BMC encoder. This function performs a few register writes to inform the BMC of an upcoming mode change, and then polls to wait until the BMC actually stops. The polling is implemented using a busy loop with udelay() and an iteration timeout of 300, resulting in the function blocking for 300 milliseconds. The function is ultimately called from output_poll_execute(), the work item that polls for DRM output changes in the mgag200 driver:
  kworker/0:0-mm_ 3528 [000] 4555.315364:
    ffffffffaa0e25b3 delay_halt.part.0+0x33
    ffffffffc03f6188 mgag200_bmc_stop_scanout+0x178
    ffffffffc087ae7a disable_outputs+0x12a
    ffffffffc087c12a drm_atomic_helper_commit_tail+0x1a
    ffffffffc03fa7b6 mgag200_mode_config_helper_atomic_commit_tail+0x26
    ffffffffc087c9c1 commit_tail+0x91
    ffffffffc087d51b drm_atomic_helper_commit+0x11b
    ffffffffc0509694 drm_atomic_commit+0xa4
    ffffffffc05105e8 drm_client_modeset_commit_atomic+0x1e8
    ffffffffc0510ce6 drm_client_modeset_commit_locked+0x56
    ffffffffc0510e24 drm_client_modeset_commit+0x24
    ffffffffc088a743 __drm_fb_helper_restore_fbdev_mode_unlocked+0x93
    ffffffffc088a683 drm_fb_helper_hotplug_event+0xe3
    ffffffffc050f8aa drm_client_dev_hotplug+0x9a
    ffffffffc088555a output_poll_execute+0x29a
    ffffffffa9b35924 process_one_work+0x194
    ffffffffa9b364ee worker_thread+0x2fe
    ffffffffa9b3ecad kthread+0xdd
    ffffffffa9a08549 ret_from_fork+0x29
On a server running ptp4l with the mgag200 driver loaded, we found that ptp4l would sometimes get blocked from execution because of this busy-waiting loop. Every so often, approximately once every 20 minutes (though with large variance), the output_poll_execute() thread would detect some sort of change that required a hotplug event, which in turn attempted to stop the BMC scanout, causing a 300 msec delay on one CPU. On this system, ptp4l was pinned to a single CPU.
When the output_poll_execute() thread ran on that CPU, it blocked ptp4l from executing for its 300 millisecond duration. This resulted in PTP service disruptions such as failure to send a SYNC message on time, failure to handle ANNOUNCE messages on time, and clock check warnings from the application. All of this despite the application being configured with FIFO_RT and a higher priority than the background workqueue tasks. (However, note that the kernel did not use CONFIG_PREEMPT...) It is unclear if the event is due to a faulty VGA connection, another bug, or actual events causing a change in the connection. At least on the system under test it is not a one-time event and consistently causes disruption to the time sensitive applications. The function has some helpful comments explaining what steps it is attempting to take. In particular, step 3a and 3b are explained as such: 3a - The third step is to verify if there is an active scan. We are waiting on a 0 on remhsyncsts (<XSPAREREG<0>. 3b - This step occurs only if the remove is actually scanning. We are waiting for the end of the frame which is a 1 on remvsyncsts (<XSPAREREG<1>). The actual steps 3a and 3b are implemented as while loops with a non-sleeping udelay(). The first step iterates while the tmp value at position 0 is *not* set. That is, it keeps iterating as long as the bit is zero. If the bit is already 0 (because there is no active scan), it will iterate the entire 300 attempts which wastes 300 milliseconds in total. This is opposite of what the description claims. The step 3b logic only executes if we do not iterate over the entire 300 attempts in the first loop. If it does trigger, it is trying to check and wait for a 1 on the remvsyncsts. However, again the condition is actually inverted and it will loop as long as the bit is 1, stopping once it hits zero (rather than the explained attempt to wait until we see a 1). 
Worse, both loops are implemented using non-sleeping waits which spin instead of allowing the scheduler to run other processes. If the kernel is not configured to allow arbitrary preemption, it will waste valuable CPU time doing nothing. There does not appear to be any documentation for the BMC register interface beyond what is in the comments here. It seems more probable that the comment is correct and that the implementation accidentally inverted the intended logic. Reading through other DRM driver implementations, it does not appear that the .atomic_enable or .atomic_disable handlers need to delay instead of sleep. For example, the ast_astdp_encoder_helper_atomic_disable() function calls ast_dp_set_phy_sleep(), which uses msleep(). The "atomic" in the name refers to atomic modesetting, which lets userspace apply a display configuration atomically, and not to the "atomic context" of the kernel. There is no reason to use udelay() here if a sleep would be sufficient. Replace the while loops with a read_poll_timeout() based implementation that sleeps between iterations and stops polling once the condition is met (instead of looping as long as the condition is met). This aligns with the commented behavior and avoids blocking the CPU while doing nothing. Note that RREG_DAC is implemented using a statement expression so that it works properly with the read_poll_timeout family of helpers. The other RREG_<TYPE> macros ought to be cleaned up to have better semantics, and several places in the mgag200 driver could make use of RREG_DAC or similar RREG_* macros, but that task has been left as a future cleanup outside of this bugfix.
Fixes: 414c45310625 ("mgag200: initial g200se driver (v2)") Suggested-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Jocelyn Falempe <jfalempe@redhat.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patch.msgid.link/20260202-jk-mgag200-fix-bad-udelay-v2-1-ce1e9665987d@intel.com
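The corrected polling logic, poll until the condition holds and sleep between reads, can be sketched in userspace as follows. `poll_bits()` is a hypothetical helper that bounds the number of attempts rather than elapsed time (the kernel's read_poll_timeout() bounds time), and the fake register only simulates the bit sequence described in steps 3a and 3b.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical register accessor standing in for a DAC register read. */
typedef uint8_t (*reg_read_fn)(void *ctx);

static void sleep_us(long us)
{
	struct timespec ts = { us / 1000000, (us % 1000000) * 1000 };

	nanosleep(&ts, NULL);
}

/*
 * Read until (val & mask) == want, sleeping between reads instead of
 * spinning, and stop as soon as the condition holds. Returns 0 or
 * -ETIMEDOUT. Unlike read_poll_timeout(), this bounds attempts, not time.
 */
static int poll_bits(reg_read_fn rd, void *ctx, uint8_t mask, uint8_t want,
		     long sleep_between_us, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if ((rd(ctx) & mask) == want)
			return 0;
		sleep_us(sleep_between_us);
	}
	return -ETIMEDOUT;
}

/* Fake register: bit 0 (scan active) clears on the third read, bit 1 then reads 1. */
struct fake_reg {
	int reads;
};

static uint8_t fake_read(void *ctx)
{
	struct fake_reg *r = ctx;

	return ++r->reads < 3 ? 0x01 : 0x02;
}
```

Step 3a then becomes `poll_bits(rd, ctx, 0x01, 0x00, ...)` (wait for a 0 on bit 0) and step 3b `poll_bits(rd, ctx, 0x02, 0x02, ...)` (wait for a 1 on bit 1); if bit 0 is already 0, step 3a returns after a single read instead of burning all 300 iterations.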
2026-02-04 | nouveau/gsp: fix suspend/resume regression on r570 firmware | Dave Airlie
The r570 firmware with certain GPUs (at least the RTX6000) needs a flag that reflects whether the driver is in system suspend or runtime PM. Use that information to set the correct flags for the firmware. This fixes a regression on the RTX6000 and other GPUs since r570 firmware was enabled. Fixes: 53dac0623853 ("drm/nouveau/gsp: add support for 570.144") Cc: <stable@vger.kernel.org> Reviewed-by: Lyude Paul <lyude@redhat.com> Tested-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patch.msgid.link/20260203052431.2219998-4-airlied@gmail.com
2026-02-04 | nouveau: add a third state to the fini handler. | Dave Airlie
This is just refactoring to allow the lower layers to distinguish between suspend and runtime suspend. GSP 570 needs to set a flag when the GPU is going into GCOFF. This flag, taken from the opengpu driver, is set whenever runtime suspend is entering GCOFF, but not on normal suspend paths. This just refactors the code; a subsequent patch uses the information. Fixes: 53dac0623853 ("drm/nouveau/gsp: add support for 570.144") Cc: <stable@vger.kernel.org> Reviewed-by: Lyude Paul <lyude@redhat.com> Tested-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patch.msgid.link/20260203052431.2219998-3-airlied@gmail.com
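A sketch of why a third state is needed, assuming the fini path previously took a plain boolean "suspend" argument that could not tell system suspend from runtime suspend. The enum and helper names are hypothetical, not nouveau's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; nouveau's actual enum and values may differ. */
enum fini_state {
	FINI_UNLOAD,          /* driver teardown */
	FINI_SUSPEND,         /* system suspend */
	FINI_RUNTIME_SUSPEND, /* runtime PM suspend, GPU entering GCOFF */
};

/*
 * With a bool, both kinds of suspend look the same; the third state lets
 * the GSP layer set its GCOFF flag only on the runtime PM path.
 */
static bool gsp_gcoff_flag(enum fini_state state)
{
	return state == FINI_RUNTIME_SUSPEND;
}
```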
2026-02-04 | nouveau/gsp: use rpc sequence numbers properly. | Dave Airlie
There are two layers of sequence numbers, one at the msg level and one at the rpc level. 570 firmware started asserting on the sequence numbers being in the right order, and we would see nocat records with asserts in them. Add the rpc level sequence number support. Fixes: 53dac0623853 ("drm/nouveau/gsp: add support for 570.144") Cc: <stable@vger.kernel.org> Signed-off-by: Dave Airlie <airlied@redhat.com> Reviewed-by: Lyude Paul <lyude@redhat.com> Tested-by: Lyude Paul <lyude@redhat.com> Link: https://patch.msgid.link/20260203052431.2219998-2-airlied@gmail.com
2026-02-03 | drm/amdgpu: Fix double deletion of validate_list | Harish Kasiviswanathan
If amdgpu_amdkfd_gpuvm_free_memory_of_gpu() fails after kgd_mem is removed from validate_list, the mem handle still lingers in the KFD idr. This means that when the process is terminated, kfd_process_free_outstanding_kfd_bos() will call amdgpu_amdkfd_gpuvm_free_memory_of_gpu() again, resulting in a double deletion. To avoid this: (a) check whether the list is empty before deleting from it; (b) rearrange amdgpu_amdkfd_gpuvm_free_memory_of_gpu() so that it can be safely called again if it fails the first time. Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6ba60345f45eaf7cb4f89105d26083a4b9fd1cba)
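Part (a) relies on the node being left in a recognizable "not on any list" state after removal. A userspace sketch using a minimal copy of the kernel's circular list, assuming removal via a list_del_init()-style helper that re-points the node at itself; `remove_from_validate_list()` is a hypothetical name, not the amdgpu code.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal userspace copy of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink the node and re-point it at itself. */
static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	INIT_LIST_HEAD(e);
}

/* Safe to call twice: an already-removed node looks like an empty list. */
static void remove_from_validate_list(struct list_head *entry)
{
	if (!list_empty(entry))
		list_del_init(entry);
}
```

A plain kernel list_del() poisons the node's pointers instead, so an emptiness check on the node is only meaningful if removal reinitializes it.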
2026-02-03 | drm/amd/display: remove assert around dpp_base replacement | Melissa Wen
There is nothing wrong if in_shaper_func type is DISTRIBUTED POINTS. Remove the assert placed for a TODO to avoid misinterpretations. Signed-off-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1714dcc4c2c53e41190896eba263ed6328bcf415)
2026-02-03 | drm/amd/display: extend delta clamping logic to CM3 LUT helper | Melissa Wen
Commit 27fc10d1095f ("drm/amd/display: Fix the delta clamping for shaper LUT") fixed banding when using a plane shaper LUT in the DCN10 CM helper. The problem is also present in the DCN30 CM helper, so fix the banding there by extending the same delta clamping fix to CM3. Signed-off-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 0274a54897f356f9c78767c4a2a5863f7dde90c6)
2026-02-03 | drm/amd/display: fix wrong color value mapping on MCM shaper LUT | Melissa Wen
Some shimmering, colorful points appear when using the SteamOS color pipeline for HDR gaming on DCN32. These points look like black values being wrongly mapped to red/blue/green values. This was caused by treating the number of HW points in regular LUTs and in the shaper LUT as the same. DCN3+ regular LUTs have 257 bases and implicit deltas (i.e. the HW calculates them), but the shaper LUT is a special case: it has 256 bases and 256 deltas, as in DCN1-2 regular LUTs, and outputs 14-bit values. Fix that by decreasing by 1 the number of HW points computed in the LUT segmentation, so that the shaper LUT (i.e. fixpoint == true) keeps the same DCN10 CM logic and regular LUTs go with `hw_points + 1`. CC: Krunoslav Kovac <Krunoslav.Kovac@amd.com> Fixes: 4d5fd3d08ea9 ("drm/amd/display: PQ tail accuracy") Signed-off-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5006505b19a2119e71c008044d59f6d753c858b9)
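The point-count rule can be written as a tiny helper. This is a hypothetical sketch, not the DCN code: `lut_base_entries()` is an invented name, and it assumes the segmentation has produced 256 points, so the shaper LUT keeps 256 bases (with explicit deltas) while a regular DCN3+ LUT needs 257.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: number of base entries to program for a LUT whose
 * segmentation produced hw_points points.
 */
static int lut_base_entries(int hw_points, bool fixpoint)
{
	/* Shaper LUT: one explicit delta per base, DCN10 CM style. */
	if (fixpoint)
		return hw_points;
	/* DCN3+ regular LUT: deltas are implicit, so one extra closing base. */
	return hw_points + 1;
}
```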
2026-02-03 | Revert "drm/amd: Check if ASPM is enabled from PCIe subsystem" | Bert Karwatzki
This reverts commit 7294863a6f01248d72b61d38478978d638641bee. This commit was erroneously applied again after commit 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") removed it, leading to very hard-to-debug crashes on systems with two AMD GPUs of which only one supports ASPM. Link: https://lore.kernel.org/linux-acpi/20251006120944.7880-1-spasswolf@web.de/ Link: https://github.com/acpica/acpica/issues/1060 Fixes: 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") Signed-off-by: Bert Karwatzki <spasswolf@web.de> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 97a9689300eb2b393ba5efc17c8e5db835917080) Cc: stable@vger.kernel.org
2026-02-03 | drm/amd: Set minimum version for set_hw_resource_1 on gfx11 to 0x52 | Mario Limonciello
Commit f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence") made amdgpu depend on sufficiently new MES firmware. This was fixed on most gfx11 and gfx12 hardware by commit 0180e0a5dd5c ("drm/amdgpu/mes: add compatibility checks for set_hw_resource_1"), but that fix missed that GC 11.0.4 is still broken with MES 0x51. Bump the requirement to 0x52 instead. Reported-by: danijel@nausys.com Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4576 Fixes: f81cd793119e ("drm/amd/amdgpu: Fix MES init sequence") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c2d2ccc85faf8cc6934d50c18e43097eb453ade2) Cc: stable@vger.kernel.org
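The new requirement is a simple minimum-version gate. A minimal sketch, assuming a hypothetical `GC_VER()` packing macro and a hypothetical helper name; only the gfx11 threshold of 0x52 comes from the commit, and the real amdgpu check may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing of a GC IP version (major.minor.rev) into one integer. */
#define GC_VER(maj, min, rev) \
	(((uint32_t)(maj) << 16) | ((uint32_t)(min) << 8) | (uint32_t)(rev))

/* Minimum MES firmware version for set_hw_resource_1 on gfx11, per the commit. */
#define GFX11_MIN_MES_SET_HW_RESOURCE_1 0x52u

/*
 * Hypothetical gate that only models gfx11 (GC 11.x.y); other generations
 * have their own checks, which this sketch does not try to reproduce.
 */
static bool gfx11_mes_supports_set_hw_resource_1(uint32_t gc_version,
						 uint32_t mes_version)
{
	assert(gc_version >= GC_VER(11, 0, 0) && gc_version < GC_VER(12, 0, 0));
	return mes_version >= GFX11_MIN_MES_SET_HW_RESOURCE_1;
}
```

With this gate, GC 11.0.4 paired with MES 0x51 is rejected, which is the combination the report found broken.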
2026-02-03 | drm/amd/pm: Remove buffer allocation in SMUv13.0.6 | Lijo Lazar
A temporary buffer is no longer required while fetching metrics; use the cached metrics table data directly instead. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-03 | drm/amdgpu: Skip vcn poison irq release on VF | Lijo Lazar
A VF does not enable the VCN poison IRQ on VCNv2.5, so skip releasing it to avoid a call trace during deinitialization:
[ 71.913601] [drm] clean up the vf2pf work item
[ 71.915088] ------------[ cut here ]------------
[ 71.915092] WARNING: CPU: 3 PID: 1079 at /tmp/amd.aFkFvSQl/amd/amdgpu/amdgpu_irq.c:641 amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[ 71.915355] Modules linked in: amdgpu(OE-) amddrm_ttm_helper(OE) amdttm(OE) amddrm_buddy(OE) amdxcp(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) drm_suballoc_helper drm_display_helper cec rc_core i2c_algo_bit video wmi binfmt_misc nls_iso8859_1 intel_rapl_msr intel_rapl_common input_leds joydev serio_raw mac_hid qemu_fw_cfg sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel usbhid 8139too sha256_ssse3 sha1_ssse3 hid psmouse bochs i2c_i801 ahci drm_vram_helper libahci i2c_smbus lpc_ich drm_ttm_helper 8139cp mii ttm aesni_intel crypto_simd cryptd
[ 71.915484] CPU: 3 PID: 1079 Comm: rmmod Tainted: G OE 6.8.0-87-generic #88~22.04.1-Ubuntu
[ 71.915489] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.3-2.el9_5.1 04/01/2014
[ 71.915492] RIP: 0010:amdgpu_irq_put+0xc6/0xe0 [amdgpu]
[ 71.915768] Code: 75 84 b8 ea ff ff ff eb d4 44 89 ea 48 89 de 4c 89 e7 e8 fd fc ff ff 5b 41 5c 41 5d 41 5e 5d 31 d2 31 f6 31 ff e9 55 30 3b c7 <0f> 0b eb d4 b8 fe ff ff ff eb a8 e9 b7 3b 8a 00 66 2e 0f 1f 84 00
[ 71.915771] RSP: 0018:ffffcf0800eafa30 EFLAGS: 00010246
[ 71.915775] RAX: 0000000000000000 RBX: ffff891bda4b0668 RCX: 0000000000000000
[ 71.915777] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 71.915779] RBP: ffffcf0800eafa50 R08: 0000000000000000 R09: 0000000000000000
[ 71.915781] R10: 0000000000000000 R11: 0000000000000000 R12: ffff891bda480000
[ 71.915782] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[ 71.915792] FS: 000070cff87c4c40(0000) GS:ffff893abfb80000(0000) knlGS:0000000000000000
[ 71.915795] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 71.915797] CR2: 00005fa13073e478 CR3: 000000010d634006 CR4: 0000000000770ef0
[ 71.915800] PKRU: 55555554
[ 71.915802] Call Trace:
[ 71.915805] <TASK>
[ 71.915809] vcn_v2_5_hw_fini+0x19e/0x1e0 [amdgpu]
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Mangesh Gadre <Mangesh.Gadre@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
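The fix restores symmetry: teardown should only release an interrupt reference that init actually took. A minimal sketch of that rule with a hypothetical refcounted interrupt source; `struct irq_src`, `irq_get()`, `irq_put()` and the `is_vf` flag are illustrative stand-ins loosely modeled on the WARNING in the trace, not amdgpu's real `amdgpu_irq_get()`/`amdgpu_irq_put()` implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical interrupt source: counts how many users enabled it. */
struct irq_src {
	int enabled_refs;
};

/* Take a reference; the first one would enable the interrupt in hardware. */
static void irq_get(struct irq_src *src)
{
	src->enabled_refs++;
}

/* Drop a reference; an unbalanced put is a caller bug, so warn and bail out. */
static int irq_put(struct irq_src *src)
{
	if (src->enabled_refs == 0) {
		fprintf(stderr, "WARNING: irq_put on a source that was never enabled\n");
		return -EINVAL;
	}
	src->enabled_refs--;
	return 0;
}

/* Hypothetical init: a VF never enables the VCN poison IRQ. */
static void poison_irq_init(struct irq_src *poison, bool is_vf)
{
	if (!is_vf)
		irq_get(poison);
}

/* Hypothetical fini: mirror init exactly, so a VF skips the put. */
static int poison_irq_fini(struct irq_src *poison, bool is_vf)
{
	if (is_vf)
		return 0; /* nothing was taken at init time */
	return irq_put(poison);
}
```

In this model, dropping the `is_vf` check in `poison_irq_fini()` produces an unbalanced put, analogous to the warning in the trace.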
2026-02-03 | drm/amd/display: remove assert around dpp_base replacement | Melissa Wen
Nothing is wrong if the in_shaper_func type is DISTRIBUTED POINTS. Remove the assert, which was placed as a TODO, to avoid misinterpretation. Signed-off-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-03 | drm/amd/display: extend delta clamping logic to CM3 LUT helper | Melissa Wen
Commit 27fc10d1095f ("drm/amd/display: Fix the delta clamping for shaper LUT") fixed banding when using the plane shaper LUT in the DCN10 CM helper. The problem is also present in the DCN30 CM helper; fix the banding there by extending the same delta clamping fix to CM3. Signed-off-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
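As a generic illustration of delta clamping for a LUT programmed with explicit deltas, here is a minimal sketch assuming a 14-bit unsigned output range (the figure given for the shaper LUT in the MCM shaper commit). `clamp_u14()` and `build_clamped_lut()` are hypothetical; this does not reproduce the DCN10 or DCN30 CM helper code or commit 27fc10d1095f.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LUT_MAX_14BIT 0x3FFF /* largest unsigned 14-bit value */

/* Clamp a signed intermediate value into the unsigned 14-bit range. */
static uint16_t clamp_u14(int32_t v)
{
	if (v < 0)
		return 0;
	if (v > LUT_MAX_14BIT)
		return LUT_MAX_14BIT;
	return (uint16_t)v;
}

/*
 * Hypothetical helper: turn n + 1 curve samples into n bases and n explicit
 * deltas. Bases and deltas are both derived from clamped samples, so
 * base[i] + delta[i] never exceeds the 14-bit range.
 */
static void build_clamped_lut(const int32_t *samples, size_t n,
			      uint16_t *base, uint16_t *delta)
{
	assert(samples && base && delta);
	for (size_t i = 0; i < n; i++) {
		uint16_t cur = clamp_u14(samples[i]);
		uint16_t next = clamp_u14(samples[i + 1]);

		base[i] = cur;
		/* A decreasing segment would give a negative delta; clamp it to 0. */
		delta[i] = next > cur ? (uint16_t)(next - cur) : 0;
	}
}
```

Computing each delta from the already-clamped neighbours, rather than clamping base and delta independently, keeps every reconstructed point inside the representable range.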
2026-02-03 | drm/amd/display: fix wrong color value mapping on MCM shaper LUT | Melissa Wen
Some shimmering, colorful points appear when using the SteamOS color pipeline for HDR gaming on DCN32. These points look like black values being wrongly mapped to red/blue/green values. This was caused by treating the number of HW points in regular LUTs and in the shaper LUT as the same. DCN3+ regular LUTs have 257 bases and implicit deltas (i.e. the HW calculates them), but the shaper LUT is a special case: it has 256 bases and 256 deltas, as in DCN1-2 regular LUTs, and outputs 14-bit values. Fix that by decreasing by 1 the number of HW points computed in the LUT segmentation, so that the shaper LUT (i.e. fixpoint == true) keeps the same DCN10 CM logic and regular LUTs go with `hw_points + 1`. CC: Krunoslav Kovac <Krunoslav.Kovac@amd.com> Fixes: 4d5fd3d08ea9 ("drm/amd/display: PQ tail accuracy") Signed-off-by: Melissa Wen <mwen@igalia.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
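The layout difference between the two LUT kinds fits in a few lines. A minimal sketch with a hypothetical `lut_layout_for()` helper, where `is_shaper` plays the role of the `fixpoint == true` case; the 256/257 counts come from the commit message, and everything else is illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of what the driver writes for one LUT. */
struct lut_layout {
	unsigned int bases;  /* base values written by the driver */
	unsigned int deltas; /* explicit deltas written by the driver */
};

/*
 * hw_points is the segment count from LUT segmentation (256 here).
 * A DCN3+ regular LUT takes hw_points + 1 bases and the hardware derives
 * the deltas; the shaper LUT keeps the DCN1/2-style layout of hw_points
 * bases plus hw_points explicit deltas.
 */
static struct lut_layout lut_layout_for(unsigned int hw_points, bool is_shaper)
{
	struct lut_layout l;

	assert(hw_points > 0);
	if (is_shaper) {
		l.bases = hw_points;
		l.deltas = hw_points;
	} else {
		l.bases = hw_points + 1;
		l.deltas = 0; /* implicit: computed by hardware */
	}
	return l;
}
```

Using a single count for both layouts is the mismatch the commit describes.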
2026-02-03 | drm/amdgpu: Fix double deletion of validate_list | Harish Kasiviswanathan
If amdgpu_amdkfd_gpuvm_free_memory_of_gpu() fails after kgd_mem is removed from validate_list, the mem handle still lingers in the KFD idr. As a result, when the process is terminated, kfd_process_free_outstanding_kfd_bos() calls amdgpu_amdkfd_gpuvm_free_memory_of_gpu() again, resulting in a double deletion. To avoid this: (a) check whether the list is empty before deleting from it; (b) rearrange amdgpu_amdkfd_gpuvm_free_memory_of_gpu() so that it can be safely called again if it fails the first time. Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
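Point (a) can be sketched in userspace with a stand-in for the kernel's `struct list_head`. The key detail is pairing the emptiness check with `list_del_init()`: the kernel's `list_del()` poisons the entry's pointers, so `list_empty()` does not report an entry removed that way as empty. The message does not say whether the patch uses exactly this pattern; `struct mem_handle` and `remove_from_validate_list()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Insert item right after head. */
static void list_add(struct list_head *item, struct list_head *head)
{
	item->next = head->next;
	item->prev = head;
	head->next->prev = item;
	head->next = item;
}

/* Unlink and re-initialise, so the entry reads as empty afterwards. */
static void list_del_init(struct list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	INIT_LIST_HEAD(item);
}

/* Hypothetical memory handle tracked on a validate list. */
struct mem_handle {
	struct list_head validate_list;
};

/* Safe to call twice: the second call sees an empty (self-linked) entry. */
static void remove_from_validate_list(struct mem_handle *mem)
{
	if (!list_empty(&mem->validate_list))
		list_del_init(&mem->validate_list);
}
```

A retried free path can then call `remove_from_validate_list()` again without corrupting the list.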
2026-02-03 | drm/amdgpu: Ignored various return code | Andrew Martin
The return code of a non-void function should not be ignored. Where the result genuinely does not matter, the code should explicitly suppress it. Signed-off-by: Andrew Martin <andrew.martin@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
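The two patterns in this message, checking a result versus deliberately discarding it, look like this in a minimal sketch with a hypothetical `flush_cache()`. The `(void)` cast is a common way to mark an intentional discard for readers and many static analyzers; note that GCC does not silence `__attribute__((warn_unused_result))` (the kernel's `__must_check`) with a `(void)` cast, so it is not a universal suppression.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical operation that can fail. */
static int flush_cache(int fail)
{
	return fail ? -EIO : 0;
}

/* Caller that cares: check the result and propagate the error. */
static int do_teardown(int fail)
{
	int r = flush_cache(fail);

	if (r)
		return r;
	/* ...further teardown would go here... */
	return 0;
}

/*
 * Best-effort path where failure is acceptable: the (void) cast documents
 * that discarding the result is deliberate.
 */
static void do_best_effort_cleanup(int fail)
{
	(void)flush_cache(fail);
}
```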
2026-02-03 | drm/amdgpu/psp_v15_0_8: Add get ras capability | Jinzhou Su
Add the get RAS capability support for PSP 15.0.8. v2: Remove APU type check and IP version check. Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-03 | drm/amd/pm: Add default feature number definition | Lijo Lazar
The number of default features can differ from the actual width of the bitmap, so use a separate definition for it. Also increase the maximum bitmap width to 128. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-03 | drm/amd/pm: Change get_enabled_mask signature | Lijo Lazar
Use smu_feature_bits instead of a uint64_t pointer and operate on the feature bits. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-03 | drm/amd/pm: Use feature bits data structure | Lijo Lazar
Feature bits are not necessarily limited to 64 bits. Use the smu_feature_bits data structure to represent the feature mask when checking DPM status. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
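The three pm commits above move the SMU feature mask from a bare `uint64_t` to a structure that can hold more than 64 bits (up to 128). A minimal userspace sketch of that idea; `struct feature_bits`, `get_enabled_mask()` and `dpm_running()` are hypothetical stand-ins, not the real `smu_feature_bits` API, and treating DPM as running when any DPM feature bit is set is an assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Maximum bitmap width, per the commit. */
#define FEATURE_BITMAP_WIDTH 128

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define FEATURE_WORDS ((FEATURE_BITMAP_WIDTH + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical stand-in for a feature-bits structure. */
struct feature_bits {
	unsigned long bits[FEATURE_WORDS];
};

static void feature_set(struct feature_bits *f, unsigned int nr)
{
	assert(nr < FEATURE_BITMAP_WIDTH);
	f->bits[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

static bool feature_test(const struct feature_bits *f, unsigned int nr)
{
	assert(nr < FEATURE_BITMAP_WIDTH);
	return (f->bits[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/* Hypothetical get_enabled_mask(): fills a structure, not a uint64_t *. */
static int get_enabled_mask(struct feature_bits *out,
			    const unsigned int *enabled, unsigned int n)
{
	memset(out, 0, sizeof(*out));
	for (unsigned int i = 0; i < n; i++)
		feature_set(out, enabled[i]);
	return 0;
}

/* Assumed rule: DPM counts as running if any bit in dpm_mask is enabled. */
static bool dpm_running(const struct feature_bits *enabled,
			const struct feature_bits *dpm_mask)
{
	for (unsigned int w = 0; w < FEATURE_WORDS; w++)
		if (enabled->bits[w] & dpm_mask->bits[w])
			return true;
	return false;
}
```

For example, a feature numbered 70 cannot be represented in a single `uint64_t` mask, but it fits in this structure.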
2026-02-03 | Revert "drm/amd: Check if ASPM is enabled from PCIe subsystem" | Bert Karwatzki
This reverts commit 7294863a6f01248d72b61d38478978d638641bee. This commit was erroneously applied again after commit 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") removed it, leading to very hard-to-debug crashes on systems with two AMD GPUs of which only one supports ASPM. Link: https://lore.kernel.org/linux-acpi/20251006120944.7880-1-spasswolf@web.de/ Link: https://github.com/acpica/acpica/issues/1060 Fixes: 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device") Signed-off-by: Bert Karwatzki <spasswolf@web.de> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-02-03 | drm/amdgpu: statistic xgmi training error count | Stanley.Yang
Report the uncorrectable error count for XGMI training errors. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>