kernel/drivers/accel/amdxdna/amdxdna_ctx.c, branch linux-rolling-stable

accel/amdxdna: Fix runtime suspend deadlock when there is pending job

2026-03-19T15:14:56Z

[ Upstream commit 6b13cb8f48a42ddf6dd98865b673a82e37ff238b ]

The runtime suspend callback drains the running job workqueue before suspending the device. If a job is still executing and calls pm_runtime_resume_and_get(), it can deadlock with the runtime suspend path. The resume call waits for the in-progress suspend to finish, while the suspend callback waits for the job to finish.

Fix this by moving pm_runtime_resume_and_get() from the job execution routine to the job submission routine. This ensures the device is resumed before the job is queued and avoids the deadlock during runtime suspend.

Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260310180058.336348-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin
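
The ordering change can be illustrated with a small single-threaded userspace model. Everything here is hypothetical: `struct rpm_model` and the `model_*` functions only mimic the relevant behavior of pm_runtime_resume_and_get() and pm_runtime_put_autosuspend(), and the rule that suspend is not attempted while the usage count is nonzero stands in for the runtime PM core. It is a sketch of the idea, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of runtime PM state; not the kernel API. */
struct rpm_model {
	int usage;        /* stands in for the runtime PM usage counter */
	bool suspended;   /* device power state */
	int queued_jobs;  /* jobs waiting in the job workqueue */
};

/* Models pm_runtime_resume_and_get(): resume if needed, take a reference. */
static int model_resume_and_get(struct rpm_model *d)
{
	d->suspended = false;
	d->usage++;
	return 0;
}

/* Models pm_runtime_put_autosuspend(): drop a reference. */
static void model_put(struct rpm_model *d)
{
	d->usage--;
}

/* Submission path: resume the device before the job is queued. */
static int model_submit_job(struct rpm_model *d)
{
	int ret = model_resume_and_get(d);

	if (ret)
		return ret;
	d->queued_jobs++;
	return 0;
}

/* Execution path: runs from the workqueue and never calls into runtime PM. */
static void model_run_job(struct rpm_model *d)
{
	assert(!d->suspended);   /* submission already resumed the device */
	d->queued_jobs--;
}

/* Job completion drops the reference taken at submission time. */
static void model_job_done(struct rpm_model *d)
{
	model_put(d);
}

/* Suspend is only attempted with no references held, so draining the
 * workqueue never waits on a job that itself wants to resume. */
static bool model_try_suspend(struct rpm_model *d)
{
	if (d->usage > 0)
		return false;
	d->suspended = true;
	return true;
}
```

Because the reference is taken at submission and dropped at completion, the usage count stays above zero while any job is queued, so the suspend path never has to drain a job that is blocked on a resume.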

accel/amdxdna: Fill invalid payload for failed command

2026-03-12T11:09:45Z

[ Upstream commit 89ff45359abbf9d8d3c4aa3f5a57ed0be82b5a12 ]

Newer userspace applications may read the payload of a failed command to obtain detailed error information. However, the driver and old firmware versions may not support returning advanced error information. In this case, initialize the command payload with an invalid value so userspace can detect that no detailed error information is available.

Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260227004841.3080241-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin
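
A minimal sketch of the idea, assuming a hypothetical `struct cmd` with a count of valid payload words and a hypothetical marker value `CMD_PAYLOAD_INVALID`. The driver's actual command layout and invalid value may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker value; the real driver may use a different one. */
#define CMD_PAYLOAD_INVALID 0xFFFFFFFFu
#define CMD_MAX_PAYLOAD 8

enum cmd_state { CMD_STATE_COMPLETED, CMD_STATE_ERROR };

/* Hypothetical command layout: a state, a count of valid payload
 * words, and the payload itself. */
struct cmd {
	enum cmd_state state;
	size_t count;
	uint32_t payload[CMD_MAX_PAYLOAD];
};

/* Driver side: the command failed and no detailed error information
 * is available, so overwrite the valid payload words with the marker. */
static void cmd_fail_without_details(struct cmd *c)
{
	size_t n = c->count < CMD_MAX_PAYLOAD ? c->count : CMD_MAX_PAYLOAD;

	c->state = CMD_STATE_ERROR;
	for (size_t i = 0; i < n; i++)
		c->payload[i] = CMD_PAYLOAD_INVALID;
}

/* Userspace side: treat the payload as error details only if the
 * first word is not the invalid marker. */
static bool cmd_has_error_details(const struct cmd *c)
{
	return c->state == CMD_STATE_ERROR && c->count > 0 &&
	       c->payload[0] != CMD_PAYLOAD_INVALID;
}
```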

accel/amdxdna: Validate command buffer payload count

2026-03-12T11:09:15Z

[ Upstream commit 901ec3470994006bc8dd02399e16b675566c3416 ]

The count field in the command header is used to determine the valid payload size. Verify that the valid payload does not exceed the remaining buffer space.

Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260219211946.1920485-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin
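
The check can be sketched as follows, assuming a hypothetical header that holds a 32-bit `count` of 32-bit payload words placed directly after it. Comparing `count` against the remaining space divided by the word size, rather than multiplying `count` by the word size, avoids integer overflow when userspace supplies a huge `count`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout: one 32-bit word followed by `count`
 * 32-bit payload words. The real layout may differ. */
struct cmd_hdr {
	uint32_t count;   /* number of valid payload words */
};

/* Returns true if `count` payload words fit in what remains of a
 * buffer of `buf_size` bytes after the header. */
static bool cmd_payload_fits(const struct cmd_hdr *hdr, size_t buf_size)
{
	size_t remaining;

	if (buf_size < sizeof(*hdr))
		return false;
	remaining = buf_size - sizeof(*hdr);
	/* Divide instead of multiplying to avoid overflow in count * 4. */
	return hdr->count <= remaining / sizeof(uint32_t);
}
```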

accel/amdxdna: Fix dead lock for suspend and resume

2026-03-12T11:09:14Z

[ Upstream commit 1aa82181a3c285c7351523d587f7981ae4c015c8 ]

When an application issues a query IOCTL while auto suspend is running, a deadlock can occur. The query path holds dev_lock and then calls pm_runtime_resume_and_get(), which waits for the ongoing suspend to complete. Meanwhile, the suspend callback attempts to acquire dev_lock and blocks, resulting in a deadlock.

Fix this by releasing dev_lock before calling pm_runtime_resume_and_get() and reacquiring it after the call completes. Also acquire dev_lock in the resume callback to keep the locking consistent.

Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260211204644.722758-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin
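
The lock-ordering problem and its fix can be modeled in a single thread. In this hypothetical sketch, `dev_lock` is a flag, a suspend already in progress must take `dev_lock` to finish, the resume callback also needs `dev_lock`, and returning -EDEADLK marks the point where the real code would block forever. None of these names are the driver's.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Single-threaded model; the real driver uses a mutex and the
 * runtime PM core. */
struct dev_model {
	bool dev_lock_held;
	bool suspend_in_progress;
	bool suspended;
};

/* Suspend callback: needs dev_lock to finish. */
static int model_suspend_cb(struct dev_model *d)
{
	if (d->dev_lock_held)
		return -EDEADLK;   /* would block forever on dev_lock */
	d->suspend_in_progress = false;
	d->suspended = true;
	return 0;
}

/* Resume callback: takes and releases dev_lock internally. */
static int model_resume_cb(struct dev_model *d)
{
	if (d->dev_lock_held)
		return -EDEADLK;
	d->suspended = false;
	return 0;
}

/* Models pm_runtime_resume_and_get(): wait for an in-progress suspend,
 * then resume the device if it is suspended. */
static int model_resume_and_get(struct dev_model *d)
{
	int ret;

	if (d->suspend_in_progress) {
		ret = model_suspend_cb(d);
		if (ret)
			return ret;
	}
	if (d->suspended)
		return model_resume_cb(d);
	return 0;
}

/* Original query path: resumes while still holding dev_lock. */
static int model_query_buggy(struct dev_model *d)
{
	int ret;

	d->dev_lock_held = true;
	ret = model_resume_and_get(d);
	d->dev_lock_held = false;
	return ret;
}

/* Fixed query path: drop dev_lock around the resume call. */
static int model_query(struct dev_model *d)
{
	int ret;

	d->dev_lock_held = true;    /* mutex_lock(&dev_lock) */
	d->dev_lock_held = false;   /* unlock before resuming */
	ret = model_resume_and_get(d);
	d->dev_lock_held = true;    /* reacquire after the call */
	/* On success, hardware would be accessed here under dev_lock. */
	d->dev_lock_held = false;
	return ret;
}
```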

accel/amdxdna: Fix dma_fence leak when job is canceled

2025-11-06T17:23:42Z

Currently, dma_fence_put(job->fence) is called in the job notification callback. However, if a job is canceled, the notification callback is never invoked, leading to a memory leak. Move dma_fence_put(job->fence) to the job cleanup function to ensure the fence is always released.

Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20251105194140.1004314-1-lizhi.hou@amd.com
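
A minimal refcount model of the fix, using hypothetical `struct fence_model` and `struct job_model` types in place of `struct dma_fence` and the driver's job structure. The point is that the only path every job is guaranteed to take, cleanup, owns the put.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted dma_fence. */
struct fence_model { int refcount; };

struct job_model {
	struct fence_model *fence;
	bool notified;
};

static void fence_get(struct fence_model *f) { f->refcount++; }
static void fence_put(struct fence_model *f) { f->refcount--; }

/* The job holds its own fence reference. */
static void job_init(struct job_model *job, struct fence_model *f)
{
	fence_get(f);
	job->fence = f;
	job->notified = false;
}

/* Notification callback: runs only if the job actually completes,
 * so it no longer drops the fence reference. */
static void job_notify(struct job_model *job)
{
	job->notified = true;
}

/* Cleanup runs for completed and canceled jobs alike, so releasing
 * the fence here puts it exactly once on every path. */
static void job_cleanup(struct job_model *job)
{
	fence_put(job->fence);
	job->fence = NULL;
}
```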

accel/amdxdna: Use MSG_OP_CHAIN_EXEC_NPU when supported

2025-11-03T17:20:39Z

MSG_OP_CHAIN_EXEC_NPU is a unified mailbox message that replaces MSG_OP_CHAIN_EXEC_BUFFER_CF and MSG_OP_CHAIN_EXEC_DPU. Add driver logic to check the firmware version and, if MSG_OP_CHAIN_EXEC_NPU is supported, use it to submit firmware commands.

Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20251031014700.2919349-1-lizhi.hou@amd.com
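
The version gate could look roughly like this. The opcode names come from the commit, but their enum values here are placeholders, not the real mailbox opcodes. The `fw_version` type, the `cmd_kind` mapping onto the two legacy opcodes, and the minimum version `npu_chain_min` are hypothetical; the driver defines its own threshold.

```c
#include <assert.h>
#include <stdbool.h>

/* Opcode names from the commit; values are placeholders. */
enum msg_op {
	MSG_OP_CHAIN_EXEC_BUFFER_CF,
	MSG_OP_CHAIN_EXEC_DPU,
	MSG_OP_CHAIN_EXEC_NPU,
};

/* Hypothetical command kinds that map to the legacy opcodes. */
enum cmd_kind { CMD_KIND_CF, CMD_KIND_DPU };

struct fw_version { unsigned int major, minor; };

/* Hypothetical first firmware version accepting the unified message. */
static const struct fw_version npu_chain_min = { 7, 0 };

static bool fw_at_least(struct fw_version v, struct fw_version min)
{
	return v.major > min.major ||
	       (v.major == min.major && v.minor >= min.minor);
}

/* Prefer the unified message when the firmware supports it,
 * otherwise fall back to the per-kind legacy opcode. */
static enum msg_op pick_chain_op(struct fw_version fw, enum cmd_kind kind)
{
	if (fw_at_least(fw, npu_chain_min))
		return MSG_OP_CHAIN_EXEC_NPU;
	return kind == CMD_KIND_DPU ? MSG_OP_CHAIN_EXEC_DPU
				    : MSG_OP_CHAIN_EXEC_BUFFER_CF;
}
```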

accel/amdxdna: Support firmware debug buffer

2025-10-20T16:07:12Z

To collect firmware debug information, the userspace application allocates an AMDXDNA_BO_DEV buffer object through DRM_IOCTL_AMDXDNA_CREATE_BO. It then associates the buffer with the hardware context through DRM_IOCTL_AMDXDNA_CONFIG_HWCTX, which requests the firmware to bind the buffer through a mailbox command. The firmware then writes debug data into this buffer. The buffer can be mapped into userspace so that applications can retrieve and analyze the firmware debug information.

Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://lore.kernel.org/r/20251016203016.819441-1-lizhi.hou@amd.com

accel/amdxdna: Enhance runtime power management

2025-09-24T20:47:59Z

Currently, pm_runtime_resume_and_get() is invoked in the driver's open callback, and pm_runtime_put_autosuspend() is called in the close callback. As a result, the device remains active whenever an application opens it, even if no I/O is performed, leading to unnecessary power consumption.

Move the runtime PM calls to the AIE2 callbacks that actually interact with the hardware. The device will automatically suspend after 5 seconds of inactivity (no hardware accesses and no pending commands), and it will be resumed on the next hardware access.

Reviewed-by: Karol Wachowski
Signed-off-by: Lizhi Hou
Link: https://lore.kernel.org/r/20250923152229.1303625-1-lizhi.hou@amd.com
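
The inactivity rule can be modeled with a usage count and a last-busy timestamp. This is a hypothetical sketch of the policy described above (suspend once nothing is in flight and 5000 ms have passed since the last access), not the runtime PM core, which implements the same idea through pm_runtime_mark_last_busy() and the autosuspend delay.

```c
#include <assert.h>
#include <stdbool.h>

#define AUTOSUSPEND_DELAY_MS 5000UL   /* 5 seconds, per the commit */

/* Hypothetical device model; not the runtime PM core. */
struct apm_model {
	int usage;                   /* in-flight hardware accesses and commands */
	bool active;
	unsigned long last_busy_ms;
};

/* Start of a hardware access or command: resume and count it. */
static void hw_access_begin(struct apm_model *d, unsigned long now_ms)
{
	d->usage++;
	d->active = true;
	d->last_busy_ms = now_ms;
}

/* End of the access: drop the count and record when it happened. */
static void hw_access_end(struct apm_model *d, unsigned long now_ms)
{
	d->usage--;
	d->last_busy_ms = now_ms;
}

/* Stand-in for the autosuspend timer firing at time now_ms. */
static void autosuspend_tick(struct apm_model *d, unsigned long now_ms)
{
	if (d->active && d->usage == 0 &&
	    now_ms - d->last_busy_ms >= AUTOSUSPEND_DELAY_MS)
		d->active = false;
}
```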

accel/amdxdna: Add a function to walk hardware contexts

2025-08-18T15:35:57Z

Walking the hardware contexts created by a process is duplicated in multiple places. Add a function, amdxdna_hwctx_walk(), and use it in all of them. hwctx_srcu and dev_lock are sufficient to protect the hardware context list, so remove hwctx_lock.

Reviewed-by: Mario Limonciello
Signed-off-by: Lizhi Hou
Link: https://lore.kernel.org/r/20250815171634.3417487-1-lizhi.hou@amd.com
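
A helper in the spirit of amdxdna_hwctx_walk() might take a callback and an opaque argument, as sketched below. The types, the fixed-size array and the early-exit convention (a nonzero return stops the walk and is propagated to the caller) are assumptions for illustration; the driver iterates its own context container under hwctx_srcu and dev_lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context and client types. */
struct hwctx { int id; };

struct client {
	struct hwctx *ctx[8];
	size_t nr_ctx;
};

typedef int (*hwctx_walk_fn)(struct hwctx *ctx, void *arg);

/* Visit every context of a client; stop early and return the
 * callback's value if it is nonzero. */
static int hwctx_walk(struct client *c, void *arg, hwctx_walk_fn walk)
{
	for (size_t i = 0; i < c->nr_ctx; i++) {
		int ret = walk(c->ctx[i], arg);

		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: count contexts into *(int *)arg. */
static int count_ctx(struct hwctx *ctx, void *arg)
{
	(void)ctx;
	(*(int *)arg)++;
	return 0;
}

/* Example callback: return 1 (stop) when the id matches *(int *)arg. */
static int find_ctx(struct hwctx *ctx, void *arg)
{
	return ctx->id == *(int *)arg;
}
```

Each former open-coded loop then becomes one call with a small callback, which is the deduplication the commit describes.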

accel/amdxdna: Unify pm and rpm suspend and resume callbacks

2025-08-06T17:31:55Z

The suspend and resume callbacks for system PM and runtime PM should be the same. During suspend, all hardware contexts must be stopped first, and they are restarted after the device is resumed.

Reviewed-by: Mario Limonciello (AMD)
Reviewed-by: Maciej Falkowski
Signed-off-by: Lizhi Hou
Link: https://lore.kernel.org/r/20250803191450.1568851-1-lizhi.hou@amd.com
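
In the kernel, sharing one pair of callbacks is commonly expressed by passing the same functions to both SET_SYSTEM_SLEEP_PM_OPS() and SET_RUNTIME_PM_OPS() in a struct dev_pm_ops. The userspace model below mimics that with a plain struct; `struct xdna_dev`, `xdna_suspend()` and `xdna_resume()` are hypothetical and assume every context was running before suspend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device with a fixed set of hardware contexts. */
struct hwctx { bool running; };

struct xdna_dev {
	struct hwctx ctx[4];
	size_t nr_ctx;
	bool powered;
};

/* Shared suspend: stop every hardware context, then power down. */
static int xdna_suspend(struct xdna_dev *d)
{
	for (size_t i = 0; i < d->nr_ctx; i++)
		d->ctx[i].running = false;
	d->powered = false;
	return 0;
}

/* Shared resume: power up first, then restart the contexts. */
static int xdna_resume(struct xdna_dev *d)
{
	d->powered = true;
	for (size_t i = 0; i < d->nr_ctx; i++)
		d->ctx[i].running = true;
	return 0;
}

/* Both PM paths point at the same pair of callbacks. */
struct pm_ops_model {
	int (*suspend)(struct xdna_dev *);
	int (*resume)(struct xdna_dev *);
	int (*runtime_suspend)(struct xdna_dev *);
	int (*runtime_resume)(struct xdna_dev *);
};

static const struct pm_ops_model xdna_pm_ops = {
	.suspend = xdna_suspend,
	.resume = xdna_resume,
	.runtime_suspend = xdna_suspend,
	.runtime_resume = xdna_resume,
};
```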