kernel/drivers/accel/amdxdna/aie2_ctx.c, branch linux-rolling-stable

accel/amdxdna: Fix runtime suspend deadlock when there is pending job

2026-03-19T15:14:56Z

[ Upstream commit 6b13cb8f48a42ddf6dd98865b673a82e37ff238b ]

The runtime suspend callback drains the running job workqueue before suspending the device. If a job is still executing and calls pm_runtime_resume_and_get(), it can deadlock with the runtime suspend path.

Fix this by moving pm_runtime_resume_and_get() from the job execution routine to the job submission routine, ensuring the device is resumed before the job is queued and avoiding the deadlock during runtime suspend.

Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260310180058.336348-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin

accel/amdxdna: Fill invalid payload for failed command

2026-03-12T11:09:45Z

[ Upstream commit 89ff45359abbf9d8d3c4aa3f5a57ed0be82b5a12 ]

Newer userspace applications may read the payload of a failed command to obtain detailed error information. However, the driver and old firmware versions may not support returning advanced error information.

In this case, initialize the command payload with an invalid value so userspace can detect that no detailed error information is available.

Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260227004841.3080241-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin
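
The sentinel idea in the commit above can be sketched in plain userspace C. This is only an illustration, not the driver's code: the all-ones fill pattern and the names model_fill_invalid() and model_has_error_detail() are assumptions made for the example, since the commit message does not say which invalid value the driver writes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel: the real driver's "invalid value" is not stated
 * in the commit message; all-ones is assumed here for illustration. */
#define MODEL_INVALID_BYTE 0xFF

/* Driver side (modelled): pre-fill the payload of a failed command so a
 * reader can tell that no detailed error information was written. */
static void model_fill_invalid(uint32_t *payload, size_t words)
{
	memset(payload, MODEL_INVALID_BYTE, words * sizeof(*payload));
}

/* Userspace side (modelled): returns 1 if any word differs from the
 * sentinel, i.e. something wrote real error detail; 0 otherwise. */
static int model_has_error_detail(const uint32_t *payload, size_t words)
{
	for (size_t i = 0; i < words; i++)
		if (payload[i] != UINT32_MAX)
			return 1;
	return 0;
}
```

With this shape, a newer application would check the payload before trusting it: an untouched sentinel means "no detail available", anything else is treated as error information from firmware.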

accel/amdxdna: Fix command hang on suspended hardware context

2026-03-12T11:09:14Z

[ Upstream commit 07efce5a6611af6714ea3ef65694e0c8dd7e44f5 ]

When a hardware context is suspended, the job scheduler is stopped. If a command is submitted while the context is suspended, the job is queued in the scheduler but aie2_sched_job_run() is never invoked to restart the hardware context. As a result, the command hangs.

Fix this by modifying the hardware context suspend routine to keep the job scheduler running so that queued jobs can trigger context restart properly.

Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260211205341.722982-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin

accel/amdxdna: Fix dead lock for suspend and resume

2026-03-12T11:09:14Z

[ Upstream commit 1aa82181a3c285c7351523d587f7981ae4c015c8 ]

When an application issues a query IOCTL while auto suspend is running, a deadlock can occur. The query path holds dev_lock and then calls pm_runtime_resume_and_get(), which waits for the ongoing suspend to complete. Meanwhile, the suspend callback attempts to acquire dev_lock and blocks, resulting in a deadlock.

Fix this by releasing dev_lock before calling pm_runtime_resume_and_get() and reacquiring it after the call completes. Also acquire dev_lock in the resume callback to keep the locking consistent.

Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260211204644.722758-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin
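
The lock ordering described in the commit above can be modelled in a few lines of single-threaded C. This is a rough sketch under stated assumptions, not the driver's implementation: struct model_dev and every model_* or query_* function are hypothetical stand-ins, and the would-be deadlock is reported as a -1 return value instead of actually blocking.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not the driver's types. */
struct model_dev {
	bool dev_lock_held;   /* models dev_lock being held by the query path */
	bool suspend_running; /* models an auto suspend that is in flight */
	int usage_count;      /* models the runtime PM usage counter */
};

/* Models pm_runtime_resume_and_get(): it must wait for a running suspend,
 * and that suspend callback needs dev_lock to finish.
 * Returns 0 on success, -1 where the real code would deadlock. */
static int model_resume_and_get(struct model_dev *d)
{
	if (d->suspend_running) {
		if (d->dev_lock_held)
			return -1; /* suspend cannot take dev_lock: deadlock */
		d->suspend_running = false; /* suspend took the lock and finished */
	}
	d->usage_count++;
	return 0;
}

/* Old ordering: resume while still holding dev_lock. */
static int query_old(struct model_dev *d)
{
	d->dev_lock_held = true;
	int ret = model_resume_and_get(d);
	d->dev_lock_held = false;
	return ret;
}

/* Fixed ordering: drop dev_lock, resume, then reacquire for the query. */
static int query_fixed(struct model_dev *d)
{
	d->dev_lock_held = true;
	d->dev_lock_held = false; /* release before resuming */
	int ret = model_resume_and_get(d);
	d->dev_lock_held = true;  /* reacquire after the call completes */
	/* ... the query itself would run here, under the lock ... */
	d->dev_lock_held = false;
	return ret;
}
```

In this model, query_old() fails whenever a suspend is in flight, while query_fixed() lets the suspend finish first and then proceeds with the usage counter raised.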

accel/amdxdna: Reduce log noise during process termination

2026-03-12T11:09:14Z

[ Upstream commit 57aa3917a3b3bd805a3679371f97a1ceda3c5510 ]

During process termination, several error messages are logged that are not actual errors but expected conditions when a process is killed or interrupted. This creates unnecessary noise in the kernel log.

The specific scenarios are:

1. HMM invalidation returns -ERESTARTSYS when the wait is interrupted by a signal during process cleanup. This is expected when a process is being terminated and should not be logged as an error.

2. Context destruction returns -ENODEV when the firmware or device has already stopped, which commonly occurs during cleanup if the device was already torn down. This is also an expected condition during orderly shutdown.

Downgrade these expected error conditions from error level to debug level to reduce log noise while still keeping genuine errors visible.

Fixes: 97f27573837e ("accel/amdxdna: Fix potential NULL pointer dereference in context cleanup")
Reviewed-by: Lizhi Hou
Signed-off-by: Mario Limonciello
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260210164521.1094274-3-mario.limonciello@amd.com
Signed-off-by: Sasha Levin

accel/amdxdna: Switch to always use chained command

2026-03-12T11:09:13Z

[ Upstream commit c68a6af400ca80596e8c37de0a1cb564aa9da8a4 ]

Preempt commands are only supported when submitted as chained commands. To ensure preempt support works consistently, always submit commands in chained command format.

Set force_cmdlist to true so that single commands are filled using the chained command layout, enabling correct handling of preempt commands.

Fixes: 3a0ff7b98af4 ("accel/amdxdna: Support preemption requests")
Reviewed-by: Karol Wachowski
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260206060251.4050512-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin

accel/amdxdna: Move RPM resume into job run function

2026-02-26T23:01:03Z

[ Upstream commit 69674c1c704c0199ca7a3947f3cdcd575973175d ]

Currently, amdxdna_pm_resume_get() is called during job creation, and amdxdna_pm_suspend_put() is called when the hardware notifies job completion. If a job is canceled before it is run, no hardware completion notification is generated, resulting in an unbalanced runtime PM resume/suspend pair.

Fix this by moving amdxdna_pm_resume_get() to the job run path, ensuring runtime PM is only resumed for jobs that are actually executed.

Fixes: 063db451832b ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260204171118.3165607-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin
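
The reference-count imbalance described in the commit above can be illustrated with a small counter model. This is a sketch only: struct model_pm, enum model_get_point and model_job_lifecycle() are hypothetical names, and the counter merely stands in for the pairing of amdxdna_pm_resume_get() and amdxdna_pm_suspend_put().

```c
#include <stdbool.h>

/* Hypothetical model of the runtime PM usage counter. */
struct model_pm {
	int usage;
};

/* Where the "get" is taken in the modelled job lifecycle. */
enum model_get_point {
	MODEL_GET_AT_CREATE, /* behaviour before this commit */
	MODEL_GET_AT_RUN     /* behaviour after this commit */
};

/* Walks one job through create -> (cancel | run -> completion) and
 * returns the usage counter afterwards. A balanced lifecycle leaves
 * the counter where it started. */
static int model_job_lifecycle(struct model_pm *pm,
			       enum model_get_point where,
			       bool canceled_before_run)
{
	if (where == MODEL_GET_AT_CREATE)
		pm->usage++; /* get taken at job creation */

	if (canceled_before_run)
		return pm->usage; /* never runs, so no completion and no put */

	if (where == MODEL_GET_AT_RUN)
		pm->usage++; /* get taken only when the job actually runs */

	pm->usage--; /* put on hardware completion notification */
	return pm->usage;
}
```

In the model, a canceled job leaves the counter at 1 when the get is taken at creation and at 0 when it is taken at run time, which is the imbalance the commit message describes.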

accel/amdxdna: Fix incorrect error code returned for failed chain command

2026-02-26T23:01:03Z

[ Upstream commit 750817a7c41de083ca5d73052e97bb7b67d7c394 ]

The driver currently returns an incorrect error code when a chain command fails. In this case, ERT_CMD_STATE_ERROR is expected to be reported for failed chain commands.

Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD)
Reviewed-by: Maciej Falkowski
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260203184037.2751889-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin

accel/amdxdna: Remove hardware context status

2026-02-26T23:01:03Z

[ Upstream commit b853007fdcdd64b49601a993c2b30c28279ae15d ]

One newly supported command does not require hardware context configuration to be performed upfront. As a result, checking hardware context status causes this command to fail incorrectly.

Remove hardware context status handling entirely. For other commands, if userspace submits a request without configuring the hardware context first, the firmware will report an error or time out as appropriate.

Fixes: aac243092b70 ("accel/amdxdna: Add command execution")
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20260202212450.2681273-1-lizhi.hou@amd.com
Signed-off-by: Sasha Levin

accel/amdxdna: Enable temporal sharing only mode

2026-02-26T23:01:03Z

[ Upstream commit 7818618a09a06320f409571bf28801ccfe7e0a30 ]

Newer firmware versions prefer temporal sharing only mode. In this mode, the driver no longer needs to manage AIE array column allocation. Instead, a new field, num_unused_col, is added to the hardware context creation request to specify how many columns will not be used by this hardware context.

Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Lizhi Hou
Link: https://patch.msgid.link/20251217191150.2145937-1-lizhi.hou@amd.com
Stable-dep-of: b853007fdcdd ("accel/amdxdna: Remove hardware context status")
Signed-off-by: Sasha Levin