kernel/drivers/gpu/drm/i915/intel_ringbuffer.c, branch linux-5.1.y

drm/i915/ringbuffer: EMIT_INVALIDATE before switch context

2019-07-10T07:52:30Z

commit c84c9029d782a3a0d2a7f0522ecb907314d43e2c upstream. Despite what I think the prm recommends, commit f2253bd9859b ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context") turned out to be a huge mistake when enabling Ironlake contexts as the GPU would hang on either a MI_FLUSH or PIPE_CONTROL immediately following the MI_SET_CONTEXT of an active mesa context (more vanilla contexts, e.g. simple rendercopies with igt, do not suffer). Ville found the following clue, "[DevCTG+]: For the invalidate operation of the pipe control, the following pointers are affected. The invalidate operation affects the restore of these packets. If the pipe control invalidate operation is completed before the context save, the indirect pointers will not be restored from memory. 1. Pipeline State Pointer 2. Media State Pointer 3. Constant Buffer Packet" which suggests by us emitting the INVALIDATE prior to the MI_SET_CONTEXT, we prevent the context-restore from chasing the dangling pointers within the image, and explains why this likely prevents the GPU hang. Signed-off-by: Chris Wilson Reviewed-by: Ville Syrjälä Link: https://patchwork.freedesktop.org/patch/msgid/20190419111749.3910-1-chris@chris-wilson.co.uk (cherry picked from commit 928f8f42310f244501a7c70daac82c196112c190 in drm-intel-next) Cc: stable@vger.kernel.org Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111014 Fixes: f2253bd9859b ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context") Signed-off-by: Daniel Vetter Signed-off-by: Greg Kroah-Hartman

drm/i915: Hack and slash, throttle execbuffer hogs

2019-02-07T16:13:21Z

Apply backpressure to hogs that emit requests faster than the GPU can process them by waiting for their ring to be less than half-full before proceeding with taking the struct_mutex. This is a gross hack to apply throttling backpressure, the long term goal is to remove the struct_mutex contention so that each client naturally waits, preferably in an asynchronous, nonblocking fashion (pipelined operations for the win), for their own resources and never blocks another client within the driver at least. (Realtime priority goals would extend to ensuring that resource contention favours high priority clients as well.) This patch only limits excessive request production and does not attempt to throttle clients that block waiting for eviction (either global GTT or system memory) or any other global resources, see above for the long term goal. No microbenchmarks are harmed (to the best of my knowledge). Testcase: igt/gem_exec_schedule/pi-ringfull-* Signed-off-by: Chris Wilson Cc: Tvrtko Ursulin Cc: Joonas Lahtinen Cc: John Harrison Reviewed-by: Joonas Lahtinen Link: https://patchwork.freedesktop.org/patch/msgid/20190207071829.5574-1-chris@chris-wilson.co.uk

drm/i915: Replace global breadcrumbs with per-context interrupt tracking

2019-01-29T21:45:22Z

A few years ago, see commit 688e6c725816 ("drm/i915: Slaughter the thundering i915_wait_request herd"), the issue of handling multiple clients waiting in parallel was brought to our attention. The requirement was that every client should be woken immediately upon its request being signaled, without incurring any cpu overhead. To handle certain fragility of our hw meant that we could not do a simple check inside the irq handler (some generations required almost unbounded delays before we could be sure of seqno coherency) and so request completion checking required delegation. Before commit 688e6c725816, the solution was simple. Every client waiting on a request would be woken on every interrupt and each would do a heavyweight check to see if their request was complete. Commit 688e6c725816 introduced an rbtree so that only the earliest waiter on the global timeline would woken, and would wake the next and so on. (Along with various complications to handle requests being reordered along the global timeline, and also a requirement for kthread to provide a delegate for fence signaling that had no process context.) The global rbtree depends on knowing the execution timeline (and global seqno). Without knowing that order, we must instead check all contexts queued to the HW to see which may have advanced. We trim that list by only checking queued contexts that are being waited on, but still we keep a list of all active contexts and their active signalers that we inspect from inside the irq handler. By moving the waiters onto the fence signal list, we can combine the client wakeup with the dma_fence signaling (a dramatic reduction in complexity, but does require the HW being coherent, the seqno must be visible from the cpu before the interrupt is raised - we keep a timer backup just in case). Having previously fixed all the issues with irq-seqno serialisation (by inserting delays onto the GPU after each request instead of random delays on the CPU after each interrupt), we can rely on the seqno state to perfom direct wakeups from the interrupt handler. This allows us to preserve our single context switch behaviour of the current routine, with the only downside that we lose the RT priority sorting of wakeups. In general, direct wakeup latency of multiple clients is about the same (about 10% better in most cases) with a reduction in total CPU time spent in the waiter (about 20-50% depending on gen). Average herd behaviour is improved, but at the cost of not delegating wakeups on task_prio. v2: Capture fence signaling state for error state and add comments to warm even the most cold of hearts. v3: Check if the request is still active before busywaiting v4: Reduce the amount of pointer misdirection with list_for_each_safe and using a local i915_request variable inside the loops v5: Add a missing pluralisation to a purely informative selftest message. References: 688e6c725816 ("drm/i915: Slaughter the thundering i915_wait_request herd") Signed-off-by: Chris Wilson Cc: Tvrtko Ursulin Reviewed-by: Tvrtko Ursulin Link: https://patchwork.freedesktop.org/patch/msgid/20190129205230.19056-2-chris@chris-wilson.co.uk

drm/i915: Identify active requests

2019-01-29T19:59:59Z

To allow requests to forgo a common execution timeline, one question we need to be able to answer is "is this request running?". To track whether a request has started on HW, we can emit a breadcrumb at the beginning of the request and check its timeline's HWSP to see if the breadcrumb has advanced past the start of this request. (This is in contrast to the global timeline where we need only ask if we are on the global timeline and if the timeline has advanced past the end of the previous request.) There is still confusion from a preempted request, which has already started but relinquished the HW to a high priority request. For the common case, this discrepancy should be negligible. However, for identification of hung requests, knowing which one was running at the time of the hang will be much more important. Signed-off-by: Chris Wilson Reviewed-by: Tvrtko Ursulin Link: https://patchwork.freedesktop.org/patch/msgid/20190129185452.20989-2-chris@chris-wilson.co.uk

drm/i915: Track the context's seqno in its own timeline HWSP

2019-01-28T19:07:09Z

Now that we have allocated ourselves a cacheline to store a breadcrumb, we can emit a write from the GPU into the timeline's HWSP of the per-context seqno as we complete each request. This drops the mirroring of the per-engine HWSP and allows each context to operate independently. We do not need to unwind the per-context timeline, and so requests are always consistent with the timeline breadcrumb, greatly simplifying the completion checks as we no longer need to be concerned about the global_seqno changing mid check. One complication though is that we have to be wary that the request may outlive the HWSP and so avoid touching the potentially danging pointer after we have retired the fence. We also have to guard our access of the HWSP with RCU, the release of the obj->mm.pages should already be RCU-safe. At this point, we are emitting both per-context and global seqno and still using the single per-engine execution timeline for resolving interrupts. v2: s/fake_complete/mark_complete/ Signed-off-by: Chris Wilson Reviewed-by: Tvrtko Ursulin Link: https://patchwork.freedesktop.org/patch/msgid/20190128181812.22804-5-chris@chris-wilson.co.uk

drm/i915: Allocate a status page for each timeline

2019-01-28T19:07:02Z

Allocate a page for use as a status page by a group of timelines, as we only need a dword of storage for each (rounded up to the cacheline for safety) we can pack multiple timelines into the same page. Each timeline will then be able to track its own HW seqno. v2: Reuse the common per-engine HWSP for the solitary ringbuffer timeline, so that we do not have to emit (using per-gen specialised vfuncs) the breadcrumb into the distinct timeline HWSP and instead can keep on using the common MI_STORE_DWORD_INDEX. However, to maintain the sleight-of-hand for the global/per-context seqno switchover, we will store both temporarily (and so use a custom offset for the shared timeline HWSP until the switch over). v3: Keep things simple and allocate a page for each timeline, page sharing comes next. v4: I was caught repeating the same MI_STORE_DWORD_IMM over and over again in selftests. v5: And caught red handed copying create timeline + check. Signed-off-by: Chris Wilson Reviewed-by: Tvrtko Ursulin Link: https://patchwork.freedesktop.org/patch/msgid/20190128181812.22804-3-chris@chris-wilson.co.uk

drm/i915: Always allocate an object/vma for the HWSP

2019-01-28T16:24:19Z

Currently we only allocate an object and vma if we are using a GGTT virtual HWSP, and a plain struct page for a physical HWSP. For convenience later on with global timelines, it will be useful to always have the status page being tracked by a struct i915_vma. Make it so. Signed-off-by: Chris Wilson Reviewed-by: Matthew Auld Link: https://patchwork.freedesktop.org/patch/msgid/20190128102356.15037-4-chris@chris-wilson.co.uk

drm/i915: Remove GPU reset dependence on struct_mutex

2019-01-25T14:27:22Z

Now that the submission backends are controlled via their own spinlocks, with a wave of a magic wand we can lift the struct_mutex requirement around GPU reset. That is we allow the submission frontend (userspace) to keep on submitting while we process the GPU reset as we can suspend the backend independently. The major change is around the backoff/handoff strategy for performing the reset. With no mutex deadlock, we no longer have to coordinate with any waiter, and just perform the reset immediately. Testcase: igt/gem_mmap_gtt/hang # regresses Signed-off-by: Chris Wilson Reviewed-by: Mika Kuoppala Link: https://patchwork.freedesktop.org/patch/msgid/20190125132230.22221-3-chris@chris-wilson.co.uk

drm/i915: Remove manual breadcumb counting

2019-01-25T12:53:13Z

Now that we know we measure the size of the engine->emit_breadcrumb() correctly, we can remove the previous manual counting. Signed-off-by: Chris Wilson Reviewed-by: Mika Kuoppala Link: https://patchwork.freedesktop.org/patch/msgid/20190125120005.25191-1-chris@chris-wilson.co.uk

drm/i915: Measure the required reserved size for request emission

2019-01-25T11:19:39Z

Instead of tediously and fragilely counting up the number of dwords required to emit the breadcrumb to seal a request, fake a request and measure it automatically once during engine setup. The downside is that this requires a fair amount of mocking to create a proper breadcrumb. Still, should be less error prone in future as the breadcrumb size fluctuates! Signed-off-by: Chris Wilson Reviewed-by: Mika Kuoppala Link: https://patchwork.freedesktop.org/patch/msgid/20190125100520.20163-1-chris@chris-wilson.co.uk