summaryrefslogtreecommitdiff
path: root/include/trace/events/sched_ext.h
AgeCommit message (Collapse)Author
2025-11-12sched_ext: Implement load balancer for bypass modeTejun Heo
In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well in most cases, there is a failure mode where a BPF scheduler can skew task placement severely before triggering bypass in highly over-saturated systems. If most tasks end up concentrated on a few CPUs, those CPUs can accumulate queues that are too long to drain in a reasonable time, leading to RCU stalls and hung tasks. Implement a simple timer-based load balancer that redistributes tasks across CPUs within each NUMA node. The balancer runs periodically (default 500ms, tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded CPUs to underloaded ones. When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks to avoid dropping and reacquiring the donor DSQ lock on each iteration, as donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load balancer timer function reads scx_bypass_depth locklessly to check whether bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair with the READ_ONCE() in the timer function. This has been tested on a 192 CPU dual socket AMD EPYC machine with ~20k runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost all tasks end up on CPU0 creating severe imbalance. Without the load balancer, disabling the scheduler can lead to RCU stalls and hung tasks, taking a very long time to complete. With the load balancer, disable completes in about a second. The load balancing operation can be monitored using the sched_ext_bypass_lb tracepoint and disabled by setting bypass_lb_intv_us to 0. v2: Lock both rq and DSQ in bypass_lb_cpu() and use dispatch_dequeue_locked() to prevent races with dispatch_dequeue() (Andrea Righi). Cc: Andrea Righi <arighi@nvidia.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Reviewed_by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-04sched_ext: Add trace point to track sched_ext core eventsChangwoo Min
Add tracing support to track sched_ext core events (/sched_ext/sched_ext_event). This may be useful for debugging sched_ext schedulers that trigger a particular event. The trace point can be used as other trace points, so it can be used in, for example, `perf trace` and BPF programs, as follows: ====== $> sudo perf trace -e sched_ext:sched_ext_event --filter 'name == "SCX_EV_ENQ_SLICE_DFL"' ====== ====== struct tp_sched_ext_event { struct trace_entry ent; u32 __data_loc_name; s64 delta; }; SEC("tracepoint/sched_ext/sched_ext_event") int rtp_add_event(struct tp_sched_ext_event *ctx) { char event_name[128]; unsigned short offset = ctx->__data_loc_name & 0xFFFF; bpf_probe_read_str((void *)event_name, 128, (char *)ctx + offset); bpf_printk("name %s delta %lld", event_name, ctx->delta); return 0; } ====== Signed-off-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-18sched_ext: Print debug dump after an error exitTejun Heo
If a BPF scheduler triggers an error, the scheduler is aborted and the system is reverted to the built-in scheduler. In the process, a lot of information which may be useful for figuring out what happened can be lost. This patch adds debug dump which captures information which may be useful for debugging including runqueue and runnable thread states at the time of failure. The following shows a debug dump after triggering the watchdog: root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100 stats : enq=1 dsp=0 delta=1 deq=0 stats : enq=90 dsp=90 delta=0 deq=0 stats : enq=156 dsp=156 delta=0 deq=0 stats : enq=218 dsp=218 delta=0 deq=0 stats : enq=255 dsp=255 delta=0 deq=0 stats : enq=271 dsp=271 delta=0 deq=0 stats : enq=284 dsp=284 delta=0 deq=0 stats : enq=293 dsp=293 delta=0 deq=0 DEBUG DUMP ================================================================================ kworker/u32:12[320] triggered exit kind 1026: runnable task stall (stress[1530] failed to run for 6.841s) Backtrace: scx_watchdog_workfn+0x136/0x1c0 process_scheduled_works+0x2b5/0x600 worker_thread+0x269/0x360 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 QMAP FIFO[0]: QMAP FIFO[1]: QMAP FIFO[2]: 1436 QMAP FIFO[3]: QMAP FIFO[4]: CPU states ---------- CPU 0 : nr_run=1 ops_qseq=244 curr=swapper/0[0] class=idle_sched_class QMAP: dsp_idx=1 dsp_cnt=0 R stress[1530] -6841ms scx_state/flags=3/0x1 ops_state/qseq=2/20 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=0 asm_sysvec_apic_timer_interrupt+0x16/0x20 CPU 2 : nr_run=2 ops_qseq=142 curr=swapper/2[0] class=idle_sched_class QMAP: dsp_idx=1 dsp_cnt=0 R sshd[1703] -5905ms scx_state/flags=3/0x9 ops_state/qseq=2/88 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=1 __x64_sys_ppoll+0xf6/0x120 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e R fish[1539] -4141ms scx_state/flags=3/0x9 ops_state/qseq=2/124 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=1 futex_wait+0x60/0xe0 do_futex+0x109/0x180 __x64_sys_futex+0x117/0x190 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e CPU 3 : nr_run=2 ops_qseq=162 curr=kworker/u32:12[320] class=ext_sched_class QMAP: dsp_idx=1 dsp_cnt=0 *R kworker/u32:12[320] +0ms scx_state/flags=3/0xd ops_state/qseq=0/0 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=0 scx_dump_state+0x613/0x6f0 scx_ops_error_irq_workfn+0x1f/0x40 irq_work_run_list+0x82/0xd0 irq_work_run+0x14/0x30 __sysvec_irq_work+0x40/0x140 sysvec_irq_work+0x60/0x70 asm_sysvec_irq_work+0x16/0x20 scx_watchdog_workfn+0x15f/0x1c0 process_scheduled_works+0x2b5/0x600 worker_thread+0x269/0x360 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 R kworker/3:2[1436] +0ms scx_state/flags=3/0x9 ops_state/qseq=2/160 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=08 QMAP: force_local=0 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 CPU 7 : nr_run=0 ops_qseq=76 curr=swapper/7[0] class=idle_sched_class ================================================================================ EXIT: runnable task stall (stress[1530] failed to run for 6.841s) It shows that CPU 3 was running the watchdog when it triggered the error condition and the scx_qmap thread has been queued on CPU 0 for over 5 seconds but failed to run. It also prints out scx_qmap specific information - e.g. which tasks are queued on each FIFO and so on using the dump_*() ops. This dump has proved pretty useful for developing and debugging BPF schedulers. Debug dump is generated automatically when the BPF scheduler exits due to an error. The debug buffer used in such cases is determined by sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns the available buffer, the output is truncated and marked accordingly. Debug dump output can also be read through the sched_ext_dump tracepoint. When read through the tracepoint, there is no length limit. SysRq-D can be used to trigger debug dump at any time while a BPF scheduler is loaded. This is non-destructive - the scheduler keeps running afterwards. The output can be read through the sched_ext_dump tracepoint. v2: - The size of exit debug dump buffer can now be customized using sched_ext_ops.exit_dump_len. - sched_ext_ops.dump*() added to enable dumping of BPF scheduler specific information. - Tracpoint output and SysRq-D triggering added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com>