summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2026-03-04mm/slab: change stride type from unsigned short to unsigned intHarry Yoo
Commit 7a8e71bc619d ("mm/slab: use stride to access slabobj_ext") defined the type of slab->stride as unsigned short, because the author initially planned to store stride within the lower 16 bits of the page_type field, but later stored it in unused bits in the counters field instead. However, the idea of having only 2-byte stride turned out to be a serious mistake. On systems with 64k pages, order-1 pages are 128k, which is larger than USHRT_MAX. It triggers a debug warning because s->size is 128k while stride, truncated to 2 bytes, becomes zero: ------------[ cut here ]------------ Warning! stride (0) != s->size (131072) WARNING: mm/slub.c:2231 at alloc_slab_obj_exts_early.constprop.0+0x524/0x534, CPU#6: systemd-sysctl/307 Modules linked in: CPU: 6 UID: 0 PID: 307 Comm: systemd-sysctl Not tainted 7.0.0-rc1+ #6 PREEMPTLAZY Hardware name: IBM,9009-22A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW950.E0 (VL950_179) hv:phyp pSeries NIP: c0000000008a9ac0 LR: c0000000008a9abc CTR: 0000000000000000 REGS: c0000000141f7390 TRAP: 0700 Not tainted (7.0.0-rc1+) MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28004400 XER: 00000005 CFAR: c000000000279318 IRQMASK: 0 GPR00: c0000000008a9abc c0000000141f7630 c00000000252a300 c00000001427b200 GPR04: 0000000000000004 0000000000000000 c000000000278fd0 0000000000000000 GPR08: fffffffffffe0000 0000000000000000 0000000000000000 0000000022004400 GPR12: c000000000f644b0 c000000017ff8f00 0000000000000000 0000000000000000 GPR16: 0000000000000000 c0000000141f7aa0 0000000000000000 c0000000141f7a88 GPR20: 0000000000000000 0000000000400cc0 ffffffffffffffff c00000001427b180 GPR24: 0000000000000004 00000000000c0cc0 c000000004e89a20 c00000005de90011 GPR28: 0000000000010010 c00000005df00000 c000000006017f80 c00c000000177a00 NIP [c0000000008a9ac0] alloc_slab_obj_exts_early.constprop.0+0x524/0x534 LR [c0000000008a9abc] alloc_slab_obj_exts_early.constprop.0+0x520/0x534 Call Trace: [c0000000141f7630] [c0000000008a9abc] alloc_slab_obj_exts_early.constprop.0+0x520/0x534 (unreliable) [c0000000141f76c0] [c0000000008aafbc] allocate_slab+0x154/0x94c [c0000000141f7760] [c0000000008b41c0] refill_objects+0x124/0x16c [c0000000141f77c0] [c0000000008b4be0] __pcs_replace_empty_main+0x2b0/0x444 [c0000000141f7810] [c0000000008b9600] __kvmalloc_node_noprof+0x840/0x914 [c0000000141f7900] [c000000000a3dd40] seq_read_iter+0x60c/0xb00 [c0000000141f7a10] [c000000000b36b24] proc_reg_read_iter+0x154/0x1fc [c0000000141f7a50] [c0000000009cee7c] vfs_read+0x39c/0x4e4 [c0000000141f7b30] [c0000000009d0214] ksys_read+0x9c/0x180 [c0000000141f7b90] [c00000000003a8d0] system_call_exception+0x1e0/0x4b0 [c0000000141f7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec This leads to slab_obj_ext() returning the first slabobj_ext or all objects and confuses the reference counting of object cgroups [1] and memory (un)charging for memory cgroups [2]. Fortunately, the counters field has 32 unused bits instead of 16 on 64-bit CPUs, which is wide enough to hold any value of s->size. Change the type to unsigned int. Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/lkml/ca241daa-e7e7-4604-a48d-de91ec9184a5@linux.ibm.com [1] Closes: https://lore.kernel.org/all/ddff7c7d-c0c3-4780-808f-9a83268bbf0c@linux.ibm.com [2] Fixes: 7a8e71bc619d ("mm/slab: use stride to access slabobj_ext") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260303135722.2680521-1-harry.yoo@oracle.com Reviewed-by: Hao Li <hao.li@linux.dev> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-03-04mm/slab: allow sheaf refill if blocking is not allowedVlastimil Babka (SUSE)
Ming Lei reported [1] a regression in the ublk null target benchmark due to sheaves. The profile shows that the alloc_from_pcs() fastpath fails and allocations fall back to ___slab_alloc(). It also shows the allocations happen through mempool_alloc(). The strategy of mempool_alloc() is to call the underlying allocator (here slab) without __GFP_DIRECT_RECLAIM first. This does not play well with __pcs_replace_empty_main() checking for gfpflags_allow_blocking() to decide if it should refill an empty sheaf or fallback to the slowpath, so we end up falling back. We could change the mempool strategy but there might be other paths doing the same ting. So instead allow sheaf refill when blocking is not allowed, changing the condition to gfpflags_allow_spinning(). The original condition was unnecessarily restrictive. Note this doesn't fully resolve the regression [1] as another component of that are memoryless nodes, which is to be addressed separately. Reported-by: Ming Lei <ming.lei@redhat.com> Fixes: e47c897a2949 ("slab: add sheaves to most caches") Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1] Link: https://patch.msgid.link/20260302095536.34062-2-vbabka@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-03-04ata: libata: cancel pending work after clearing deferred_qcNiklas Cassel
Syzbot reported a WARN_ON() in ata_scsi_deferred_qc_work(), caused by ap->ops->qc_defer() returning non-zero before issuing the deferred qc. ata_scsi_schedule_deferred_qc() is called during each command completion. This function will check if there is a deferred QC, and if ap->ops->qc_defer() returns zero, meaning that it is possible to queue the deferred qc at this time (without being deferred), then it will queue the work which will issue the deferred qc. Once the work get to run, which can potentially be a very long time after the work was scheduled, there is a WARN_ON() if ap->ops->qc_defer() returns non-zero. While we hold the ap->lock both when assigning and clearing deferred_qc, and the work itself holds the ap->lock, the code currently does not cancel the work after clearing the deferred qc. This means that the following scenario can happen: 1) One or several NCQ commands are queued. 2) A non-NCQ command is queued, gets stored in ap->deferred_qc. 3) Last NCQ command gets completed, work is queued to issue the deferred qc. 4) Timeout or error happens, ap->deferred_qc is cleared. The queued work is currently NOT canceled. 5) Port is reset. 6) One or several NCQ commands are queued. 7) A non-NCQ command is queued, gets stored in ap->deferred_qc. 8) Work is finally run. Yet at this time, there is still NCQ commands in flight. The work in 8) really belongs to the non-NCQ command in 2), not to the non-NCQ command in 7). The reason why the work is executed when it is not supposed to, is because it was never canceled when ap->deferred_qc was cleared in 4). Thus, ensure that we always cancel the work after clearing ap->deferred_qc. Another potential fix would have been to let ata_scsi_deferred_qc_work() do nothing if ap->ops->qc_defer() returns non-zero. However, canceling the work when clearing ap->deferred_qc seems slightly more logical, as we hold the ap->lock when clearing ap->deferred_qc, so we know that the work cannot be holding the lock. (The function could be waiting for the lock, but that is okay since it will do nothing if ap->deferred_qc is not set.) Reported-by: syzbot+bcaf842a1e8ead8dfb89@syzkaller.appspotmail.com Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation") Fixes: eddb98ad9364 ("ata: libata-eh: correctly handle deferred qc timeouts") Reviewed-by: Igor Pylypiv <ipylypiv@google.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Niklas Cassel <cassel@kernel.org>
2026-03-04drm/sched: Fix kernel-doc warning for drm_sched_job_done()Yujie Liu
There is a kernel-doc warning for the scheduler: Warning: drivers/gpu/drm/scheduler/sched_main.c:367 function parameter 'result' not described in 'drm_sched_job_done' Fix the warning by describing the undocumented error code. Fixes: 539f9ee4b52a ("drm/scheduler: properly forward fence errors") Signed-off-by: Yujie Liu <yujie.liu@intel.com> [phasta: Flesh out commit message] Signed-off-by: Philipp Stanner <phasta@kernel.org> Link: https://patch.msgid.link/20260227082452.1802922-1-yujie.liu@intel.com
2026-03-04xfs: fix race between healthmon unmount and read_iterDarrick J. Wong
xfs/1879 on one of my test VMs got stuck due to the xfs_io healthmon subcommand sleeping in wait_event_interruptible at: xfs_healthmon_read_iter+0x558/0x5f8 [xfs] vfs_read+0x248/0x320 ksys_read+0x78/0x120 Looking at xfs_healthmon_read_iter, in !O_NONBLOCK mode it will sleep until the mount cookie == DETACHED_MOUNT_COOKIE, there are events waiting to be formatted, or there are formatted events in the read buffer that could be copied to userspace. Poking into the running kernel, I see that there are zero events in the list, the read buffer is empty, and the mount cookie is indeed in DETACHED state. IOWs, xfs_healthmon_has_eventdata should have returned true, but instead we're asleep waiting for a wakeup. I think what happened here is that xfs_healthmon_read_iter and xfs_healthmon_unmount were racing with each other, and _read_iter lost the race. _unmount queued an unmount event, which woke up _read_iter. It found, formatted, and copied the event out to userspace. That cleared out the pending event list and emptied the read buffer. xfs_io then called read() again, so _has_eventdata decided that we should sleep on the empty event queue. Next, _unmount called xfs_healthmon_detach, which set the mount cookie to DETACHED. Unfortunately, it didn't call wake_up_all on the hm, so the wait_event_interruptible in the _read_iter thread remains asleep. That's why the test stalled. Fix this by moving the wake_up_all call to xfs_healthmon_detach. Fixes: b3a289a2a9397b ("xfs: create event queuing, formatting, and discovery infrastructure") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-04xfs: remove scratch field from struct xfs_gc_bioDamien Le Moal
The scratch field in struct xfs_gc_bio is unused. Remove it. Fixes: 102f444b57b3 ("xfs: rework zone GC buffer management") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-04ata: libata-core: Disable LPM on ST1000DM010-2EP102Maximilian Pezzullo
According to a user report, the ST1000DM010-2EP102 has problems with LPM, causing random system freezes. The drive belongs to the same BarraCuda family as the ST2000DM008-2FR102 which has the same issue. Cc: stable@vger.kernel.org Fixes: 7627a0edef54 ("ata: ahci: Drop low power policy board type") Reported-by: Filippo Baiamonte <filippo.ba03@bugzilla.kernel.org> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221163 Signed-off-by: Maximilian Pezzullo <maximilianpezzullo@gmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Niklas Cassel <cassel@kernel.org>
2026-03-04power: sequencing: pcie-m2: Fix device node reference leak in probeFelix Gu
In pwrseq_pcie_m2_probe(), ctx->of_node acquires an explicit reference to the device node using of_node_get(), but there is no corresponding of_node_put() in the driver's error handling paths or removal. Since the ctx is tied to the lifecycle of the platform device, there is no need to hold an additional reference to the device's own of_node. Fixes: 52e7b5bd62ba ("power: sequencing: Add the Power Sequencing driver for the PCIe M.2 connectors") Signed-off-by: Felix Gu <ustc.gu@gmail.com> Link: https://patch.msgid.link/20260302-m2-v1-1-a6533e18aa69@gmail.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
2026-03-04RDMA/ionic: Preserve and set Ethernet source MAC after ib_ud_header_init()Abhijit Gangurde
ionic_build_hdr() populated the Ethernet source MAC (hdr->eth.smac_h) by passing the header’s storage directly to rdma_read_gid_l2_fields(). However, ib_ud_header_init() is called after that and re-initializes the UD header, which wipes the previously written smac_h. As a result, packets are emitted with an zero source MAC address on the wire. Correct the source MAC by reading the GID-derived smac into a temporary buffer and copy it after ib_ud_header_init() completes. Fixes: e8521822c733 ("RDMA/ionic: Register device ops for control path") Cc: stable@vger.kernel.org # 6.18 Signed-off-by: Abhijit Gangurde <abhijit.gangurde@amd.com> Link: https://patch.msgid.link/20260227061809.2979990-1-abhijit.gangurde@amd.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-04RDMA/irdma: Fix double free related to rereg_user_mrJacob Moroni
If IB_MR_REREG_TRANS is set during rereg_user_mr, the umem will be released and a new one will be allocated in irdma_rereg_mr_trans. If any step of irdma_rereg_mr_trans fails after the new umem is allocated, it releases the umem, but does not set iwmr->region to NULL. The problem is that this failure is propagated to the user, who will then call ibv_dereg_mr (as they should). Then, the dereg_mr path will see a non-NULL umem and attempt to call ib_umem_release again. Fix this by setting iwmr->region to NULL after ib_umem_release. Fixed: 5ac388db27c4 ("RDMA/irdma: Add support to re-register a memory region") Signed-off-by: Jacob Moroni <jmoroni@google.com> Link: https://patch.msgid.link/20260227152743.1183388-1-jmoroni@google.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-03-04selftests: tc-testing: fix list_categories() crash on list typeNaveen Anandhan
list_categories() builds a set directly from the 'category' field of each test case. Since 'category' is a list, set(map(...)) attempts to insert lists into a set, which raises: TypeError: unhashable type: 'list' Flatten category lists and collect unique category names using set.update() instead. Signed-off-by: Naveen Anandhan <mr.navi8680@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2026-03-04powerpc/crash: adjust the elfcorehdr sizeSourabh Jain
With crash hotplug support enabled, additional memory is allocated to the elfcorehdr kexec segment to accommodate resources added during memory hotplug events. However, the kdump FDT is not updated with the same size, which can result in elfcorehdr corruption in the kdump kernel. Update elf_headers_sz (the kimage member representing the size of the elfcorehdr kexec segment) to reflect the total memory allocated for the elfcorehdr segment instead of the elfcorehdr buffer size at the time of kdump load. This allows of_kexec_alloc_and_setup_fdt() to reserve the full elfcorehdr memory in the kdump FDT and prevents elfcorehdr corruption. Fixes: 849599b702ef8 ("powerpc/crash: add crash memory hotplug support") Reviewed-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20260227171801.2238847-1-sourabhjain@linux.ibm.com
2026-03-04powerpc/kexec/core: use big-endian types for crash variablesSourabh Jain
Use explicit word-sized big-endian types for kexec and crash related variables. This makes the endianness unambiguous and avoids type mismatches that trigger sparse warnings. The change addresses sparse warnings like below (seen on both 32-bit and 64-bit builds): CHECK ../arch/powerpc/kexec/core.c sparse: expected unsigned int static [addressable] [toplevel] [usertype] crashk_base sparse: got restricted __be32 [usertype] sparse: warning: incorrect type in assignment (different base types) sparse: expected unsigned int static [addressable] [toplevel] [usertype] crashk_size sparse: got restricted __be32 [usertype] sparse: warning: incorrect type in assignment (different base types) sparse: expected unsigned long long static [addressable] [toplevel] mem_limit sparse: got restricted __be32 [usertype] sparse: warning: incorrect type in assignment (different base types) sparse: expected unsigned int static [addressable] [toplevel] [usertype] kernel_end sparse: got restricted __be32 [usertype] No functional change intended. Fixes: ea961a828fe7 ("powerpc: Fix endian issues in kexec and crash dump code") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512221405.VHPKPjnp-lkp@intel.com/ Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251224151257.28672-1-sourabhjain@linux.ibm.com
2026-03-04powerpc/prom_init: Fixup missing #size-cells on PowerMac media-bay nodesRob Herring (Arm)
Similar to other PowerMac mac-io devices, the media-bay node is missing the "#size-cells" property. Depends-on: commit 045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells handling") Reported-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251029174047.1620073-1-robh@kernel.org
2026-03-04powerpc: dts: fsl: Drop unused .dtsi filesRob Herring (Arm)
These files are not included by anything and therefore don't get built or tested. There's also no upstream driver for the interlaken-lac stuff. Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20260128140222.1627203-1-robh@kernel.org
2026-03-04powerpc/uaccess: Fix inline assembly for clang build on PPC32Christophe Leroy (CS GROUP)
Test robot reports the following error with clang-16.0.6: In file included from kernel/rseq.c:75: include/linux/rseq_entry.h:141:3: error: invalid operand for instruction unsafe_get_user(offset, &ucs->post_commit_offset, efault); ^ include/linux/uaccess.h:608:2: note: expanded from macro 'unsafe_get_user' arch_unsafe_get_user(x, ptr, local_label); \ ^ arch/powerpc/include/asm/uaccess.h:518:2: note: expanded from macro 'arch_unsafe_get_user' __get_user_size_goto(__gu_val, __gu_addr, sizeof(*(p)), e); \ ^ arch/powerpc/include/asm/uaccess.h:284:2: note: expanded from macro '__get_user_size_goto' __get_user_size_allowed(x, ptr, size, __gus_retval); \ ^ arch/powerpc/include/asm/uaccess.h:275:10: note: expanded from macro '__get_user_size_allowed' case 8: __get_user_asm2(x, (u64 __user *)ptr, retval); break; \ ^ arch/powerpc/include/asm/uaccess.h:258:4: note: expanded from macro '__get_user_asm2' " li %1+1,0\n" \ ^ <inline asm>:7:5: note: instantiated into assembly here li 31+1,0 ^ 1 error generated. On PPC32, for 64 bits vars a pair of registers is used. Usually the lower register in the pair is the high part and the higher register is the low part. GCC uses r3/r4 ... r11/r12 ... r14/r15 ... r30/r31 In older kernel code inline assembly was using %1 and %1+1 to represent 64 bits values. However here it looks like clang uses r31 as high part, allthough r32 doesn't exist hence the error. Allthoug %1+1 should work, most places now use %L1 instead of %1+1, so let's do the same here. With that change, the build doesn't fail anymore and a disassembly shows clang uses r17/r18 and r31/r14 pair when GCC would have used r16/r17 and r30/r31: Disassembly of section .fixup: 00000000 <.fixup>: 0: 38 a0 ff f2 li r5,-14 4: 3a 20 00 00 li r17,0 8: 3a 40 00 00 li r18,0 c: 48 00 00 00 b c <.fixup+0xc> c: R_PPC_REL24 .text+0xbc 10: 38 a0 ff f2 li r5,-14 14: 3b e0 00 00 li r31,0 18: 39 c0 00 00 li r14,0 1c: 48 00 00 00 b 1c <.fixup+0x1c> 1c: R_PPC_REL24 .text+0x144 Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602021825.otcItxGi-lkp@intel.com/ Fixes: c20beffeec3c ("powerpc/uaccess: Use flexible addressing with __put_user()/__get_user()") Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Acked-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/8ca3a657a650e497a96bfe7acde2f637dadab344.1770103646.git.chleroy@kernel.org
2026-03-04powerpc/e500: Always use 64 bits PTEChristophe Leroy
Today there are two PTE formats for e500: - The 64 bits format, used - On 64 bits kernel - On 32 bits kernel with 64 bits physical addresses - On 32 bits kernel with support of huge pages - The 32 bits format, used in other cases Maintaining two PTE formats means unnecessary maintenance burden because every change needs to be implemented and tested for both formats. Remove the 32 bits PTE format. The memory usage increase due to larger PTEs is minimal (approx. 0,1% of memory). This also means that from now on huge pages are supported also with 32 bits physical addresses. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/04a658209ea78dcc0f3dbde6b2c29cf1939adfe9.1767721208.git.chleroy@kernel.org
2026-03-03tracing: Fix WARN_ON in tracing_buffers_mmap_closeQing Wang
When a process forks, the child process copies the parent's VMAs but the user_mapped reference count is not incremented. As a result, when both the parent and child processes exit, tracing_buffers_mmap_close() is called twice. On the second call, user_mapped is already 0, causing the function to return -ENODEV and triggering a WARN_ON. Normally, this isn't an issue as the memory is mapped with VM_DONTCOPY set. But this is only a hint, and the application can call madvise(MADVISE_DOFORK) which resets the VM_DONTCOPY flag. When the application does that, it can trigger this issue on fork. Fix it by incrementing the user_mapped reference count without re-mapping the pages in the VMA's open callback. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://patch.msgid.link/20260227025842.1085206-1-wangqing7171@gmail.com Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Reported-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3b5dd2030fe08afdf65d Tested-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03tracing: Disable preemption in the tracepoint callbacks handling filtered pidsMasami Hiramatsu (Google)
Filtering PIDs for events triggered the following during selftests: [37] event tracing - restricts events based on pid notrace filtering [ 155.874095] [ 155.874869] ============================= [ 155.876037] WARNING: suspicious RCU usage [ 155.877287] 7.0.0-rc1-00004-g8cd473a19bc7 #7 Not tainted [ 155.879263] ----------------------------- [ 155.882839] kernel/trace/trace_events.c:1057 suspicious rcu_dereference_check() usage! [ 155.889281] [ 155.889281] other info that might help us debug this: [ 155.889281] [ 155.894519] [ 155.894519] rcu_scheduler_active = 2, debug_locks = 1 [ 155.898068] no locks held by ftracetest/4364. [ 155.900524] [ 155.900524] stack backtrace: [ 155.902645] CPU: 1 UID: 0 PID: 4364 Comm: ftracetest Not tainted 7.0.0-rc1-00004-g8cd473a19bc7 #7 PREEMPT(lazy) [ 155.902648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 [ 155.902651] Call Trace: [ 155.902655] <TASK> [ 155.902659] dump_stack_lvl+0x67/0x90 [ 155.902665] lockdep_rcu_suspicious+0x154/0x1a0 [ 155.902672] event_filter_pid_sched_process_fork+0x9a/0xd0 [ 155.902678] kernel_clone+0x367/0x3a0 [ 155.902689] __x64_sys_clone+0x116/0x140 [ 155.902696] do_syscall_64+0x158/0x460 [ 155.902700] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 155.902702] ? trace_irq_disable+0x1d/0xc0 [ 155.902709] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 155.902711] RIP: 0033:0x4697c3 [ 155.902716] Code: 1f 84 00 00 00 00 00 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75 2c 64 48 8b 04 25 10 00 00 [ 155.902718] RSP: 002b:00007ffc41150428 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 [ 155.902721] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004697c3 [ 155.902722] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 [ 155.902724] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000003fccf990 [ 155.902725] R10: 000000003fccd690 R11: 0000000000000246 R12: 0000000000000001 [ 155.902726] R13: 000000003fce8103 R14: 0000000000000001 R15: 0000000000000000 [ 155.902733] </TASK> [ 155.902747] The tracepoint callbacks recently were changed to allow preemption. The event PID filtering callbacks that were attached to the fork and exit tracepoints expected preemption disabled in order to access the RCU protected PID lists. Add a guard(preempt)() to protect the references to the PID list. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260303215738.6ab275af@fedora Fixes: a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") Link: https://patch.msgid.link/20260303131706.96057f61a48a34c43ce1e396@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03ftrace: Disable preemption in the tracepoint callbacks handling filtered pidsSteven Rostedt
When function trace PID filtering is enabled, the function tracer will attach a callback to the fork tracepoint as well as the exit tracepoint that will add the forked child PID to the PID filtering list as well as remove the PID that is exiting. Commit a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") removed the disabling of preemption when calling tracepoint callbacks. The callbacks used for the PID filtering accounting depended on preemption being disabled, and now the trigger a "suspicious RCU usage" warning message. Make them explicitly disable preemption. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260302213546.156e3e4f@gandalf.local.home Fixes: a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-03-03tracing: Fix syscall events activation by ensuring refcount hits zeroHuiwen He
When multiple syscall events are specified in the kernel command line (e.g., trace_event=syscalls:sys_enter_openat,syscalls:sys_enter_close), they are often not captured after boot, even though they appear enabled in the tracing/set_event file. The issue stems from how syscall events are initialized. Syscall tracepoints require the global reference count (sys_tracepoint_refcount) to transition from 0 to 1 to trigger the registration of the syscall work (TIF_SYSCALL_TRACEPOINT) for tasks, including the init process (pid 1). The current implementation of early_enable_events() with disable_first=true used an interleaved sequence of "Disable A -> Enable A -> Disable B -> Enable B". If multiple syscalls are enabled, the refcount never drops to zero, preventing the 0->1 transition that triggers actual registration. Fix this by splitting early_enable_events() into two distinct phases: 1. Disable all events specified in the buffer. 2. Enable all events specified in the buffer. This ensures the refcount hits zero before re-enabling, allowing syscall events to be properly activated during early boot. The code is also refactored to use a helper function to avoid logic duplication between the disable and enable phases. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260224023544.1250787-1-hehuiwen@kylinos.cn Fixes: ce1039bd3a89 ("tracing: Fix enabling of syscall events on the command line") Signed-off-by: Huiwen He <hehuiwen@kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03fgraph: Fix thresh_return nosleeptime double-adjustShengming Hu
trace_graph_thresh_return() called handle_nosleeptime() and then delegated to trace_graph_return(), which calls handle_nosleeptime() again. When sleep-time accounting is disabled this double-adjusts calltime and can produce bogus durations (including underflow). Fix this by computing rettime once, applying handle_nosleeptime() only once, using the adjusted calltime for threshold comparison, and writing the return event directly via __trace_graph_return() when the threshold is met. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260221113314048jE4VRwIyZEALiYByGK0My@zte.com.cn Fixes: 3c9880f3ab52b ("ftrace: Use a running sleeptime instead of saving on shadow stack") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03fgraph: Fix thresh_return clear per-task notraceShengming Hu
When tracing_thresh is enabled, function graph tracing uses trace_graph_thresh_return() as the return handler. Unlike trace_graph_return(), it did not clear the per-task TRACE_GRAPH_NOTRACE flag set by the entry handler for set_graph_notrace addresses. This could leave the task permanently in "notrace" state and effectively disable function graph tracing for that task. Mirror trace_graph_return()'s per-task notrace handling by clearing TRACE_GRAPH_NOTRACE and returning early when set. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260221113007819YgrZsMGABff4Rc-O_fZxL@zte.com.cn Fixes: b84214890a9bc ("function_graph: Move graph notrace bit to shadow stack global var") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03smb: client: Compare MACs in constant timeEric Biggers
To prevent timing attacks, MAC comparisons need to be constant-time. Replace the memcmp() with the correct function, crypto_memneq(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2026-03-03io_uring/mock: Fix typo in help textJ. Neuschäfer
Fix the spelling of "subsystem". Signed-off-by: J. Neuschäfer <j.ne@posteo.net> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-03net/tcp-md5: Fix MAC comparison to be constant-timeEric Biggers
To prevent timing attacks, MACs need to be compared in constant time. Use the appropriate helper function for this. Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.") Fixes: 658ddaaf6694 ("tcp: md5: RST: getting md5 key from listener") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20260302203409.13388-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-03MAINTAINERS: update the skge/sky2 maintainersStephen Hemminger
Mark the skge and sky2 drivers as orphan. I no longer have any Marvell/SysKonnect boards to test with and mail to Mirko Lindner bounced because Marvell sold off that divsion. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://patch.msgid.link/20260302195120.187183-1-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-03amd-xgbe: fix sleep while atomic on suspend/resumeRaju Rangoju
The xgbe_powerdown() and xgbe_powerup() functions use spinlocks (spin_lock_irqsave) while calling functions that may sleep: - napi_disable() can sleep waiting for NAPI polling to complete - flush_workqueue() can sleep waiting for pending work items This causes a "BUG: scheduling while atomic" error during suspend/resume cycles on systems using the AMD XGBE Ethernet controller. The spinlock protection in these functions is unnecessary as these functions are called from suspend/resume paths which are already serialized by the PM core Fix this by removing the spinlock. Since only code that takes this lock is xgbe_powerdown() and xgbe_powerup(), remove it completely. Fixes: c5aa9e3b8156 ("amd-xgbe: Initial AMD 10GbE platform driver") Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Link: https://patch.msgid.link/20260302042124.1386445-1-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-03netconsole: fix sysdata_release_enabled_show checking wrong flagBreno Leitao
sysdata_release_enabled_show() checks SYSDATA_TASKNAME instead of SYSDATA_RELEASE, causing the configfs release_enabled attribute to reflect the taskname feature state rather than the release feature state. This is a copy-paste error from the adjacent sysdata_taskname_enabled_show() function. The corresponding _store function already uses the correct SYSDATA_RELEASE flag. Fixes: 343f90227070 ("netconsole: implement configfs for release_enabled") Signed-off-by: Breno Leitao <leitao@debian.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260302-sysdata_release_fix-v1-1-e5090f677c7c@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-03net: ipv4: fix ARM64 alignment fault in multipath hash seedYung Chih Su
`struct sysctl_fib_multipath_hash_seed` contains two u32 fields (user_seed and mp_seed), making it an 8-byte structure with a 4-byte alignment requirement. In `fib_multipath_hash_from_keys()`, the code evaluates the entire struct atomically via `READ_ONCE()`: mp_seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).mp_seed; While this silently works on GCC by falling back to unaligned regular loads which the ARM64 kernel tolerates, it causes a fatal kernel panic when compiled with Clang and LTO enabled. Commit e35123d83ee3 ("arm64: lto: Strengthen READ_ONCE() to acquire when CONFIG_LTO=y") strengthens `READ_ONCE()` to use Load-Acquire instructions (`ldar` / `ldapr`) to prevent compiler reordering bugs under Clang LTO. Since the macro evaluates the full 8-byte struct, Clang emits a 64-bit `ldar` instruction. ARM64 architecture strictly requires `ldar` to be naturally aligned, thus executing it on a 4-byte aligned address triggers a strict Alignment Fault (FSC = 0x21). Fix the read side by moving the `READ_ONCE()` directly to the `u32` member, which emits a safe 32-bit `ldar Wn`. Furthermore, Eric Dumazet pointed out that `WRITE_ONCE()` on the entire struct in `proc_fib_multipath_hash_set_seed()` is also flawed. Analysis shows that Clang splits this 8-byte write into two separate 32-bit `str` instructions. While this avoids an alignment fault, it destroys atomicity and exposes a tear-write vulnerability. Fix this by explicitly splitting the write into two 32-bit `WRITE_ONCE()` operations. Finally, add the missing `READ_ONCE()` when reading `user_seed` in `proc_fib_multipath_hash_seed()` to ensure proper pairing and concurrency safety. Fixes: 4ee2a8cace3f ("net: ipv4: Add a sysctl to set multipath hash seed") Signed-off-by: Yung Chih Su <yuuchihsu@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260302060247.7066-1-yuuchihsu@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-03net/tcp-ao: Fix MAC comparison to be constant-timeEric Biggers
To prevent timing attacks, MACs need to be compared in constant time. Use the appropriate helper function for this. Fixes: 0a3a809089eb ("net/tcp: Verify inbound TCP-AO signed segments") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20260302203600.13561-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-03ipv6: fix NULL pointer deref in ip6_rt_get_dev_rcu()Jakub Kicinski
l3mdev_master_dev_rcu() can return NULL when the slave device is being un-slaved from a VRF. All other callers deal with this, but we lost the fallback to loopback in ip6_rt_pcpu_alloc() -> ip6_rt_get_dev_rcu() with commit 4832c30d5458 ("net: ipv6: put host and anycast routes on device with address"). KASAN: null-ptr-deref in range [0x0000000000000108-0x000000000000010f] RIP: 0010:ip6_rt_pcpu_alloc (net/ipv6/route.c:1418) Call Trace: ip6_pol_route (net/ipv6/route.c:2318) fib6_rule_lookup (net/ipv6/fib6_rules.c:115) ip6_route_output_flags (net/ipv6/route.c:2607) vrf_process_v6_outbound (drivers/net/vrf.c:437) I was tempted to rework the un-slaving code to clear the flag first and insert synchronize_rcu() before we remove the upper. But looks like the explicit fallback to loopback_dev is an established pattern. And I guess avoiding the synchronize_rcu() is nice, too. Fixes: 4832c30d5458 ("net: ipv6: put host and anycast routes on device with address") Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260301194548.927324-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-04rust: str: make NullTerminatedFormatter publicAlexandre Courbot
If `CONFIG_BLOCK` is disabled, the following warnings are displayed during build: warning: struct `NullTerminatedFormatter` is never constructed --> ../rust/kernel/str.rs:667:19 | 667 | pub(crate) struct NullTerminatedFormatter<'a> { | ^^^^^^^^^^^^^^^^^^^^^^^ | = note: `#[warn(dead_code)]` (part of `#[warn(unused)]`) on by default warning: associated function `new` is never used --> ../rust/kernel/str.rs:673:19 | 671 | impl<'a> NullTerminatedFormatter<'a> { | ------------------------------------ associated function in this implementation 672 | /// Create a new [`Self`] instance. 673 | pub(crate) fn new(buffer: &'a mut [u8]) -> Option<NullTerminatedFormatter<'a>> { Fix them by making `NullTerminatedFormatter` public, as it could be useful for drivers anyway. Fixes: cdde7a1951ff ("rust: str: introduce `NullTerminatedFormatter`") Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260224-nullterminatedformatter-v1-1-5bef7b9b3d4c@nvidia.com Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2026-03-03smb/client: remove unused SMB311_posix_query_info()ZhangGuoDong
It is currently unused, as now we are doing compounding instead (see smb2_query_path_info()). Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Steve French <stfrench@microsoft.com>
2026-03-03smb/client: fix buffer size for smb311_posix_qinfo in SMB311_posix_query_info()ZhangGuoDong
SMB311_posix_query_info() is currently unused, but it may still be used in some stable versions, so these changes are submitted as a separate patch. Use `sizeof(struct smb311_posix_qinfo)` instead of sizeof its pointer, so the allocated buffer matches the actual struct size. Fixes: b1bc1874b885 ("smb311: Add support for SMB311 query info (non-compounded)") Reported-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Steve French <stfrench@microsoft.com>
2026-03-03smb/client: fix buffer size for smb311_posix_qinfo in smb2_compound_op()ZhangGuoDong
Use `sizeof(struct smb311_posix_qinfo)` instead of sizeof its pointer, so the allocated buffer matches the actual struct size. Fixes: 6a5f6592a0b6 ("SMB311: Add support for query info using posix extensions (level 100)") Reported-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Steve French <stfrench@microsoft.com>
2026-03-03bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shimLang Xu
The root cause of this bug is that when 'bpf_link_put' reduces the refcount of 'shim_link->link.link' to zero, the resource is considered released but may still be referenced via 'tr->progs_hlist' in 'cgroup_shim_find'. The actual cleanup of 'tr->progs_hlist' in 'bpf_shim_tramp_link_release' is deferred. During this window, another process can cause a use-after-free via 'bpf_trampoline_link_cgroup_shim'. Based on Martin KaFai Lau's suggestions, I have created a simple patch. To fix this: Add an atomic non-zero check in 'bpf_trampoline_link_cgroup_shim'. Only increment the refcount if it is not already zero. Testing: I verified the fix by adding a delay in 'bpf_shim_tramp_link_release' to make the bug easier to trigger: static void bpf_shim_tramp_link_release(struct bpf_link *link) { /* ... */ if (!shim_link->trampoline) return; + msleep(100); WARN_ON_ONCE(bpf_trampoline_unlink_prog(&shim_link->link, shim_link->trampoline, NULL)); bpf_trampoline_put(shim_link->trampoline); } Before the patch, running a PoC easily reproduced the crash(almost 100%) with a call trace similar to KaiyanM's report. After the patch, the bug no longer occurs even after millions of iterations. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Closes: https://lore.kernel.org/bpf/3c4ebb0b.46ff8.19abab8abe2.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Lang Xu <xulang@uniontech.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/279EEE1BA1DDB49D+20260303095217.34436-1-xulang@uniontech.com
2026-03-03Merge tag 'cgroup-for-7.0-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix circular locking dependency in cpuset partition code by deferring housekeeping_update() calls to a workqueue instead of calling them directly under cpus_read_lock - Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when generate_sched_domains() returns NULL due to kmalloc failure - Fix incorrect cpuset behavior for effective_xcpus in partition_xcpus_del() and cpuset_update_tasks_cpumask() in update_cpumasks_hier() - Fix race between task migration and cgroup iteration * tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed cgroup/cpuset: Clarify exclusion rules for cpuset internal variables cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier() cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() cgroup: fix race between task migration and iteration
2026-03-03Merge tag 'sched_ext-for-7.0-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix starvation of scx_enable() under fair-class saturation by offloading the enable path to an RT kthread - Fix out-of-bounds access in idle mask initialization on systems with non-contiguous NUMA node IDs - Fix a preemption window during scheduler exit and a refcount underflow in cgroup init error path - Fix SCX_EFLAG_INITIALIZED being a no-op flag - Add READ_ONCE() annotations for KCSAN-clean lockless accesses and replace naked scx_root dereferences with container_of() in kobject callbacks - Tooling and selftest fixes: compilation issues with clang 17, strtoul() misuse, unused options cleanup, and Kconfig sync * tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix starvation of scx_enable() under fair-class saturation sched_ext: Remove redundant css_put() in scx_cgroup_init() selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17 selftests/sched_ext: Add -fms-extensions to bpf build flags tools/sched_ext: Add -fms-extensions to bpf build flags sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout sched_ext: Replace naked scx_root dereferences in kobject callbacks sched_ext: Use READ_ONCE() for the read side of dsq->nr update tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq() sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag sched_ext: Fix out-of-bounds access in scx_idle_init_masks() sched_ext: Disable preemption between scx_claim_exit() and kicking helper work tools/sched_ext: Add Kconfig to sync with upstream tools/sched_ext: Sync README.md Kconfig with upstream scx selftests/sched_ext: Remove duplicated unistd.h include in rt_stall.c tools/sched_ext: scx_sdt: Remove unused '-f' option tools/sched_ext: scx_central: Remove unused '-p' option selftests/sched_ext: Fix unused-result warning for read() selftests/sched_ext: Abort test loop on signal
2026-03-03sched_ext: Fix starvation of scx_enable() under fair-class saturationTejun Heo
During scx_enable(), the READY -> ENABLED task switching loop changes the calling thread's sched_class from fair to ext. Since fair has higher priority than ext, saturating fair-class workloads can indefinitely starve the enable thread, hanging the system. This was introduced when the enable path switched from preempt_disable() to scx_bypass() which doesn't protect against fair-class starvation. Note that the original preempt_disable() protection wasn't complete either - in partial switch modes, the calling thread could still be starved after preempt_enable() as it may have been switched to ext class. Fix it by offloading the enable body to a dedicated system-wide RT (SCHED_FIFO) kthread which cannot be starved by either fair or ext class tasks. scx_enable() lazily creates the kthread on first use and passes the ops pointer through a struct scx_enable_cmd containing the kthread_work, then synchronously waits for completion. The workfn runs on a different kthread from sch->helper (which runs disable_work), so it can safely flush disable_work on the error path without deadlock. Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-03igc: Fix trigger of incorrect irq in igc_xsk_wakeup functionVivek Behera
This patch addresses the issue where the igc_xsk_wakeup function was triggering an incorrect IRQ for tx-0 when the i226 is configured with only 2 combined queues or in an environment with 2 active CPU cores. This prevented XDP Zero-copy send functionality in such split IRQ configurations. The fix implements the correct logic for extracting q_vectors saved during rx and tx ring allocation and utilizes flags provided by the ndo_xsk_wakeup API to trigger the appropriate IRQ. Fixes: fc9df2a0b520 ("igc: Enable RX via AF_XDP zero-copy") Fixes: 15fd021bc427 ("igc: Add Tx hardware timestamp request for AF_XDP zero-copy packet") Signed-off-by: Vivek Behera <vivek.behera@siemens.com> Reviewed-by: Jacob Keller <jacob.keller@intel.com> Reviewed-by: Aleksandr loktinov <aleksandr.loktionov@intel.com> Reviewed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com> Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Tested-by: Avigail Dahan <avigailx.dahan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-03igb: Fix trigger of incorrect irq in igb_xsk_wakeupVivek Behera
The current implementation in the igb_xsk_wakeup expects the Rx and Tx queues to share the same irq. This would lead to triggering of incorrect irq in split irq configuration. This patch addresses this issue which could impact environments with 2 active cpu cores or when the number of queues is reduced to 2 or less cat /proc/interrupts | grep eno2 167: 0 0 0 0 IR-PCI-MSIX-0000:08:00.0 0-edge eno2 168: 0 0 0 0 IR-PCI-MSIX-0000:08:00.0 1-edge eno2-rx-0 169: 0 0 0 0 IR-PCI-MSIX-0000:08:00.0 2-edge eno2-rx-1 170: 0 0 0 0 IR-PCI-MSIX-0000:08:00.0 3-edge eno2-tx-0 171: 0 0 0 0 IR-PCI-MSIX-0000:08:00.0 4-edge eno2-tx-1 Furthermore it uses the flags input argument to trigger either rx, tx or both rx and tx irqs as specified in the ndo_xsk_wakeup api documentation Fixes: 80f6ccf9f116 ("igb: Introduce XSK data structures and helpers") Signed-off-by: Vivek Behera <vivek.behera@siemens.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-03iavf: fix netdev->max_mtu to respect actual hardware limitKohei Enju
iavf sets LIBIE_MAX_MTU as netdev->max_mtu, ignoring vf_res->max_mtu from PF [1]. This allows setting an MTU beyond the actual hardware limit, causing TX queue timeouts [2]. Set correct netdev->max_mtu using vf_res->max_mtu from the PF. Note that currently PF drivers such as ice/i40e set the frame size in vf_res->max_mtu, not MTU. Convert vf_res->max_mtu to MTU before setting netdev->max_mtu. [1] # ip -j -d link show $DEV | jq '.[0].max_mtu' 16356 [2] iavf 0000:00:05.0 enp0s5: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5692 ms iavf 0000:00:05.0 enp0s5: NIC Link is Up Speed is 10 Gbps Full Duplex iavf 0000:00:05.0 enp0s5: NETDEV WATCHDOG: CPU: 6: transmit queue 3 timed out 5312 ms iavf 0000:00:05.0 enp0s5: NIC Link is Up Speed is 10 Gbps Full Duplex ... Fixes: 5fa4caff59f2 ("iavf: switch to Page Pool") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-03libie: don't unroll if fwlog isn't supportedMichal Swiatkowski
The libie_fwlog_deinit() function can be called during driver unload even when firmware logging was never properly initialized. This led to call trace: [ 148.576156] Oops: Oops: 0000 [#1] SMP NOPTI [ 148.576167] CPU: 80 UID: 0 PID: 12843 Comm: rmmod Kdump: loaded Not tainted 6.17.0-rc7next-queue-3oct-01915-g06d79d51cf51 #1 PREEMPT(full) [ 148.576177] Hardware name: HPE ProLiant DL385 Gen10 Plus/ProLiant DL385 Gen10 Plus, BIOS A42 07/18/2020 [ 148.576182] RIP: 0010:__dev_printk+0x16/0x70 [ 148.576196] Code: 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 55 41 54 49 89 d4 55 48 89 fd 53 48 85 f6 74 3c <4c> 8b 6e 50 48 89 f3 4d 85 ed 75 03 4c 8b 2e 48 89 df e8 f3 27 98 [ 148.576204] RSP: 0018:ffffd2fd7ea17a48 EFLAGS: 00010202 [ 148.576211] RAX: ffffd2fd7ea17aa0 RBX: ffff8eb288ae2000 RCX: 0000000000000000 [ 148.576217] RDX: ffffd2fd7ea17a70 RSI: 00000000000000c8 RDI: ffffffffb68d3d88 [ 148.576222] RBP: ffffffffb68d3d88 R08: 0000000000000000 R09: 0000000000000000 [ 148.576227] R10: 00000000000000c8 R11: ffff8eb2b1a49400 R12: ffffd2fd7ea17a70 [ 148.576231] R13: ffff8eb3141fb000 R14: ffffffffc1215b48 R15: ffffffffc1215bd8 [ 148.576236] FS: 00007f5666ba6740(0000) GS:ffff8eb2472b9000(0000) knlGS:0000000000000000 [ 148.576242] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 148.576247] CR2: 0000000000000118 CR3: 000000011ad17000 CR4: 0000000000350ef0 [ 148.576252] Call Trace: [ 148.576258] <TASK> [ 148.576269] _dev_warn+0x7c/0x96 [ 148.576290] libie_fwlog_deinit+0x112/0x117 [libie_fwlog] [ 148.576303] ixgbe_remove+0x63/0x290 [ixgbe] [ 148.576342] pci_device_remove+0x42/0xb0 [ 148.576354] device_release_driver_internal+0x19c/0x200 [ 148.576365] driver_detach+0x48/0x90 [ 148.576372] bus_remove_driver+0x6d/0xf0 [ 148.576383] pci_unregister_driver+0x2e/0xb0 [ 148.576393] ixgbe_exit_module+0x1c/0xd50 [ixgbe] [ 148.576430] __do_sys_delete_module.isra.0+0x1bc/0x2e0 [ 148.576446] do_syscall_64+0x7f/0x980 It can be reproduced by trying to unload ixgbe driver in recovery mode. Fix that by checking if fwlog is supported before doing unroll. Fixes: 641585bc978e ("ixgbe: fwlog support for e610") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-03ice: Fix memory leak in ice_set_ringparam()Zilin Guan
In ice_set_ringparam, tx_rings and xdp_rings are allocated before rx_rings. If the allocation of rx_rings fails, the code jumps to the done label leaking both tx_rings and xdp_rings. Furthermore, if the setup of an individual Rx ring fails during the loop, the code jumps to the free_tx label which releases tx_rings but leaks xdp_rings. Fix this by introducing a free_xdp label and updating the error paths to ensure both xdp_rings and tx_rings are properly freed if rx_rings allocation or setup fails. Compile tested only. Issue found using a prototype static analysis tool and code review. Fixes: fcea6f3da546 ("ice: Add stats and ethtool support") Fixes: efc2214b6047 ("ice: Add support for XDP") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-03ice: fix retry for AQ command 0x06EEJakub Staniszewski
Executing ethtool -m can fail reporting a netlink I/O error while firmware link management holds the i2c bus used to communicate with the module. According to Intel(R) Ethernet Controller E810 Datasheet Rev 2.8 [1] Section 3.3.10.4 Read/Write SFF EEPROM (0x06EE) request should to be retried upon receiving EBUSY from firmware. Commit e9c9692c8a81 ("ice: Reimplement module reads used by ethtool") implemented it only for part of ice_get_module_eeprom(), leaving all other calls to ice_aq_sff_eeprom() vulnerable to returning early on getting EBUSY without retrying. Remove the retry loop from ice_get_module_eeprom() and add Admin Queue (AQ) command with opcode 0x06EE to the list of commands that should be retried on receiving EBUSY from firmware. Cc: stable@vger.kernel.org Fixes: e9c9692c8a81 ("ice: Reimplement module reads used by ethtool") Signed-off-by: Jakub Staniszewski <jakub.staniszewski@linux.intel.com> Co-developed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://www.intel.com/content/www/us/en/content-details/613875/intel-ethernet-controller-e810-datasheet.html [1] Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-03ice: reintroduce retry mechanism for indirect AQJakub Staniszewski
Add retry mechanism for indirect Admin Queue (AQ) commands. To do so we need to keep the command buffer. This technically reverts commit 43a630e37e25 ("ice: remove unused buffer copy code in ice_sq_send_cmd_retry()"), but combines it with a fix in the logic by using a kmemdup() call, making it more robust and less likely to break in the future due to programmer error. Cc: Michal Schmidt <mschmidt@redhat.com> Cc: stable@vger.kernel.org Fixes: 3056df93f7a8 ("ice: Re-send some AQ commands, as result of EBUSY AQ error") Signed-off-by: Jakub Staniszewski <jakub.staniszewski@linux.intel.com> Co-developed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-03ice: fix adding AQ LLDP filter for VFLarysa Zaremba
The referenced commit came from a misunderstanding of the FW LLDP filter AQ (Admin Queue) command due to the error in the internal documentation. Contrary to the assumptions in the original commit, VFs can be added and deleted from this filter without any problems. Introduced dev_info message proved to be useful, so reverting the whole commit does not make sense. Without this fix, trusted VFs do not receive LLDP traffic, if there is an AQ LLDP filter on PF. When trusted VF attempts to add an LLDP multicast MAC address, the following message can be seen in dmesg on host: ice 0000:33:00.0: Failed to add Rx LLDP rule on VSI 20 error: -95 Revert checking VSI type when adding LLDP filter through AQ. Fixes: 4d5a1c4e6d49 ("ice: do not add LLDP-specific filter if not necessary") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-03crypto: testmgr - Fix stale references to aes-genericEric Biggers
Due to commit a2484474272e ("crypto: aes - Replace aes-generic with wrapper around lib"), the "aes-generic" driver name has been replaced with "aes-lib". Update a couple testmgr entries that were added concurrently with this change. Fixes: a22d48cbe558 ("crypto: testmgr - Add test vectors for authenc(hmac(sha224),cbc(aes))") Fixes: 030218dedee2 ("crypto: testmgr - Add test vectors for authenc(hmac(sha384),cbc(aes))") Acked-by: Aleksander Jan Bajkowski <olek2@wp.pl> Link: https://lore.kernel.org/r/20260302234856.30569-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-03-03drm/msm: Fix dma_free_attrs() buffer sizeThomas Fourier
The gpummu->table buffer is alloc'd with size TABLE_SIZE + 32 in a2xx_gpummu_new() but freed with size TABLE_SIZE in a2xx_gpummu_destroy(). Change the free size to match the allocation. Fixes: c2052a4e5c99 ("drm/msm: implement a2xx mmu") Cc: <stable@vger.kernel.org> Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Patchwork: https://patchwork.freedesktop.org/patch/707340/ Message-ID: <20260226095714.12126-2-fourier.thomas@gmail.com> Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>