Create dedicated EQs and MSI for each vPort#4
Open
longlimsft wants to merge 6 commits into
Open
Conversation
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ sharing among the vPorts and create dedicated EQs for each vPort. Move the EQ definition from struct mana_context to struct mana_port_context and update related support functions. Export mana_create_eq() and mana_destroy_eq() for use by the MANA RDMA driver.
When querying the device, adjust the max number of queues to allow dedicated MSI-X vectors for each vPort. The number of queues per vPort is clamped to no less than 16. MSI-X sharing among vPorts is disabled by default and is only enabled when there are not enough MSI-X vectors for dedicated allocation. Rename mana_query_device_cfg() to mana_gd_query_device_cfg() as it is used at GDMA device probe time for querying device capabilities.
…ement To allow Ethernet EQs to use dedicated or shared MSI-X vectors and RDMA EQs to share the same MSI-X, introduce a GIC (GDMA IRQ Context) with reference counting. This allows the driver to create an interrupt context on an assigned or unassigned MSI-X vector and share it across multiple EQ consumers.
Replace the GDMA global interrupt setup code with the new GIC allocation and release functions for managing interrupt contexts.
Use GIC functions to create a dedicated interrupt context or acquire a shared interrupt context for each EQ when setting up a vPort.
Use the GIC functions to allocate interrupt contexts for RDMA EQs. These interrupt contexts may be shared with Ethernet EQs when MSI-X vectors are limited. The driver now supports allocating dedicated MSI-X for each EQ. Indicate this capability through driver capability bits.
longlimsft
pushed a commit
that referenced
this pull request
Jun 4, 2026
Since the start of the git history, brport_store() always acquired the
bridge lock. Back then this decision made sense: The bridge lock
protects the STP state of the bridge and its ports and at that time the
function was only used by two STP related attributes (cost and
priority).
Nowadays, brport_store() processes a lot more attributes and most of
them do not need the bridge lock:
* Bridge flags: Only require RTNL. Read locklessly by the data path.
Annotations can be added in net-next.
* FDB port flushing: Only requires the FDB lock.
* Multicast attributes: Only require the multicast lock.
* Group forward mask: Only requires RTNL. Read locklessly by the data
path. Annotations can be added in net-next.
* Backup port: Only requires RTNL. Read locklessly by the data path.
This is a problem as the bridge calls dev_set_promiscuity() when certain
bridge port flags change and this function can sleep since the commit
cited below, resulting in a splat such as [1].
Fix this by reducing the scope of the bridge lock and only take it when
processing the two STP related attributes that require it. Remove the
now stale comment from br_switchdev_set_port_flag(). The
SWITCHDEV_F_DEFER flag can be removed in net-next.
[1]
BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 372, name: bash
preempt_count: 201, expected: 0
RCU nest depth: 0, expected: 0
5 locks held by bash/372:
#0: ffff88810c51c3f0 (sb_writers#7){.+.+}-{0:0}, at: ksys_write (fs/read_write.c:740)
#1: ffff888115ce9480 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter (fs/kernfs/file.c:343)
#2: ffff88810b9fd330 (kn->active#37){.+.+}-{0:0}, at: kernfs_fop_write_iter (fs/kernfs/file.c:80 fs/kernfs/file.c:344)
#3: ffffffffa59473a0 (rtnl_mutex){+.+.}-{4:4}, at: brport_store (net/bridge/br_sysfs_if.c:326)
#4: ffff8881099d2d58 (&br->lock){+...}-{3:3}, at: brport_store (./include/linux/spinlock.h:348 net/bridge/br_sysfs_if.c:345)
Preemption disabled at:
0x0
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120)
__might_resched.cold (kernel/sched/core.c:9163)
netif_rx_mode_run (net/core/dev_addr_lists.c:1262)
netif_rx_mode_sync (net/core/dev_addr_lists.c:1428)
dev_set_promiscuity (net/core/dev_api.c:289)
br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172)
br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747)
store_learning (net/bridge/br_sysfs_if.c:79 net/bridge/br_sysfs_if.c:235)
brport_store (net/bridge/br_sysfs_if.c:346)
kernfs_fop_write_iter (fs/kernfs/file.c:352)
new_sync_write (fs/read_write.c:595)
vfs_write (fs/read_write.c:688)
ksys_write (fs/read_write.c:740)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
Fixes: 78cd408 ("net: add missing instance lock to dev_set_promiscuity")
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526064818.272516-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
longlimsft
pushed a commit
that referenced
this pull request
Jun 4, 2026
…kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 7.1, take #4 - Restore CONFIG_PKVM_DISABLE_STAGE2_ON_PANIC to its former glory by making sure the config symbol is correctly spelled out in the code - Don't reset the AArch32 view of the PMU counters to zero when the guest is writing to them - Fix an assorted collection of memory leaks in the newly added tracing code - Fix the capping of ZCR_EL2 which could be used in an unsanitised way by an L2 guest
longlimsft
pushed a commit
that referenced
this pull request
Jun 4, 2026
Writing to the netdevsim debugfs file "netdevsim/netdevsimN/fib/nexthop_bucket_activity" enters nsim_nexthop_bucket_activity_write(), which looks up a nexthop in data->nexthop_ht under rtnl_lock(). If a network namespace teardown, devlink reload or device deletion runs concurrently, nsim_fib_destroy() frees that rhashtable (and the surrounding nsim_fib_data) while the write is still in flight, leading to a slab-use-after-free: BUG: KASAN: slab-use-after-free in nsim_nexthop_bucket_activity_write+0xb9e/0xdf0 Read of size 4 at addr ff1100001a379808 by task syz.0.11967/27894 CPU: 0 UID: 0 PID: 27894 Comm: syz.0.11967 Not tainted 7.1.0-rc4-gf6f1bfc1980a #4 Call Trace: nsim_nexthop_bucket_activity_write+0xb9e/0xdf0 full_proxy_write+0x135/0x1a0 vfs_write+0x2e2/0x1040 ksys_write+0x146/0x270 __x64_sys_write+0x76/0xb0 do_syscall_64+0xb9/0x5b0 entry_SYSCALL_64_after_hwframe+0x74/0x7c Allocated by task 15957: rhashtable_init_noprof+0x3ec/0x860 nsim_fib_create+0x371/0xca0 nsim_drv_probe+0xd60/0x15c0 ... new_device_store+0x425/0x7f0 Freed by task 24: rhashtable_free_and_destroy+0x10d/0x620 nsim_fib_destroy+0xc9/0x1c0 nsim_dev_reload_destroy+0x1e7/0x530 nsim_dev_reload_down+0x6b/0xd0 devlink_reload+0x1b5/0x770 devlink_pernet_pre_exit+0x25d/0x3a0 ops_undo_list+0x1b7/0xb90 cleanup_net+0x47f/0x8a0 The buggy address belongs to the object at ff1100001a379800 which belongs to the cache kmalloc-1k of size 1024 The freed 1k object is the bucket table of data->nexthop_ht. Shortly after, the dangling table is dereferenced again and the machine also takes a GPF in __rht_bucket_nested() from the same call site. The root cause is a lifetime mismatch: the debugfs files reference nsim_fib_data (the writer dereferences data->nexthop_ht), but the interface is not bracketed around the lifetime of that data. nsim_fib_destroy() freed both rhashtables and only removed the debugfs directory afterwards, and nsim_fib_create() created the debugfs files before the rhashtables were initialized and, on the error path, freed them before removing the files. debugfs keeps the file itself alive across a ->write() via debugfs_file_get()/debugfs_file_put() (fs/debugfs/file.c), but it does not keep data->nexthop_ht alive, so the in-flight writer dereferenced freed memory. rtnl_lock() in the writer does not help, because the teardown path does not take rtnl around rhashtable_free_and_destroy(). Fix it by bracketing the debugfs interface around the data it exposes, keeping nsim_fib_create() and nsim_fib_destroy() symmetric: - In nsim_fib_destroy(), tear down the debugfs files before the data structures they reference. debugfs_remove_recursive() drops the initial active-user reference and then waits for every in-flight ->write() to drop its reference before returning, and rejects new opens (__debugfs_file_removed(), fs/debugfs/inode.c). Once it returns, no debugfs accessor can reach the FIB data, so the rhashtables and nsim_fib_data can be destroyed safely. This also covers the bool knobs in the same directory, which store pointers into the same nsim_fib_data, and the final kfree(data). - In nsim_fib_create(), create the debugfs files after the rhashtables and notifiers are set up. This closes the same race on the error-unwind path, where a concurrent writer could otherwise observe a half-constructed instance or a table that the unwind has already freed. (With only the destroy-side change, a writer racing the create window instead dereferences an uninitialized data->nexthop_ht.) This is reproducible by racing, in a loop, writes to /sys/kernel/debug/netdevsim/netdevsimN/fib/nexthop_bucket_activity against a teardown of the same netdevsim instance -- a devlink reload ("devlink dev reload netdevsim/netdevsimN"), destroying the network namespace it lives in, or "echo N > /sys/bus/netdevsim/del_device". It was found with syzkaller; a syzkaller reproducer is available. A standalone C reproducer does not trigger it reliably because the race needs the netns-teardown/reload path. Cc: <stable+noautosel@kernel.org> # netdevsim is a test harness, it's never loaded on production systems Signed-off-by: Zijing Yin <yzjaurora@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260529135718.1804031-1-yzjaurora@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.