forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 1
Sync update #33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Sync update #33
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
They all do the same basic thing. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-17-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <[email protected]>
Allow to walk the unified namespace list completely locklessly. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-18-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <[email protected]>
Add a new listns() system call that allows userspace to iterate through namespaces in the system. This provides a programmatic interface to discover and inspect namespaces, enhancing existing namespace apis. Currently, there is no direct way for userspace to enumerate namespaces in the system. Applications must resort to scanning /proc/<pid>/ns/ across all processes, which is: 1. Inefficient - requires iterating over all processes 2. Incomplete - misses inactive namespaces that aren't attached to any running process but are kept alive by file descriptors, bind mounts, or parent namespace references 3. Permission-heavy - requires access to /proc for many processes 4. No ordering or ownership. 5. No filtering per namespace type: Must always iterate and check all namespaces. The list goes on. The listns() system call solves these problems by providing direct kernel-level enumeration of namespaces. It is similar to listmount() but obviously tailored to namespaces. /* * @Req: Pointer to struct ns_id_req specifying search parameters * @ns_ids: User buffer to receive namespace IDs * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return) * @flags: Reserved for future use (must be 0) */ ssize_t listns(const struct ns_id_req *req, u64 *ns_ids, size_t nr_ns_ids, unsigned int flags); Returns: - On success: Number of namespace IDs written to ns_ids - On error: Negative error code /* * @SiZe: Structure size * @ns_id: Starting point for iteration; use 0 for first call, then * use the last returned ID for subsequent calls to paginate * @ns_type: Bitmask of namespace types to include (from enum ns_type): * 0: Return all namespace types * MNT_NS: Mount namespaces * NET_NS: Network namespaces * USER_NS: User namespaces * etc. Can be OR'd together * @user_ns_id: Filter results to namespaces owned by this user namespace: * 0: Return all namespaces (subject to permission checks) * LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace * Other value: Namespaces owned by the specified user namespace ID */ struct ns_id_req { __u32 size; /* sizeof(struct ns_id_req) */ __u32 spare; /* Reserved, must be 0 */ __u64 ns_id; /* Last seen namespace ID (for pagination) */ __u32 ns_type; /* Filter by namespace type(s) */ __u32 spare2; /* Reserved, must be 0 */ __u64 user_ns_id; /* Filter by owning user namespace */ }; Example 1: List all namespaces void list_all_namespaces(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, /* Start from beginning */ .ns_type = 0, /* All types */ .user_ns_id = 0, /* All user namespaces */ }; uint64_t ids[100]; ssize_t ret; printf("All namespaces in the system:\n"); do { ret = listns(&req, ids, 100, 0); if (ret < 0) { perror("listns"); break; } for (ssize_t i = 0; i < ret; i++) printf(" Namespace ID: %llu\n", (unsigned long long)ids[i]); /* Continue from last seen ID */ if (ret > 0) req.ns_id = ids[ret - 1]; } while (ret == 100); /* Buffer was full, more may exist */ } Example 2: List network namespaces only void list_network_namespaces(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, .ns_type = NET_NS, /* Only network namespaces */ .user_ns_id = 0, }; uint64_t ids[100]; ssize_t ret; ret = listns(&req, ids, 100, 0); if (ret < 0) { perror("listns"); return; } printf("Network namespaces: %zd found\n", ret); for (ssize_t i = 0; i < ret; i++) printf(" netns ID: %llu\n", (unsigned long long)ids[i]); } Example 3: List namespaces owned by current user namespace void list_owned_namespaces(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, .ns_type = 0, /* All types */ .user_ns_id = LISTNS_CURRENT_USER, /* Current userns */ }; uint64_t ids[100]; ssize_t ret; ret = listns(&req, ids, 100, 0); if (ret < 0) { perror("listns"); return; } printf("Namespaces owned by my user namespace: %zd\n", ret); for (ssize_t i = 0; i < ret; i++) printf(" ns ID: %llu\n", (unsigned long long)ids[i]); } Example 4: List multiple namespace types void list_network_and_mount_namespaces(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, .ns_type = NET_NS | MNT_NS, /* Network and mount */ .user_ns_id = 0, }; uint64_t ids[100]; ssize_t ret; ret = listns(&req, ids, 100, 0); printf("Network and mount namespaces: %zd found\n", ret); } Example 5: Pagination through large namespace sets void list_all_with_pagination(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, .ns_type = 0, .user_ns_id = 0, }; uint64_t ids[50]; size_t total = 0; ssize_t ret; printf("Enumerating all namespaces with pagination:\n"); while (1) { ret = listns(&req, ids, 50, 0); if (ret < 0) { perror("listns"); break; } if (ret == 0) break; /* No more namespaces */ total += ret; printf(" Batch: %zd namespaces\n", ret); /* Last ID in this batch becomes start of next batch */ req.ns_id = ids[ret - 1]; if (ret < 50) break; /* Partial batch = end of results */ } printf("Total: %zu namespaces\n", total); } Permission Model listns() respects namespace isolation and capabilities: (1) Global listing (user_ns_id = 0): - Requires CAP_SYS_ADMIN in the namespace's owning user namespace - OR the namespace must be in the caller's namespace context (e.g., a namespace the caller is currently using) - User namespaces additionally allow listing if the caller has CAP_SYS_ADMIN in that user namespace itself (2) Owner-filtered listing (user_ns_id != 0): - Requires CAP_SYS_ADMIN in the specified owner user namespace - OR the namespace must be in the caller's namespace context - This allows unprivileged processes to enumerate namespaces they own (3) Visibility: - Only "active" namespaces are listed - A namespace is active if it has a non-zero __ns_ref_active count - This includes namespaces used by running processes, held by open file descriptors, or kept active by bind mounts - Inactive namespaces (kept alive only by internal kernel references) are not visible via listns() Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-19-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <[email protected]>
Add the listns() system call to all architectures. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-20-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Arnd Bergmann <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Ensure all the new uapi bits are visible for the selftests. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-21-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
This is effectively unused and doesn't really server any purpose after having reviewed all of the tests that rely on it. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-22-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test that initial namespaces can be reopened via file handle. Initial namespaces should always have a ref count of one from boot. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-23-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test namespace lifecycle: create a namespace in a child process, get a file handle while it's active, then try to reopen after the process exits (namespace becomes inactive). Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-24-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test that a namespace remains active while a process is using it, even after the creating process exits. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-25-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test user namespace active ref tracking via credential lifecycle. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-26-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test PID namespace active ref tracking Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-27-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test that an open file descriptor keeps a namespace active. Even after the creating process exits, the namespace should remain active as long as an fd is held open. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-28-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test hierarchical active reference propagation. When a child namespace is active, its owning user namespace should also be active automatically due to hierarchical active reference propagation. This ensures parents are always reachable when children are active. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-29-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test that bind mounts keep namespaces in the tree even when inactive Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-30-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test multi-level hierarchy (3+ levels deep). Grandparent → Parent → Child When child is active, both parent AND grandparent should be active. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-31-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test multiple children sharing same parent. Parent should stay active as long as ANY child is active. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-32-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test that different namespace types with same owner all contribute active references to the owning user namespace. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-33-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test hierarchical propagation with deep namespace hierarchy. Create: init_user_ns -> user_A -> user_B -> net_ns When net_ns is active, both user_A and user_B should be active. This verifies the conditional recursion in __ns_ref_active_put() works. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-34-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test that parent stays active as long as ANY child is active. Create parent user namespace with two child net namespaces. Parent should remain active until BOTH children are inactive. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-35-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test that user namespace as a child also propagates correctly. Create user_A -> user_B, verify when user_B is active that user_A is also active. This is different from non-user namespace children. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-36-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test different namespace types (net, uts, ipc) all contributing active references to the same owning user namespace. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-37-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Add a wrapper for the listns() system call. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-38-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test basic listns() functionality with the unified namespace tree. List all active namespaces globally. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-39-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
test listns() with type filtering. List only network namespaces. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-40-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test listns() pagination. List namespaces in batches. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-41-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test listns() with LISTNS_CURRENT_USER. List namespaces owned by current user namespace. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-42-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test that listns() only returns active namespaces. Create a namespace, let it become inactive, verify it's not listed. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-43-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test listns() with specific user namespace ID. Create a user namespace and list namespaces it owns. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-44-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test listns() with multiple namespace types filter. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-45-2e6f823ebdc0@kernel.org Tested-by: [email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Test that hierarchical active reference propagation keeps parent user namespaces visible in listns(). Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-46-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
There are some cases where when iomap_read_end() is called, the folio may already have been marked uptodate. For example, if the iomap block needed zeroing, then the folio may have been marked uptodate after the zeroing. iomap_read_end() should unlock the folio instead of calling folio_end_read(), which is how these cases were handled prior to commit f8eaf79 ("iomap: simplify ->read_folio_range() error handling for reads"). Calling folio_end_read() on an uptodate folio leads to buggy behavior where marking an already uptodate folio as uptodate will XOR it to be marked nonuptodate. Fixes: f8eaf79 ("iomap: simplify ->read_folio_range() error handling for reads") Signed-off-by: Joanne Koong <[email protected]> Link: https://patch.msgid.link/[email protected] Tested-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reported-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Since commit 222f2c7 ("iomap: always run error completions in user context"), read error completions are deferred to s_dio_done_wq. This means the workqueue also needs to be allocated for async reads. Fixes: 222f2c7 ("iomap: always run error completions in user context") Reported-by: [email protected] Signed-off-by: Christoph Hellwig <[email protected]> Link: https://patch.msgid.link/[email protected] Tested-by: [email protected] Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
In the inode hash code grab the state while ->i_lock is held. If found to be set, synchronize the sleep once more with the lock held. In the real world the flag is not set most of the time. Apart from being simpler to reason about, it comes with a minor speed up as now clearing the flag does not require the smp_mb() fence. While here rename wait_on_inode() to wait_on_new_inode() to line it up with __wait_on_freeing_inode(). Christian Brauner <[email protected]> says: As per the discussion in [1] I folded in the diff sent in [2]. Link: https://lore.kernel.org/[email protected] [1] Link: https://lore.kernel.org/c2kpawomkbvtahjm7y5mposbhckb7wxthi3iqy5yr22ggpucrm@ufvxwy233qxo [2] Signed-off-by: Mateusz Guzik <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Christian Brauner <[email protected]>
1. inode_bit_waitqueue() was somehow placed between __inode_add_lru() and inode_add_lru(). move it up 2. assert ->i_lock is held in __inode_add_lru instead of just claiming it is needed 3. s/__inode_add_lru/__inode_lru_list_add/ for consistency with itself (inode_lru_list_del()) and similar routines for sb and io list management 4. push list presence check into inode_lru_list_del(), just like sb and io list Signed-off-by: Mateusz Guzik <[email protected]> Link: https://patch.msgid.link/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
For consistency with sb routines. ext4 is the only consumer outside of evict(). Damage-controlling it is outside of the scope of this cleanup. Signed-off-by: Mateusz Guzik <[email protected]> Link: https://patch.msgid.link/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Currently the two high-level APIs use two helper functions to implement almost all of the logic. Refactor the two helpers and the common logic into a new file_update_time_flags routine that gets the iocb flags or 0 in case of file_update_time passed so that the entire logic is contained in a single function and can be easily understood and modified. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://patch.msgid.link/[email protected] Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
FMODE_NOCMTIME used to be just a hack for the legacy XFS handle-based "invisible I/O", but commit e5e9b24 ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation") started using it from generic callers. I'm not sure other file systems are actually read for this in general, so the above commit should get a closer look, but for it to make any sense, file_update_time needs to respect the flag. Lift the check from file_modified_flags to file_update_time so that users of file_update_time inherit the behavior and so that all the checks are done in one place. Fixes: e5e9b24 ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation") Signed-off-by: Christoph Hellwig <[email protected]> Link: https://patch.msgid.link/[email protected] Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
This will be used to replace an incorrect direct call into generic_update_time in btrfs. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://patch.msgid.link/[email protected] Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Btrfs updates the device node timestamps for block device special files when it stop using the device. Commit 8f96a5b ("btrfs: update the bdev time directly when closing") switch that update from the correct layering to directly call the low-level helper on the bdev inode. This is wrong and got fixed in commit 54fde91 ("btrfs: update device path inode time instead of bd_inode") by updating the file system inode instead of the bdev inode, but this kept the incorrect bypassing of the VFS interfaces and file system ->update_times method. Fix this by using the propet vfs_utimes interface. Fixes: 8f96a5b ("btrfs: update the bdev time directly when closing") Fixes: 54fde91 ("btrfs: update device path inode time instead of bd_inode") Signed-off-by: Christoph Hellwig <[email protected]> Link: https://patch.msgid.link/[email protected] Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Since commit e41f941 ("Btrfs: move over to use ->update_time") this is not a copy of the high-level file_update_time helper. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Christian Brauner <[email protected]>
Orangefs has no i_version handling and __orangefs_setattr already explicitly marks the inode dirty. So instead of the using the flags return value from generic_update_time, just call the lower level inode_update_timestamps helper directly. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://patch.msgid.link/[email protected] Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
Christoph Hellwig <[email protected]> says: [Fix] the layering bypass in btrfs when updating timestamps on device files for devices removed from btrfs usage, and FMODE_NOCMTIME handling in the VFS now that nfsd started using it. Note that I'm still not sure that nfsd usage is fully correct for all file systems, as only XFS explicitly supports FMODE_NOCMTIME, but at least the generic code does the right thing now. * patches from https://patch.msgid.link/[email protected]: orangefs: use inode_update_timestamps directly btrfs: fix the comment on btrfs_update_time btrfs: use vfs_utimes to update file timestamps fs: export vfs_utimes fs: lift the FMODE_NOCMTIME check into file_update_time_flags fs: refactor file timestamp update logic Link: https://patch.msgid.link/[email protected] Signed-off-by: Christian Brauner <[email protected]>
Symlink handling is already marked as unlikely and pushing out some of it into pick_link() reduces register spillage on entry to step_into() with gcc 14.2. The compiler needed additional convincing that handle_mounts() is unlikely to fail. At the same time neither clang nor gcc could be convinced to tail-call into pick_link(). While pick_link() takes an address of stack-based object as an argument (which definitely prevents the optimization), splitting it into separate <dentry, mount> tuple did not help. The issue persists even when compiled without stack protector. As such nothing was done about this for the time being to not grow the diff. Signed-off-by: Mateusz Guzik <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Christian Brauner <[email protected]>
The primary consumer is link_path_walk(), calling walk_component() every time which in turn calls step_into(). Inlining these saves overhead of 2 function calls per path component, along with allowing the compiler to do better job optimizing them in place. step_into() had absolutely atrocious assembly to facilitate the slowpath. In order to lessen the burden at the callsite all the hard work is moved into step_into_slowpath() and instead an inline-able fastpath is implemented for rcu-walk. The new fastpath is a stripped down step_into() RCU handling with a d_managed() check from handle_mounts(). Benchmarked as follows on Sapphire Rapids: 1. the "before" was a kernel with not-yet-merged optimizations (notably elision of calls to security_inode_permission() and marking ext4 inodes as not having acls as applicable) 2. "after" is the same + the prep patch + this patch 3. benchmark consists of issuing 205 calls to access(2) in a loop with pathnames lifted out of gcc and the linker building real code, most of which have several path components and 118 of which fail with -ENOENT. Result in terms of ops/s: before: 21619 after: 22536 (+4%) profile before: 20.25% [kernel] [k] __d_lookup_rcu 10.54% [kernel] [k] link_path_walk 10.22% [kernel] [k] entry_SYSCALL_64 6.50% libc.so.6 [.] __GI___access 6.35% [kernel] [k] strncpy_from_user 4.87% [kernel] [k] step_into 3.68% [kernel] [k] kmem_cache_alloc_noprof 2.88% [kernel] [k] walk_component 2.86% [kernel] [k] kmem_cache_free 2.14% [kernel] [k] set_root 2.08% [kernel] [k] lookup_fast after: 23.38% [kernel] [k] __d_lookup_rcu 11.27% [kernel] [k] entry_SYSCALL_64 10.89% [kernel] [k] link_path_walk 7.00% libc.so.6 [.] __GI___access 6.88% [kernel] [k] strncpy_from_user 3.50% [kernel] [k] kmem_cache_alloc_noprof 2.01% [kernel] [k] kmem_cache_free 2.00% [kernel] [k] set_root 1.99% [kernel] [k] lookup_fast 1.81% [kernel] [k] do_syscall_64 1.69% [kernel] [k] entry_SYSCALL_64_safe_stack While walk_component() and step_into() of course disappear from the profile, the link_path_walk() barely gets more overhead despite the inlining thanks to the fast path added and while completing more walks per second. I did not investigate why overhead grew a lot on __d_lookup_rcu(). Signed-off-by: Mateusz Guzik <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Christian Brauner <[email protected]>
Cleanup step_into() and walk_component() and inline them both. * patches from https://patch.msgid.link/[email protected]: fs: inline step_into() and walk_component() fs: tidy up step_into() & friends before inlining Link: https://patch.msgid.link/[email protected] Signed-off-by: Christian Brauner <[email protected]>
German Maglione is a co-maintainer of the virtiofsd userspace device implementation (https://gitlab.com/virtio-fs/virtiofsd) and is currently one of the most active virtiofs developers outside the kernel. I have not worked on virtiofs except to review kernel patches for a few years now and would like German to take over from me gradually. It is healthier to have a kernel maintainer who is actively involved. I expect to remove myself in a few months. Signed-off-by: Stefan Hajnoczi <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Christian Brauner <[email protected]>
Rationale is that if the parent dentry is the same and the length is the same, then you have to be unlucky for the name to not match. At the same time the dentry was literally just found on the hash, so you have to be even more unlucky to determine it is unhashed. While here add commentary while d_unhashed() is necessary. It was already removed once and brought back in: 2e32180 ("Revert "vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu"") Signed-off-by: Mateusz Guzik <[email protected]> Link: https://patch.msgid.link/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
…kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
"FUSE iomap Support for Buffered Reads:
This adds iomap support for FUSE buffered reads and readahead. This
enables granular uptodate tracking with large folios so only
non-uptodate portions need to be read. Also fixes a race condition
with large folios + writeback cache that could cause data corruption
on partial writes followed by reads.
- Refactored iomap read/readahead bio logic into helpers
- Added caller-provided callbacks for read operations
- Moved buffered IO bio logic into new file
- FUSE now uses iomap for read_folio and readahead
Zero Range Folio Batch Support:
Add folio batch support for iomap_zero_range() to handle dirty
folios over unwritten mappings. Fix raciness issues where dirty data
could be lost during zero range operations.
- filemap_get_folios_tag_range() helper for dirty folio lookup
- Optional zero range dirty folio processing
- XFS fills dirty folios on zero range of unwritten mappings
- Removed old partial EOF zeroing optimization
DIO Write Completions from Interrupt Context:
Restore pre-iomap behavior where pure overwrite completions run
inline rather than being deferred to workqueue. Reduces context
switches for high-performance workloads like ScyllaDB.
- Removed unused IOCB_DIO_CALLER_COMP code
- Error completions always run in user context (fixes zonefs)
- Reworked REQ_FUA selection logic
- Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP
Buffered IO Cleanups:
Some performance and code clarity improvements:
- Replace manual bitmap scanning with find_next_bit()
- Simplify read skip logic for writes
- Optimize pending async writeback accounting
- Better variable naming
- Documentation for iomap_finish_folio_write() requirements
Misaligned Vectors for Zoned XFS:
Enables sub-block aligned vectors in XFS always-COW mode for zoned
devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.
Bug Fixes:
- Allocate s_dio_done_wq for async reads (fixes syzbot report after
error completion changes)
- Fix iomap_read_end() for already uptodate folios (regression fix)"
* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
iomap: allocate s_dio_done_wq for async reads as well
iomap: fix iomap_read_end() for already uptodate folios
iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
iomap: support write completions from interrupt context
iomap: rework REQ_FUA selection
iomap: always run error completions in user context
fs, iomap: remove IOCB_DIO_CALLER_COMP
iomap: use find_next_bit() for uptodate bitmap scanning
iomap: use find_next_bit() for dirty bitmap scanning
iomap: simplify when reads can be skipped for writes
iomap: simplify ->read_folio_range() error handling for reads
iomap: optimize pending async writeback accounting
docs: document iomap writeback's iomap_finish_folio_write() requirement
iomap: account for unaligned end offsets when truncating read range
iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
xfs: support sub-block aligned vectors in always COW mode
iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
xfs: error tag to force zeroing on debug kernels
iomap: remove old partial eof zeroing optimization
xfs: fill dirty folios on zero range of unwritten mappings
...
…ernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
permission checks during path lookup and adds the
IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
expensive permission work.
- Hide dentry_cache behind runtime const machinery.
- Add German Maglione as virtiofs co-maintainer.
Cleanups:
- Tidy up and inline step_into() and walk_component() for improved
code generation.
- Re-enable IOCB_NOWAIT writes to files. This refactors file
timestamp update logic, fixing a layering bypass in btrfs when
updating timestamps on device files and improving FMODE_NOCMTIME
handling in VFS now that nfsd started using it.
- Path lookup optimizations extracting slowpaths into dedicated
routines and adding branch prediction hints for mntput_no_expire(),
fd_install(), lookup_slow(), and various other hot paths.
- Enable clang's -fms-extensions flag, requiring a JFS rename to
avoid conflicts.
- Remove spurious exports in fs/file_attr.c.
- Stop duplicating union pipe_index declaration. This depends on the
shared kbuild branch that brings in -fms-extensions support which
is merged into this branch.
- Use MD5 library instead of crypto_shash in ecryptfs.
- Use largest_zero_folio() in iomap_dio_zero().
- Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
initrd code.
- Various typo fixes.
Fixes:
- Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
call with wait == 1 to commit super blocks. The emergency sync path
never passed this, leaving btrfs data uncommitted during emergency
sync.
- Use local kmap in watch_queue's post_one_notification().
- Add hint prints in sb_set_blocksize() for LBS dependency on THP"
* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
MAINTAINERS: add German Maglione as virtiofs co-maintainer
fs: inline step_into() and walk_component()
fs: tidy up step_into() & friends before inlining
orangefs: use inode_update_timestamps directly
btrfs: fix the comment on btrfs_update_time
btrfs: use vfs_utimes to update file timestamps
fs: export vfs_utimes
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
fs: refactor file timestamp update logic
include/linux/fs.h: trivial fix: regualr -> regular
fs/splice.c: trivial fix: pipes -> pipe's
fs: mark lookup_slow() as noinline
fs: add predicts based on nd->depth
fs: move mntput_no_expire() slowpath into a dedicated routine
fs: remove spurious exports in fs/file_attr.c
watch_queue: Use local kmap in post_one_notification()
fs: touch up predicts in path lookup
fs: move fd_install() slowpath into a dedicated routine and provide commentary
fs: hide dentry_cache behind runtime const machinery
fs: touch predicts in do_dentry_open()
...
…kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, converts all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
…nux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
…kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
"This contains substantial namespace infrastructure changes including a new
system call, active reference counting, and extensive header cleanups.
The branch depends on the shared kbuild branch for -fms-extensions support.
Features:
- listns() system call
Add a new listns() system call that allows userspace to iterate
through namespaces in the system. This provides a programmatic
interface to discover and inspect namespaces, addressing
longstanding limitations:
Currently, there is no direct way for userspace to enumerate
namespaces. Applications must resort to scanning /proc/*/ns/ across
all processes, which is:
- Inefficient - requires iterating over all processes
- Incomplete - misses namespaces not attached to any running
process but kept alive by file descriptors, bind mounts, or
parent references
- Permission-heavy - requires access to /proc for many processes
- No ordering or ownership information
- No filtering per namespace type
The listns() system call solves these problems:
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
size_t nr_ns_ids, unsigned int flags);
struct ns_id_req {
__u32 size;
__u32 spare;
__u64 ns_id;
struct /* listns */ {
__u32 ns_type;
__u32 spare2;
__u64 user_ns_id;
};
};
Features include:
- Pagination support for large namespace sets
- Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.)
- Filtering by owning user namespace
- Permission checks respecting namespace isolation
- Active Reference Counting
Introduce an active reference count that tracks namespace
visibility to userspace. A namespace is visible in the following
cases:
- The namespace is in use by a task
- The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount)
- The namespace is a hierarchical type and is the parent of child
namespaces
The active reference count does not regulate lifetime (that's still
done by the normal reference count) - it only regulates visibility
to namespace file handles and listns().
This prevents resurrection of namespaces that are pinned only for
internal kernel reasons (e.g., user namespaces held by
file->f_cred, lazy TLB references on idle CPUs, etc.) which should
not be accessible via (1)-(3).
- Unified Namespace Tree
Introduce a unified tree structure for all namespaces with:
- Fixed IDs assigned to initial namespaces
- Lookup based solely on inode number
- Maintained list of owned namespaces per user namespace
- Simplified rbtree comparison helpers
Cleanups
- Header Reorganization:
- Move namespace types into separate header (ns_common_types.h)
- Decouple nstree from ns_common header
- Move nstree types into separate header
- Switch to new ns_tree_{node,root} structures with helper functions
- Use guards for ns_tree_lock
- Initial Namespace Reference Count Optimization
- Make all reference counts on initial namespaces a nop to avoid
pointless cacheline ping-pong for namespaces that can never go
away
- Drop custom reference count initialization for initial namespaces
- Add NS_COMMON_INIT() macro and use it for all namespaces
- pid: rely on common reference count behavior
- Miscellaneous Cleanups
- Rename exit_task_namespaces() to exit_nsproxy_namespaces()
- Rename is_initial_namespace() and make argument const
- Use boolean to indicate anonymous mount namespace
- Simplify owner list iteration in nstree
- nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly
- nsfs: use inode_just_drop()
- pidfs: raise DCACHE_DONTCACHE explicitly
- pidfs: simplify PIDFD_GET__NAMESPACE ioctls
- libfs: allow to specify s_d_flags
- cgroup: add cgroup namespace to tree after owner is set
- nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
Fixes:
- setns(pidfd, ...) race condition
Fix a subtle race when using pidfds with setns(). When the target
task exits after prepare_nsset() but before commit_nsset(), the
namespace's active reference count might have been dropped. If
setns() then installs the namespaces, it would bump the active
reference count from zero without taking the required reference on
the owner namespace, leading to underflow when later decremented.
The fix resurrects the ownership chain if necessary - if the caller
succeeded in grabbing passive references, the setns() should
succeed even if the target task exits or gets reaped.
- Return EFAULT on put_user() error instead of success
- Make sure references are dropped outside of RCU lock (some
namespaces like mount namespace sleep when putting the last
reference)
- Don't skip active reference count initialization for network
namespace
- Add asserts for active refcount underflow
- Add asserts for initial namespace reference counts (both passive
and active)
- ipc: enable is_ns_init_id() assertions
- Fix kernel-doc comments for internal nstree functions
- Selftests
- 15 active reference count tests
- 9 listns() functionality tests
- 7 listns() permission tests
- 12 inactive namespace resurrection tests
- 3 threaded active reference count tests
- commit_creds() active reference tests
- Pagination and stress tests
- EFAULT handling test
- nsid tests fixes"
* tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits)
pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
nstree: fix kernel-doc comments for internal functions
nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
selftests/namespaces: fix nsid tests
ns: drop custom reference count initialization for initial namespaces
pid: rely on common reference count behavior
ns: add asserts for initial namespace active reference counts
ns: add asserts for initial namespace reference counts
ns: make all reference counts on initial namespace a nop
ipc: enable is_ns_init_id() assertions
fs: use boolean to indicate anonymous mount namespace
ns: rename is_initial_namespace()
ns: make is_initial_namespace() argument const
nstree: use guards for ns_tree_lock
nstree: simplify owner list iteration
nstree: switch to new structures
nstree: add helper to operate on struct ns_tree_{node,root}
nstree: move nstree types into separate header
nstree: decouple from ns_common header
ns: move namespace types into separate header
...
…ux/kernel/git/vfs/vfs
Pull pidfd and coredump updates from Christian Brauner:
"Features:
- Expose coredump signal via pidfd
Expose the signal that caused the coredump through the pidfd
interface. The recent changes to rework coredump handling to rely
on unix sockets are in the process of being used in systemd. The
previous systemd coredump container interface requires the coredump
file descriptor and basic information including the signal number
to be sent to the container. This means the signal number needs to
be available before sending the coredump to the container.
- Add supported_mask field to pidfd
Add a new supported_mask field to struct pidfd_info that indicates
which information fields are supported by the running kernel. This
allows userspace to detect feature availability without relying on
error codes or kernel version checks.
Cleanups:
- Drop struct pidfs_exit_info and prepare to drop exit_info pointer,
simplifying the internal publication mechanism for exit and
coredump information retrievable via the pidfd ioctl
- Use guard() for task_lock in pidfs
- Reduce wait_pidfd lock scope
- Add missing PIDFD_INFO_SIZE_VER1 constant
- Add missing BUILD_BUG_ON() assert on struct pidfd_info
Fixes:
- Fix PIDFD_INFO_COREDUMP handling
Selftests:
- Split out coredump socket tests and common helpers into separate
files for better organization
- Fix userspace coredump client detection issues
- Handle edge-triggered epoll correctly
- Ignore ENOSPC errors in tests
- Add debug logging to coredump socket tests, socket protocol tests,
and test helpers
- Add tests for PIDFD_INFO_COREDUMP_SIGNAL
- Add tests for supported_mask field
- Update pidfd header for selftests"
* tag 'vfs-6.19-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
pidfs: reduce wait_pidfd lock scope
selftests/coredump: add second PIDFD_INFO_COREDUMP_SIGNAL test
selftests/coredump: add first PIDFD_INFO_COREDUMP_SIGNAL test
selftests/coredump: ignore ENOSPC errors
selftests/coredump: add debug logging to coredump socket protocol tests
selftests/coredump: add debug logging to coredump socket tests
selftests/coredump: add debug logging to test helpers
selftests/coredump: handle edge-triggered epoll correctly
selftests/coredump: fix userspace coredump client detection
selftests/coredump: fix userspace client detection
selftests/coredump: split out coredump socket tests
selftests/coredump: split out common helpers
selftests/pidfd: add second supported_mask test
selftests/pidfd: add first supported_mask test
selftests/pidfd: update pidfd header
pidfs: expose coredump signal
pidfs: drop struct pidfs_exit_info
pidfs: prepare to drop exit_info pointer
pidfd: add a new supported_mask field
pidfs: add missing BUILD_BUG_ON() assert on struct pidfd_info
...
…kernel/git/vfs/vfs Pull folio updates from Christian Brauner: "Add a new folio_next_pos() helper function that returns the file position of the first byte after the current folio. This is a common operation in filesystems when needing to know the end of the current folio. The helper is lifted from btrfs which already had its own version, and is now used across multiple filesystems and subsystems: - btrfs - buffer - ext4 - f2fs - gfs2 - iomap - netfs - xfs - mm This fixes a long-standing bug in ocfs2 on 32-bit systems with files larger than 2GiB. Presumably this is not a common configuration, but the fix is backported anyway. The other filesystems did not have bugs, they were just mildly inefficient. This also introduce uoff_t as the unsigned version of loff_t. A recent commit inadvertently changed a comparison from being unsigned (on 64-bit systems) to being signed (which it had always been on 32-bit systems), leading to sporadic fstests failures. Generally file sizes are restricted to being a signed integer, but in places where -1 is passed to indicate "up to the end of the file", it is convenient to have an unsigned type to ensure comparisons are always unsigned regardless of architecture" * tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Add uoff_t mm: Use folio_next_pos() xfs: Use folio_next_pos() netfs: Use folio_next_pos() iomap: Use folio_next_pos() gfs2: Use folio_next_pos() f2fs: Use folio_next_pos() ext4: Use folio_next_pos() buffer: Use folio_next_pos() btrfs: Use folio_next_pos() filemap: Add folio_next_pos()
…x/kernel/git/vfs/vfs
Pull cred guard updates from Christian Brauner:
"This contains substantial credential infrastructure improvements
adding guard-based credential management that simplifies code and
eliminates manual reference counting in many subsystems.
Features:
- Kernel Credential Guards
Add with_kernel_creds() and scoped_with_kernel_creds() guards that
allow using the kernel credentials without allocating and copying
them. This was requested by Linus after seeing repeated
prepare_kernel_creds() calls that duplicate the kernel credentials
only to drop them again later.
The new guards completely avoid the allocation and never expose the
temporary variable to hold the kernel credentials anywhere in
callers.
- Generic Credential Guards
Add scoped_with_creds() guards for the common override_creds() and
revert_creds() pattern. This builds on earlier work that made
override_creds()/revert_creds() completely reference count free.
- Prepare Credential Guards
Add prepare credential guards for the more complex pattern of
preparing a new set of credentials and overriding the current
credentials with them:
- prepare_creds()
- modify new creds
- override_creds()
- revert_creds()
- put_cred()
Cleanups:
- Make init_cred static since it should not be directly accessed
- Add kernel_cred() helper to properly access the kernel credentials
- Fix scoped_class() macro that was introduced two cycles ago
- coredump: split out do_coredump() from vfs_coredump() for cleaner
credential handling
- coredump: move revert_cred() before coredump_cleanup()
- coredump: mark struct mm_struct as const
- coredump: pass struct linux_binfmt as const
- sev-dev: use guard for path"
* tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
trace: use override credential guard
trace: use prepare credential guard
coredump: use override credential guard
coredump: use prepare credential guard
coredump: split out do_coredump() from vfs_coredump()
coredump: mark struct mm_struct as const
coredump: pass struct linux_binfmt as const
coredump: move revert_cred() before coredump_cleanup()
sev-dev: use override credential guards
sev-dev: use prepare credential guard
sev-dev: use guard for path
cred: add prepare credential guard
net/dns_resolver: use credential guards in dns_query()
cgroup: use credential guards in cgroup_attach_permissions()
act: use credential guards in acct_write_process()
smb: use credential guards in cifs_get_spnego_key()
nfs: use credential guards in nfs_idmap_get_key()
nfs: use credential guards in nfs_local_call_write()
nfs: use credential guards in nfs_local_call_read()
erofs: use credential guards
...
…nux/kernel/git/vfs/vfs
Pull fs header updates from Christian Brauner:
"This contains initial work to start splitting up fs.h.
Begin the long-overdue work of splitting up the monolithic fs.h
header. The header has grown to over 3000 lines and includes types and
functions for many different subsystems, making it difficult to
navigate and causing excessive compilation dependencies.
This series introduces new focused headers for superblock-related
code:
- Rename fs_types.h to fs_dirent.h to better reflect its actual
content (directory entry types)
- Add fs/super_types.h containing superblock type definitions
- Add fs/super.h containing superblock function declarations
This is the first step in a longer effort to modularize the VFS
headers.
Cleanups:
- Inode Field Layout Optimization (Mateusz Guzik)
Move inode fields used during fast path lookup closer together to
improve cache locality during path resolution.
- current_umask() Optimization (Mateusz Guzik)
Inline current_umask() and move it to fs_struct.h. This improves
performance by avoiding function call overhead for this
frequently-used function, and places it in a more appropriate
header since it operates on fs_struct"
* tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: move inode fields used during fast path lookup closer together
fs: inline current_umask() and move it to fs_struct.h
fs: add fs/super.h header
fs: add fs/super_types.h header
fs: rename fs_types.h to fs_dirent.h
…/kernel/git/vfs/vfs Pull superblock lock guard updates from Christian Brauner: "This starts the work of introducing guards for superblock related locks. Introduce super_write_guard for scoped superblock write protection. This provides a guard-based alternative to the manual sb_start_write() and sb_end_write() pattern, allowing the compiler to automatically handle the cleanup" * tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: use super write guard in xfs_file_ioctl() open: use super write guard in do_ftruncate() btrfs: use super write guard in relocating_repair_kthread() ext4: use super write guard in write_mmp_block() btrfs: use super write guard in sb_start_write() btrfs: use super write guard btrfs_run_defrag_inode() btrfs: use super write guard in btrfs_reclaim_bgs_work() fs: add super_write_guard
…kernel/git/vfs/vfs
Pull minix fixes from Christian Brauner:
"Fix two syzbot corruption bugs in the minix filesystem.
Syzbot fuzzes filesystems by trying to mount and manipulate
deliberately corrupted images. This should not lead to BUG_ONs and
WARN_ONs for easy to detect corruptions.
- Add error handling to minix filesystem for inode corruption
detection, enabling the filesystem to report such corruptions
cleanly.
- Fix a drop_nlink warning in minix_rmdir() triggered by corrupted
directory link counts.
- Fix a drop_nlink warning in minix_rename() triggered by corrupted
inode link counts"
* tag 'vfs-6.19-rc1.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Fix a drop_nlink warning in minix_rename
Fix a drop_nlink warning in minix_rmdir
Add error handling to minix filesystem for inode corruption detection
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.