Hotrod is a kernel exploitation challenge created by my friend
FizzBuzz101 for CUCTF 2020. I tested the
challenge before it was released and since the exploitation process was very
interesting, I decided to write this article. In the next sections we
will see how to get a root shell by exploiting a UAF, using a single
timerfd_ctx structure and userfaultfd. Let’s get started!
Before touching the kernel module, we need to better understand the system itself, gathering as much information as possible. The golden rule is always the same: system analysis first, then code analysis.
We can start by inspecting the QEMU launch script:
#!/bin/sh
qemu-system-x86_64 \
    -s \
    -m 64M \
    -nographic \
    -kernel "./bzImage" \
    -append "console=ttyS0 quiet loglevel=3 oops=panic panic=-1 pti=on kaslr nosmap min_addr=4096" \
    -no-reboot \
    -cpu qemu64,+smep \
    -monitor /dev/null \
    -initrd "./initramfs.cpio" \
    -smp 2 \
    -smp cores=2 \
    -smp threads=1
We can immediately see that various protections are enabled.
– SMEP (Supervisor Mode Execution Prevention):
With SMEP, the CPU will generate a fault if the kernel tries to execute instructions located in user space. This means that even if we are able to control the kernel instruction pointer, we cannot simply map an executable memory region in user space, place our code there and execute it.
Luckily, since SMAP (Supervisor Mode Access Prevention) is disabled, the kernel can still read and write user-space data, so we can build our ROP chain in user space, perform stack pivoting and start executing gadgets. Another approach is to clear bit 20 of the CR4 register (the SMEP enable bit) and then execute code in user space directly.
– KASLR (Kernel Address Space Layout Randomization):
With this option, every time the system boots up, the location of kernel code in memory is randomized with an entropy of 9 bits. This means that if we want to build a ROP-chain, we will probably need to leak pointers to compute the current kernel base address.
– KPTI (Kernel Page Table Isolation):
KPTI was introduced in the Linux kernel in response to the Meltdown security vulnerability. It helps prevent information leaks by separating user space and kernel space page tables. By flipping bit 12 of the CR3 register, the system can switch between two sets of page tables. When the system runs in kernel mode, it uses the first set, so it can access both kernel and user address space (the latter for things like copy_to_user). In addition, the NX flag is set in the top level of the user portion of the kernel page tables, so any missed kernel-to-user CR3 switch will cause a crash. When the system runs in user mode, it uses the second set, which maps only a copy of the user address space and a limited portion of the kernel address space: the code needed for system calls and the IDT.
uname -a
Linux (none) 5.8.3 #12 Sun August 26 12:00:00 UTC 2020 x86_64 GNU/Linux
The kernel version was very recent at the time, and I couldn’t find any known vulnerability to perform privilege escalation.
The QEMU monitor is set to /dev/null: we cannot use the monitor console to interact with QEMU and control the guest OS.
We can proceed by extracting symbol addresses and ROP gadgets. Since the System.map file is not provided and we cannot read kernel symbol addresses as an unprivileged user, we need to extract the filesystem and modify the init file to get root privileges.
The filesystem is packed as a cpio archive. We can extract it and replace the uid/gid using the following commands:
mkdir fs && cd fs && cpio -idv < ../initramfs.cpio          # Extract the archive
sed -i 's/setuidgid 1000 sh/setuidgid 0 sh/g' init          # Replace the user uid/gid
find . | cpio --create --format='newc' > ../initramfs.cpio  # Rebuild the archive
The next step is to disable KASLR. We can do this by modifying the kernel command line options:
sed -i 's/kaslr/nokaslr/g' run_challenge.sh
Finally, we can read the symbol addresses from the running system as root.
I also created a small helper script:
#!/bin/sh
HOTROD=$(cat /proc/kallsyms | grep hotrod_ioctl | cut -d " " -f1)
echo "[*] Module base: 0x$HOTROD"
Then I added /bin/sh /bin/info to the init file, so every time the system boots up I can get the module address.
Now we can re-enable KASLR, restore the uid/gid to 1000 and rebuild the cpio archive.
To obtain ROP gadgets, we need to extract the kernel image. We can do this using binwalk to locate the vmlinuz file (the compressed kernel image) inside bzImage:
binwalk bzImage

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
15109         0x3B05          gzip compressed data, maximum compression, from Unix, last modified: 1970-01-01 00:00:00
Then we can use dd to extract it and gunzip to decompress the archive:
dd if=./bzImage bs=1 skip=15109 of=vmlinux.gz && gunzip vmlinux.gz
Now we can finally extract ROP gadgets using ropper or ROPGadget:
ropper --file ./vmlinux --nocolor > gadgets
PS: We could also have used the extract-vmlinux script to extract the kernel image.
file hotrod.ko
hotrod.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=6bcf4da490ac3e3ab5db8148eb08238250716d32, with debug_info, not stripped
I won’t cover the reverse engineering process (from asm to pseudo C) in detail to avoid unnecessarily lengthening the article. Here is my high-level representation of the module after static analysis:
As we can see, the init_hotrod() function is using misc_register(), an interface that allows kernel modules to register a misc device.
In Linux, each device is identified by a major and a minor number. The major number is used by the kernel to identify the driver associated with the device. The misc driver is identified by the major number 10. The minor number depends on the device and it’s used by the driver to differentiate among various devices.
misc_register() takes a miscdevice structure as its argument, in this case hotrod_dev.
We can see that three fields are set in hotrod_dev:
- minor: In this case it is set to 255, which corresponds to MISC_DYNAMIC_MINOR. As we can see from kernel.org:
If the minor number is set to MISC_DYNAMIC_MINOR a minor number is assigned and placed in the minor field of the structure.
- name: It leaves no room for imagination…
- fops: A pointer to the file_operations structure. This structure exposes the interfaces users need to interact with the device. In this case unlocked_ioctl() is the only exposed function, which we will cover shortly.
After initialization, the device node is created in /dev. Now we can perform I/O operations on it:
ls -la /dev | grep hotrod
crw-rw-rw- 1 root root 10, 63 Oct 19 09:11 hotrod
The hotrod_ioctl() function allows us to perform four different operations: Alloc, Free, Show and Edit. But remember, we can perform each action only once!
The allocation size is limited to between 0xe0 and 0xf0 bytes. To understand what this means, let’s briefly introduce the Slab allocator.
The Slab allocator is used by the Linux kernel to group objects of the same size into caches. Each cache consists of one or more slabs, and each slab is composed of one or more contiguous page frames. Each slab stores a certain number of objects.
There are two classes of caches:
- General purpose caches: They are called kmalloc-N, where N is usually a power of two: kmalloc-256 and so on.
- Specialized caches: Used for common objects: vm_area_struct and so on.
/proc/slabinfo can be used by privileged users to get information about slabs.
The Slab allocator uses a LIFO scheme to perform allocations and deallocations. The kernel will keep track of freed objects in per-cpu freelists and will serve them when a new allocation of the same size takes place.
Please note that this is just a very basic overview of the Slab allocator; for a detailed explanation check the articles in References.
At this point we know that allocations in the limited range 0xe0-0xf0 bytes will end up in kmalloc-256. Let’s proceed by analyzing the module source code to spot the bug.
Historically, the ioctl() implementation ran under the Big Kernel Lock. From The new way of ioctl():
ioctl() is one of the remaining parts of the kernel which runs under the Big Kernel Lock (BKL). In the past, the usage of the BKL has made it possible for long-running ioctl() methods to create long latencies for unrelated processes.
This was very inefficient in SMP environments, since ioctl operations were serialized by a single global lock. Therefore, two new functions were introduced: unlocked_ioctl() and compat_ioctl(). In kernel version 2.6.36, the old ioctl implementation was completely removed, as we can see from the commit kill .ioctl file_operation.
From The new way of ioctl() we can also understand the difference between the old and the new implementation:
If a driver or filesystem provides an unlocked_ioctl() method, it will be called in preference to the older ioctl(). The differences are that the inode argument is not provided (it's available as filp->f_dentry->d_inode) and the BKL is not taken prior to the call. All new code should be written with its own locking, and should use unlocked_ioctl().
This is what we were looking for. Hotrod is using unlocked_ioctl() but it does not implement its own locking!
This means that we can use multiple threads to cause a race condition that will result in a Use-After-Free. This, later on, will allow us to get RIP control.
We can formulate the exploitation plan as follows:
We can use Alloc to get an allocation in kmalloc-256 and find a way to get a leak using Show.
Then we can use Edit to modify the allocated object in kernel space. At the same time, with another thread, we can use Free to release the structure and allocate a victim object in the same location (because of freelists LIFO behavior).
At this point the Edit operation will end up overwriting function pointers in the victim object and we will be able to hijack control flow.
To succeed in our plan, we need two elements:
- An object allocable in kmalloc-256 that can be used both to get an information leak and to hijack control flow.
- A way to make the race condition reliable: we can perform each operation only once, so we need to find a way to maximize the success rate of the race condition.
The Victim Object - timerfd_ctx
As we can see from this useful article, an interesting structure allocated in the kmalloc-256 cache is timerfd_ctx.
The structures in the union are, respectively, an hrtimer structure:
And an alarm structure:
This object is allocated when a timer instance is created by timerfd_create().
timerfd_ctx is a good candidate since it can be used to leak kernel function pointers (to bypass KASLR), kernel heap addresses and to hijack control flow.
It is important to note that the structure is freed using kfree_rcu(), which deallocates the object only after a grace period, to ensure it is no longer used by any thread. We can use sleep(1) after closing the timerfd file descriptor to make sure the object has actually been freed, then we can use Alloc and Show to get an information leak.
To hijack control flow, we can overwrite the function pointer stored in the structure: this function is automatically called by the kernel when the corresponding timer expires.
Here is a timerfd_ctx in memory. We are interested in the marked fields:
0xffff888000297900: 0xffff888000297900 [1] 0x0000000000000000
0xffff888000297910: 0x0000000000000000 0x00000002ef81037a [2]
0xffff888000297920: 0x00000002ef81037a [3] 0xffffffff81102a00 [4]
0xffff888000297930: 0xffffffff8183e080 0x0000000000000000
0xffff888000297940: 0x0000000000000000 0x0000000000000000
0xffff888000297950: 0x0000000000000000 0x0000000000000000
0xffff888000297960: 0x0000000000000000 0x0000000000000000
0xffff888000297970: 0x0000000000000000 0x0000000000000000
0xffff888000297980: 0xbdbbd3bf6c2a6d81 0xffff888000297988
0xffff888000297990: 0xffff888000297988 0x0000000000000000
0xffff8880002979a0: 0x0000000000000000 0xffff88800013eb00
0xffff8880002979b0: 0x00000000000000a8 0x0000000000000000
0xffff8880002979c0: 0x0000000000000000 0x0000000000000000
0xffff8880002979d0: 0x0000000000000000 0x0000000000000000
0xffff8880002979e0: 0x0000000000000000 0x0000000000000000

[1] Address of current timer node (basically the chunk address)
[2-3] ktime_t, expiry time of the hrtimer
[4] timerfd_tmrproc() function pointer
Optimizing The Race Condition With Userfaultfd
At this point we have the victim object and an exploitation strategy. Now we need to find a way to make the race condition reliable. We will do it by taking advantage of a feature of the Linux kernel: userfaultfd.
userfaultfd allows unprivileged* user space processes to handle page faults and
perform other memory management tasks, for example it can be used to measure
But this feature also has a dark side: it can be used to suspend kernel threads.
An attacker can start monitoring a specific memory range, let’s say a page of memory, waiting for page faults. When the kernel tries to access that page, for example with copy_from_user(), it will cause a page fault and control will be transferred to the page fault handler in user space.
This will give the attacker the ability to suspend the kernel thread for an arbitrary amount of time and reliably exploit possible race conditions.
*Since kernel 5.11, userfaultfd is no longer usable by unprivileged users by default. FUSE is a good alternative.
Now that we have all the pieces of the puzzle, we can reformulate our plan as follows:
To get a memory leak, we can allocate a timerfd_ctx structure using timerfd_create(), then we can free the object by closing the associated file descriptor. At this point, we can get an allocation at the same location using Alloc and leak the timerfd_tmrproc() address using Show.
To control the kernel instruction pointer, let’s see what happens when we use Edit:
The user request is copied to kernel space using copy_from_user(). Then, after a size check, req.size bytes are copied from req.content to the previously allocated memory region using a second copy_from_user() call.
This means that if we map a memory region, let’s say a page of memory, and we use it as req.content, userfaultfd can be used to handle the page fault and suspend the kernel thread in the middle of the copy operation.
We first map a page of memory:
Then, with another thread, we start monitoring the mapped region using userfaultfd, waiting for a page fault. Now we trigger the copy operation using Edit from the main thread: this will cause a page fault in the kernel while it copies from req.content.
At this point control will be transferred to the page fault handler thread and we will be able to suspend the faulting thread. Now, still from the page fault handler thread, we can:
- Use Free to deallocate the object.
- Allocate a timerfd_ctx structure at the same location (because of freelists LIFO behavior).
- Release the faulting thread: the copy operation will overwrite the victim object.
The whole process can be visualized with the following diagram:
create_timer() -> do_alloc() -> do_show()                   [Leak]
pthread_create() -> userfaultfd() -> ... polling ...
do_edit() -> X PAGE FAULT
do_free() -> create_timer() -> ioctl_ufd() -> X RELEASE     [Handle PF]
Edit complete (UAF)
The Exploit - Controlling RIP
We can start writing the helper functions to interact with the device.
A timer can be created using:
Now we need userfaultfd:
And the page fault handler:
With this first POC we should be able to overwrite the kernel RIP with a bunch of “A"s.
Here’s the complete POC code: poc.c
We can compile and run the exploit:
gcc -o poc poc.c -static -s -lpthread
As expected, by overwriting the function pointer we get RIP control when the timer fires. Now we need to create a ROP chain and perform stack pivoting.
It’s time to use GDB: we are interested in the CPU context when
timerfd_tmrproc() is called. We need one of the registers to contain a pointer to a controllable location: here we will place our fake stack address.
Let’s comment out the following line in our poc:
Now we can attach GDB to the kernel and set a breakpoint on timerfd_tmrproc().
When the timer expires,
timerfd_tmrproc() is executed and the breakpoint is hit.
From the CPU context, we can see that RDI contains 0xffff88800029bc00, the address of the timerfd_ctx structure.
In red, we can see the address of the structure itself, in green
Since RDI points to the timerfd_ctx structure, and we have full control over its first field, we can place our fake stack address here (remember that the fake stack can be mapped in user space because SMAP is not enabled):
Then, the following gadget can be used to perform stack pivoting:
0xffffffff81027b86: mov esp, dword ptr [rdi]; lea rax, [rax + rsi*8]; ret;
We can make the following changes in the
And execute the exploit again:
Success! After stack pivoting, we have overwritten RIP with a bunch of “B"s, and RSP now points into our fake stack. It’s time to finalize our ROP-chain!
The Exploit - From “B"s To Root Shell
As a first attempt, let’s try to read the flag. We can extend the ROP chain replicating the effect of commit_creds(prepare_kernel_cred(0)).
In Linux every task has its own set of credentials, defined by the cred structure, which specifies the security context of the task.
prepare_kernel_cred() will allocate a new set of credentials with uid, gid, etc. set to 0, and commit_creds() will apply it to the current task. This way we will be able to get root privileges.
Then we need to flip bit 12 of the CR3 register (remember that KPTI is enabled), use swapgs to swap GS back to the user GS saved in an MSR, and then use iretq or sysretq to return to user space.
We can do it in different ways. For example, we can use the symbol swapgs_restore_regs_and_return_to_usermode. In GDB we can see that swapgs_restore_regs_and_return_to_usermode + 0x16 is a perfect gadget for us:
We can find similar instructions when the system returns to user space after a syscall:
The only difference is that the first gadget uses iretq, the second one sysretq.
iretq expects the following stack layout when returning to user space:
+-------------------+
|        RIP        |
+-------------------+
|        CS         |
+-------------------+
|      RFLAGS       |
+-------------------+
|        RSP        |
+-------------------+
|        SS         |
+-------------------+
sysretq takes the user space RIP from RCX and RFLAGS from R11 instead.
We can save the current processor state using:
And read the flag with:
I chose the second gadget to return to user space. Here is what the ROP chain looks like:
Let’s test the modified exploit:
The kernel crashed, but we successfully read the flag. Cool, but it’s not enough. We want a shell!
Initially I tried to replace the
read_flag() function with:
Unfortunately, I could not get it to work, but after spending some time experimenting, I found an alternative approach to get a shell.
First, we need to fix the timerfd_ctx structure we corrupted in the previous steps. I replaced the first address with the timerfd_ctx address and the sixth address (now overwritten by the stack pivot gadget) with:
So when the function pointer is called again, the call will simply return.
Even after these changes, I could not use execve/execveat etc so I opted for a different strategy.
In Linux, when a user executes a program with an unknown program header, the kernel tries to load a matching binary format module, which in turn runs the user-space helper specified by modprobe_path, set by default to /sbin/modprobe.
If we overwrite modprobe_path with the location of a malicious program, for example /home/user/x, every time a file with an unknown program header is executed, the system will run our script with root privileges.
We can use the following function to automatically create a file with an
unknown program header,
/home/user/asd, and a
script that will add a new user
Now we can modify the ROP chain to overwrite
modprobe_path using a write-what-where gadget:
Now we only need to find a way to prevent the kernel from crashing after returning to user space. Surprisingly I was able to restore execution with this function:
This way, after hijacking modprobe_path, our exploit will successfully exit and we will be able to execute the file with the unknown program header to force the kernel to execute our malicious script:
I also tried to trap the thread with
int3 and it worked too.
It is worth noting that to maximize the exploit success rate, we need
perfect timing. I found the right compromise using
ioctl_userfaultfd(). Why is it needed? Well, no clue, I still have to dig deeply.
This will be our final exploit: exploit.c, utils.h
Finally, we can enjoy our root shell!
The challenge can be found here.
- https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4.html (p. 2880)
- https://www.cs.hs-rm.de/~kaiser/events/wamos2018/Slides/mueller.pdf (n. 15)