In recent years, with the rise of new kernel mitigations - from CFI to Google's SLAB_VIRTUAL to randomized slab caches - there has been a shift from slab-level object exploitation to page-level exploitation. All these new attacks have something in common: they rely on how the Linux kernel manages pages of memory. In this article, I decided to organize some of my sparse notes on this topic.
Overview
Behind the scenes, the Linux page allocator uses the buddy memory allocation technique to manage blocks of memory in power-of-two sizes. The base block (the smallest portion of contiguous memory) consists of a single 4KB page; larger blocks are this base unit multiplied by powers of two.
The exponent is referred to as the order. There are NR_PAGE_ORDERS (11) possible orders, ranging from 0 to MAX_ORDER (10). Consequently, the largest portion of contiguous memory managed by the buddy system corresponds to 4KB * 2^10, or 4MB, which is equivalent to 1024 contiguous 4KB pages.
The page order is stored in the page->private field of the first page in the block (see set_buddy_order() -> set_page_private()). The entire block is itself considered a page of order N.
When a page of a given order is requested but not available, the kernel starts looking for higher-order pages until one is found. At that point, the page is split in half, creating two lower-order blocks (two "buddies"). The process is repeated until a page of the requested size is obtained.
The inverse happens when a page is freed and its buddy is also free: the two are coalesced into a higher-order block. The coalescing process is repeated until order MAX_ORDER is reached or there is no free buddy left to merge.
The purpose of the buddy system is to maintain the largest possible portions of contiguous memory. To further minimize external fragmentation, the kernel also differentiates between pages based on their mobility.
Pages with identical mobility are grouped together. The migration type is determined by the GFP flags used when requesting a new page: for instance, user-space allocations set the GFP_HIGHUSER_MOVABLE flag (see do_cow_fault() for example), whereas some slab caches are created with the SLAB_RECLAIM_ACCOUNT flag (see inode_init()), and so on.
Migration types are defined in the migratetype enum (a simplified version of the enum itself is shown right after this list):
MIGRATE_UNMOVABLE: Pages of this type cannot be moved. This group comprises slabs, page tables, etc.
One of the reasons why these pages cannot be moved can be seen using the QEMU monitor: the info tlb command shows the virtual-to-physical address mapping.
For example, here is the area where kmalloc slabs are allocated:
We have a 1:1 linear mapping. The virtual address is calculated as the physical address plus a constant (page_offset_base in this case).
Therefore, moving data between these physical page frames would also require updating the corresponding virtual addresses, which would cause invalid memory accesses in all the processes that still refer to the old virtual address.
For user-space allocations, this does not happen, as when a page is migrated, the physical address changes, but the virtual address remains the same. Only the corresponding page table entry is modified.
MIGRATE_MOVABLE: These pages can be migrated. This type is mainly used for user-space allocations. These pages are easy to move as they are accessed via page tables, therefore, when the physical address is modified, it is only necessary to update the corresponding PTE.
MIGRATE_RECLAIMABLE: These pages cannot be moved, but they can be reclaimed under memory pressure. Their contents can be restored from disk later.
MIGRATE_HIGHATOMIC: Pages with this migration type are reserved for high-order atomic allocations.
MIGRATE_CMA: This migration type is used by pages managed by the CMA allocator. Only movable pages can be allocated from pageblocks with this type.
MIGRATE_ISOLATE: Type temporarily used during the migration of movable pages.
MIGRATE_PCPTYPES: Not a real migration type, but a marker counting the migration types handled by the per-CPU pagesets; we will analyze it in more detail later.
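For reference, here is what the enum looks like (slightly simplified) in include/linux/mmzone.h; note how MIGRATE_PCPTYPES doubles as the number of types kept on the per-CPU lists:

enum migratetype {
    MIGRATE_UNMOVABLE,
    MIGRATE_MOVABLE,
    MIGRATE_RECLAIMABLE,
    MIGRATE_PCPTYPES,   /* the number of types on the pcp lists */
    MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
    MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
    MIGRATE_ISOLATE,    /* can't allocate from here */
#endif
    MIGRATE_TYPES
};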
When a page of a given order is requested, the allocator first attempts to allocate it from pageblocks with the requested migration type. If this is not possible, pages are "stolen" from pageblocks with a different migration type, starting from the largest page blocks to avoid memory fragmentation.
Notice how, from an attacker's perspective, this also makes page mixing between user-space (movable) pages and slab (unmovable) pages possible. For example, when unmovable pages are exhausted, pages can be stolen from pageblocks with the movable migration type.
Before digging into how pages are organized in freelists, we need to take a step back and focus for a moment on how the RAM is divided into different zones.
Nodes, Zones and Free Lists
On NUMA machines, the memory layout of each NUMA node is described by the pglist_data structure:
typedef struct pglist_data {
    /*
     * node_zones contains just the zones for THIS node. Not all of the
     * zones may be populated, but it is the full list. It is referenced by
     * this node's node_zonelists as well as other node's node_zonelists.
     */
    struct zone node_zones[MAX_NR_ZONES];

    /*
     * node_zonelists contains references to all zones in all nodes.
     * Generally the first zones will be references to this node's
     * node_zones.
     */
    struct zonelist node_zonelists[MAX_ZONELISTS];

    // ...
}
The memory of a node is divided into multiple zones, each zone describing a segment of memory:
Zone DMA: Corresponds to the lower 16MB of memory. It primarily exists for backward compatibility with some old ISA-devices.
Zone DMA32: This zone only exists on 64-bit systems and covers approximately the lower 4GB of RAM. There is a class of hardware that can only DMA in this zone.
Zone Normal: On 32-bit systems, it corresponds to all RAM from 16MB to 896MB. On 64-bit systems, it corresponds to all RAM from approximately 4GB and above. This zone can be very small for a 64-bit system with 4GB of RAM, and it is disabled if the system has less than 4GB of RAM.
Zone HighMem: 32-bit only. This zone includes all the RAM above 896MB.
Zone Movable: Similar to Zone Normal, but most of the pages in this zone are movable.
Zone Device: Represents memory residing on devices (e.g., GPU). It is only enabled if CONFIG_ZONE_DEVICE=y.
Upon page allocation, the kernel specifies a "preferred" zone. If this zone doesn't have enough free memory, other zones are used.
For example, the Normal zone on a 64-bit machine with 4GB of RAM can fill up very quickly, so subsequent allocations will be performed in the DMA32 zone, even if the preferred zone is Normal.
With all this in mind, it is possible to analyze the information about the status of memory from the /proc interface.
Here is the output of /proc/buddyinfo for a QEMU VM with 8GB RAM, right after boot. Lower order freelists are empty. Most of the blocks are order 10 pages, indicating minimal memory fragmentation.
Node 0, zone      DMA      0      0      0      0      0      0      0      0      0      1      3
Node 0, zone    DMA32      2      1      0      0      1      2      3      3      2      2    730
Node 0, zone   Normal      0      0      1      1      0      2      2      1      2      1   1218
Just for comparison, here is the /proc/buddyinfo output for a 64-bit machine with 32GB RAM after prolonged usage. There are many free pages at lower orders, indicating that compaction was not possible. This is a warning sign, as it suggests a very high level of memory fragmentation:
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2
Node 0, zone    DMA32   2332   2482   2078   1636   1255    863    480    193     61     17    116
Node 0, zone   Normal  63676  65786  29045   9387   3127   1182    611    482    306     48    162
Back to the 8GB VM, a more detailed view can be obtained using /proc/pagetypeinfo. In this case, it is also possible to notice how the free pages are distributed across multiple migration types:
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable      1      0      0      0      0      0      1      1      0      1      0
Node    0, zone    DMA32, type      Movable      1      1      0      0      1      2      2      2      2      1    730
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable      0      0      1      0      0      1      1      0      0      1      0
Node    0, zone   Normal, type      Movable      0      0      0      0      0      0      1      1      1      0   1218
Node    0, zone   Normal, type  Reclaimable      0      0      0      1      0      1      0      0      1      0      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA             1            7            0            0            0            0
Node 0, zone    DMA32             2         1526            0            0            0            0
Node 0, zone   Normal            95         2461            4            0            0            0
Each memory zone is described by a zone structure.
Each zone manages free pages in two separate areas:
Per-CPU area: This area is described by the per_cpu_pages structure and initialized by setup_zone_pageset(). It is utilized as a sort of page cache for the CPU to quickly satisfy requests of order <= PAGE_ALLOC_COSTLY_ORDER (3).
Buddy free area: This is the heart of the Buddy allocator. If the per-CPU lists cannot satisfy a page request, or if the requested order is greater than PAGE_ALLOC_COSTLY_ORDER (3), then pages in this area are utilized.
In the per_cpu_pages structure, free pages are organized into multiple lists stored in the lists[] array:
struct per_cpu_pages {
    spinlock_t lock;    /* Protects lists field */
    int count;          /* number of pages in the list */
    int high;           /* high watermark, emptying needed */
    int batch;          /* chunk size for buddy add/remove */

    // ...

    struct list_head lists[NR_PCP_LISTS];
}
The array can contain up to NR_PCP_LISTS lists. This constant is defined as follows:
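// Lightly simplified, from include/linux/mmzone.h: three migration types per
// order up to PAGE_ALLOC_COSTLY_ORDER, plus one extra list when THP is enabled
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define NR_PCP_THP 1
#else
#define NR_PCP_THP 0
#endif
#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)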
Therefore, if CONFIG_TRANSPARENT_HUGEPAGE is not enabled, we can have 12 freelists in the PCP free area. To derive the index in the array from the specified migration type and order, the order_to_pindex() function is used:
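// Slightly simplified from mm/page_alloc.c: when CONFIG_TRANSPARENT_HUGEPAGE is
// enabled, the extra list (index NR_LOWORDER_PCP_LISTS) is reserved for THP-order pages
static inline unsigned int order_to_pindex(int migratetype, int order)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
    if (order > PAGE_ALLOC_COSTLY_ORDER)
        return NR_LOWORDER_PCP_LISTS;
#endif
    return (MIGRATE_PCPTYPES * order) + migratetype;
}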
This may initially appear a bit confusing (at least it was for me), but it is really straightforward. Since lists of pages are stored in a single array, they are organized based on both page order and migration type, with MIGRATE_PCPTYPES (3) migration types per order.
This means, for example, that if a MIGRATE_UNMOVABLE (0) order 1 page is requested, order_to_pindex(/*migratetype*/ 0, /*order*/ 1) will return (MIGRATE_PCPTYPES * 1) + MIGRATE_UNMOVABLE, or (3 * 1) + 0 = 3. As you can see below, this index corresponds to the requested list of pages.
zone->per_cpu_pageset->lists[0]   // Order 0, MIGRATE_UNMOVABLE
zone->per_cpu_pageset->lists[1]   // Order 0, MIGRATE_MOVABLE
zone->per_cpu_pageset->lists[2]   // Order 0, MIGRATE_RECLAIMABLE
zone->per_cpu_pageset->lists[3]   // Order 1, MIGRATE_UNMOVABLE <--- idx 3
zone->per_cpu_pageset->lists[4]   // Order 1, MIGRATE_MOVABLE
zone->per_cpu_pageset->lists[5]   // Order 1, MIGRATE_RECLAIMABLE
...
Moving to the Buddy allocator’s free_area array, it can hold NR_PAGE_ORDERS different free areas. Here, the page order is used as an index in the array.
Each free area is described by the free_area structure and contains an array of freelists indexed by migration type:
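// From include/linux/mmzone.h
struct free_area {
    struct list_head    free_list[MIGRATE_TYPES];
    unsigned long       nr_free;
};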
So, to summarize, the free_area array is indexed by order:
zone->free_area[0]                   // Order 0
zone->free_area[1]                   // Order 1
zone->free_area[2]                   // Order 2
...
zone->free_area[NR_PAGE_ORDERS - 1]  // Order 10
and for each order, the free lists are indexed by migration type:
// Order 0
zone->free_area[0].free_list[MIGRATE_UNMOVABLE]
zone->free_area[0].free_list[MIGRATE_MOVABLE]
zone->free_area[0].free_list[MIGRATE_RECLAIMABLE]
...

// Order 1
zone->free_area[1].free_list[MIGRATE_UNMOVABLE]
zone->free_area[1].free_list[MIGRATE_MOVABLE]
zone->free_area[1].free_list[MIGRATE_RECLAIMABLE]
...

// ...
The reason why this approach with nested arrays is not used by PCP lists remains obscure to me.
All this can be visualized by the following diagram:
Source Code Analysis
The following analysis is conducted on Linux 6.6.84.
I think the best way to approach this quick code analysis is to start with a practical example. Let's create a scenario where a user (Bob, of course) executes the following command on his server to create a named pipe, open it, and read from it: mkfifo /tmp/fifo && cat /tmp/fifo.
In kernel-space, fifo_open() is called. This function attempts to allocate a pipe_inode_info structure via alloc_pipe_info(). However, in our scenario, all kmalloc-cg-192 slabs are full, including CPU-partial slabs. The kernel needs to allocate a new slab.
According to /proc/slabinfo, kmalloc-cg-192 slabs require one order 0 page, so __alloc_pages() will request a page of this order.
The following diagram illustrates the chain of calls the kernel utilizes to allocate a kmalloc-cg-192 slab in a situation where all the slabs are full. Let’s go down the rabbit hole:
alloc_pipe_info()
  kzalloc()
  ...
  // All slabs full, allocate a new one
  __kmem_cache_alloc_node()
  ...
  __slab_alloc()                 // Request order 0 page for kmalloc-cg-192
    allocate_slab()
      alloc_slab_page()
        ...
        __alloc_pages()          // Prepare the allocation context and attempt to get a page
                                 // from the freelists, else fall back to the slowpath
                                 // (__alloc_pages_slowpath())
          get_page_from_freelist()     // Try to get a page from the freelists of the preferred zone
            rmqueue()                  // Order <= 3: use per-CPU lists, else fall back to buddy
              rmqueue_pcplist()        // Get the PCP list by order and migration type
                __rmqueue_pcplist()    // PCP empty? Fall back to buddy, else return page
                  rmqueue_bulk()       // Request multiple order 0 pages to populate the PCP list
                    __rmqueue()        // Try to find a page for the designated order and migration type,
                                       // else steal pages from other migration types (steal_suitable_fallback())
                      __rmqueue_smallest()          // Iterate over the 11 page orders, from 0 to 10, to find a free page
                        get_page_from_free_area()   // Order N page found
                        del_page_from_free_list()   // Unlink the page from the original freelist
                        expand()                    // Split into lower order pages if needed until order is 0
                          __add_to_free_list()      // Add to new list
                          set_buddy_order()         // Update page order
Our analysis begins with __alloc_pages(). Prior to this function, all the calls are still within the SLUB allocator’s “domain”. For a deep dive into the SLUB allocator I strongly recommend this recent presentation by Andrey Konovalov: SLUB Internals for Exploit Developers.
When __alloc_pages() is called, it starts by preparing the allocation context, based on the requested order, the preferred node ID and the provided GFP flags. Once the allocation context is ready,
the function first attempts to get a free page using get_page_from_freelist(). If the call fails, it falls back to the slowpath, calling __alloc_pages_slowpath():
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
                            nodemask_t *nodemask)
{
    struct page *page;
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
    gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */
    struct alloc_context ac = { };

    if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
        return NULL;

    gfp &= gfp_allowed_mask;
    gfp = current_gfp_context(gfp);
    alloc_gfp = gfp;
    if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
            &alloc_gfp, &alloc_flags))
        return NULL;

    alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);

    page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
    if (likely(page))
        goto out;

    // ...

    page = __alloc_pages_slowpath(alloc_gfp, order, &ac);

    // ...

out:
    return page;
}
get_page_from_freelist() scans the zonelist and looks for a zone with enough free pages. The allocation context prepared by __alloc_pages() here is used to specify the preferred zone (via ac->preferred_zone). Once a suitable zone is found, rmqueue() is called to actually get the free page for the requested order and migration type:
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                        const struct alloc_context *ac)
{
    struct zoneref *z;
    struct zone *zone;
    struct pglist_data *last_pgdat = NULL;
    bool last_pgdat_dirty_ok = false;
    bool no_fallback;

retry:
    /*
     * Scan zonelist, looking for a zone with enough free.
     * See also cpuset_node_allowed() comment in kernel/cgroup/cpuset.c.
     */
    no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
    z = ac->preferred_zoneref;
    for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
                    ac->nodemask) {

        // ... scan the zonelist ...

try_this_zone:
        page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
                gfp_mask, alloc_flags, ac->migratetype);
        // ...
    }

    if (no_fallback) {
        alloc_flags &= ~ALLOC_NOFRAGMENT;
        goto retry;
    }

    return NULL;
}
rmqueue() checks if the requested order is less than or equal to PAGE_ALLOC_COSTLY_ORDER (3). If so, it attempts to use the PCP lists. If the order is greater than PAGE_ALLOC_COSTLY_ORDER, then PCP lists are bypassed and there is a direct fallback to the Buddy allocator. In our case the requested order is 0, so the kernel attempts to use PCP lists first:
__no_sanitize_memory
static inline
struct page *rmqueue(struct zone *preferred_zone,
            struct zone *zone, unsigned int order,
            gfp_t gfp_flags, unsigned int alloc_flags,
            int migratetype)
{
    struct page *page;

    // If order <= PAGE_ALLOC_COSTLY_ORDER (3)
    if (likely(pcp_allowed_order(order))) {
        page = rmqueue_pcplist(preferred_zone, zone, order,
                               migratetype, alloc_flags);
        if (likely(page))
            goto out;
    }

    page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
                            migratetype);

out:
    // ...
    return page;
}
rmqueue_pcplist() retrieves the appropriate PCP list, using order_to_pindex(migratetype, order) to derive the index in the lists[] array from the provided migration type and order. The retrieved list is then passed as an argument to __rmqueue_pcplist():
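// Simplified sketch (PCP locking helpers and vmstat updates trimmed),
// not the verbatim 6.6 source
static struct page *rmqueue_pcplist(struct zone *preferred_zone,
            struct zone *zone, unsigned int order,
            int migratetype, unsigned int alloc_flags)
{
    struct per_cpu_pages *pcp;
    struct list_head *list;
    struct page *page;

    // Lock this CPU's pageset for the zone
    pcp = pcp_spin_trylock(zone->per_cpu_pageset);
    if (!pcp)
        return NULL;

    // Pick the PCP list matching the requested order and migration type
    list = &pcp->lists[order_to_pindex(migratetype, order)];
    page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
    pcp_spin_unlock(pcp);

    return page;
}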
__rmqueue_pcplist() first checks whether a page of the desired order is available in the PCP list. If not, it falls back to the Buddy allocator, using rmqueue_bulk() to allocate a batch of pages, whose size is based on the batch value stored in the per_cpu_pages structure:
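// Simplified sketch, not the verbatim 6.6 source (batch scaling and page
// sanity checks omitted)
static inline
struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
            int migratetype, unsigned int alloc_flags,
            struct per_cpu_pages *pcp, struct list_head *list)
{
    struct page *page;

    if (list_empty(list)) {
        int batch = READ_ONCE(pcp->batch);
        int alloced;

        // PCP list is empty: refill it from the buddy freelists
        alloced = rmqueue_bulk(zone, order, batch, list,
                               migratetype, alloc_flags);
        pcp->count += alloced << order;
        if (list_empty(list))
            return NULL;
    }

    // Pop the first page from the per-CPU list
    page = list_first_entry(list, struct page, pcp_list);
    list_del(&page->pcp_list);
    pcp->count -= 1 << order;

    return page;
}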
rmqueue_bulk() calls __rmqueue() count times, based on the pageset batch size computed in __rmqueue_pcplist(). Each returned page is then added to the PCP list.
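A simplified sketch of rmqueue_bulk() (vmstat accounting and sanity checks omitted; not the verbatim 6.6 source) looks roughly like this:

static int rmqueue_bulk(struct zone *zone, unsigned int order,
            unsigned long count, struct list_head *list,
            int migratetype, unsigned int alloc_flags)
{
    unsigned long flags;
    int i, allocated = 0;

    spin_lock_irqsave(&zone->lock, flags);
    for (i = 0; i < count; ++i) {
        struct page *page = __rmqueue(zone, order, migratetype,
                                      alloc_flags);
        if (page == NULL)
            break;

        // Queue the page on the PCP list that is being refilled
        list_add_tail(&page->pcp_list, list);
        allocated++;
    }
    spin_unlock_irqrestore(&zone->lock, flags);

    return allocated;
}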
__rmqueue() uses __rmqueue_smallest() to find a page in the Buddy allocator's free areas for the given migration type. If this attempt fails, and this is not a CMA allocation request, it uses __rmqueue_fallback() to steal pages from pageblocks with a different migration type (see __rmqueue_fallback() -> find_suitable_fallback(), get_page_from_free_area(), steal_suitable_fallback()):
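// Roughly (the CMA balancing heuristic is trimmed down here)
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
                        unsigned int alloc_flags)
{
    struct page *page;

retry:
    page = __rmqueue_smallest(zone, order, migratetype);
    if (unlikely(!page)) {
        // For movable allocations, CMA pageblocks can be used as a fallback
        if (alloc_flags & ALLOC_CMA)
            page = __rmqueue_cma_fallback(zone, order);

        // Otherwise, try to steal a pageblock from another migration type
        if (!page && __rmqueue_fallback(zone, order, migratetype,
                                alloc_flags))
            goto retry;
    }
    return page;
}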
We finally reached the core part. __rmqueue_smallest() iterates over the zone->free_area array, starting from order N (N=0 in our case) to NR_PAGE_ORDERS - 1 (11 - 1), using the order as an index in the array.
If a page for the designated migration type is found in the corresponding freelist, the page is removed from the original list by del_page_from_free_list(), processed by expand(), and then returned:
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                        int migratetype)
{
    unsigned int current_order;
    struct free_area *area;
    struct page *page;

    /* Find a page of the appropriate size in the preferred list */
    for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
        area = &(zone->free_area[current_order]);
        page = get_page_from_free_area(area, migratetype);
        if (!page)
            continue;
        del_page_from_free_list(page, zone, current_order);
        expand(zone, page, order, current_order, migratetype);
        set_pcppage_migratetype(page, migratetype);
        trace_mm_page_alloc_zone_locked(page, order, migratetype,
                pcp_allowed_order(order) &&
                migratetype < MIGRATE_PCPTYPES);
        return page;
    }

    return NULL;
}
If the order of the page found is greater than the requested order, expand() will split the page into lower order pages until the order is equal to the requested one.
For example, if an order 0 page has been requested, but only an order 1 page is available for the specified migration type, the order 1 page is split into two order 0 pages (two “buddies”).
One of the buddies is then added to the order 0 freelist for the same migration type, and the other is returned.
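Slightly simplified (debug checks, guard-page handling and the exact freelist-helper signatures are glossed over), expand() looks like this:

static inline void expand(struct zone *zone, struct page *page,
                          int low, int high, int migratetype)
{
    // "low" is the requested order, "high" the order of the block we got
    unsigned long size = 1 << high;

    while (high > low) {
        high--;
        size >>= 1;

        // The upper half becomes a free buddy of order "high": link it
        // into the corresponding freelist and record its order
        __add_to_free_list(&page[size], zone, high, migratetype, false);
        set_buddy_order(&page[size], high);
    }
    // The lower half (the original "page" pointer) is handed back to the caller
}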
The inverse of this process (coalescing) is performed by __free_one_page() when a page of order N is freed and the order is less than MAX_ORDER. Before the free page is added to the appropriate freelist, if its buddy is also free, the two pages are combined and moved up by one order. The coalescing process continues, one order at a time, until no free buddy is found at the new order or the order reaches MAX_ORDER:
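// Simplified sketch of the coalescing logic in __free_one_page(), not the
// verbatim 6.6 source (guard pages, migratetype accounting and the
// head/tail placement heuristics are omitted)
static inline void __free_one_page(struct page *page, unsigned long pfn,
                                   struct zone *zone, unsigned int order,
                                   int migratetype, fpi_t fpi_flags)
{
    unsigned long combined_pfn;
    struct page *buddy;

    while (order < MAX_ORDER) {
        unsigned long buddy_pfn;

        // Returns the buddy only if it is free and of the same order
        buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
        if (!buddy)
            break;

        // Unlink the buddy from its freelist and merge the two blocks
        del_page_from_free_list(buddy, zone, order);
        combined_pfn = buddy_pfn & pfn;
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;
        order++;
    }

    // Record the final order in page->private and link the merged block
    // into the freelist for that order
    set_buddy_order(page, order);
    __add_to_free_list(page, zone, order, migratetype, false);
}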
It is worth mentioning that, before a freed page is added to the buddy freelists, if its order is less than or equal to PAGE_ALLOC_COSTLY_ORDER, there is an attempt to add the page to the PCP lists (see, for example, the __free_pages() -> free_the_page() -> free_unref_page() -> free_unref_page_commit() path). If too many pages accumulate in the PCP lists, they are moved to the appropriate buddy freelists (see the call to free_pcppages_bulk() in free_unref_page_commit()). All this, to reduce contention on the zone lock and memory fragmentation.
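A heavily simplified sketch of the idea behind free_unref_page_commit() (the real high-watermark logic in 6.6 is more involved):

static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
                                   struct page *page, int migratetype,
                                   unsigned int order)
{
    int high;
    int pindex = order_to_pindex(migratetype, order);

    // Cache the freed page on the matching per-CPU list
    list_add(&page->pcp_list, &pcp->lists[pindex]);
    pcp->count += 1 << order;

    // Too many cached pages? Return a batch of them to the buddy freelists
    high = READ_ONCE(pcp->high);
    if (pcp->count >= high)
        free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp, pindex);
}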
In both cases - after splitting two buddies or after coalescing them - set_buddy_order() is used to update the page->private field of the first page in the block with the new order:
static inline void set_buddy_order(struct page *page, unsigned int order)
{
    set_page_private(page, order);
    __SetPageBuddy(page);
}

static inline void set_page_private(struct page *page, unsigned long private)
{
    page->private = private;    // new page order
}
Examples
Let’s return to our original scenario. The kernel needs to allocate a new kmalloc-cg-192 slab, so an order 0 page is requested. Here are some possible cases that can help visualize what is going on behind the scenes:
Example 1: The order 0 page is found in the appropriate PCP list
Step 1: rmqueue_pcplist() finds the PCP pageset in the chosen zone, then it picks the appropriate PCP list based on the requested order (0) and migration type (MIGRATE_UNMOVABLE).
Step 2: __rmqueue_pcplist() obtains the first available page from the list, then the page is unlinked and returned.
Example 2: The order 0 page is not found in the PCP list, fallback to the Buddy allocator
Step 1: __rmqueue_smallest() iterates through the free_area of the designated zone, starting from order 0. An order 0 page is found in the order 0 freelist for the MIGRATE_UNMOVABLE migration type.
Step 2: The order 0 page is unlinked from the freelist and expand() is called, but there is nothing to do since the requested order and the page order are the same. The page is then returned.
Example 3: The order 0 page is not found in the PCP list, fallback to the Buddy allocator. Freelists from order 0 to 2 for MIGRATE_UNMOVABLE are empty
Step 1: The function __rmqueue_smallest() begins by iterating through the free_area of the designated zone, starting from order 0. All freelists for the MIGRATE_UNMOVABLE migration type, from order 0 to 2, are empty. The first available page is an order 3 page.
Step 2: The order 3 page is unlinked and passed to the expand() function to be split into multiple buddies.
Step 3: The expand() function starts splitting the page into multiple buddies:
Step 3.1: The order 3 page is split into two order 2 pages. One of the buddies is added to the order 2 freelist for the MIGRATE_UNMOVABLE migration type.
Step 3.2: The remaining order 2 page is split into two order 1 pages. One of the buddies is added to the order 1 freelist for the MIGRATE_UNMOVABLE migration type.
Step 3.3: Finally, the remaining order 1 page is split into two order 0 pages. Of these, one is added to the order 0 freelist for the MIGRATE_UNMOVABLE migration type, while the other is returned.
The Linux page allocator is a complex system, and this short analysis is just the tip of the iceberg. In any case, I hope you found it useful! I plan to extend (or maybe I should say expand()…) the article by adding more information, new diagrams, and notes as soon as I have time. For any questions, corrections, or clarifications, feel free to contact me.