malloc is probably the single biggest reason why software quality has declined over the last thirty years, not just for C, but across the board, since plenty of programming languages and libraries out there use it. The way it and its sister functions realloc and free work is diametrically opposed to how memory management works on most systems, with "most systems" here meaning those that perform virtual address translations via a Memory Management Unit (MMU) or some equivalent. And no, MMUs are not abstracted away by the kernel, nor are they something that should be abstracted away even if you asked.
malloc is a usermode allocator, which means that it still needs to trap into the kernel just like you would have to in order to allocate the memory that you requested, but it adds its personal layer of bullshit on top for two reasons. The first is that kernelmode allocators operate on pages, not on individual bytes. What's a page? Well, that depends on your hardware and configuration. The usual size is 4 KiB (4096 bytes), but for twenty years now CPUs (MMUs, really) have also been able to operate on 2-MiB and 1-GiB pages. What's the benefit of those? Remember when I said that "most systems" perform virtual address translations? Yeah, those translations are stored in memory, in a structure called the page table, which the kernel manages for your process and in which all address regions that your process is supposed to have access to are listed. If it's not listed there, you can't access it. If it IS listed there you still might not be able to access it, depending on your process' privileges on that page (you can't write into a readonly page, for instance). This isn't a Linux or Windows thing by the way, this is just how the hardware works; there's special registers and everything for this shit.
The problem is that walking the page table is slow (easily hundreds of cycles if the page-table entries themselves aren't cached), which is why the MMU keeps recent translations in a CPU-near cache called the Translation Lookaside Buffer (TLB). The problem with the TLB is that it's located on the CPU, which means it's small like everything else on the CPU, which means that you can only have so many translations before it has to evict older entries. Entries that you may well still need, and that now have to be retrieved during long evening walks on the page table again. Bigger page sizes reduce the number of entries the TLB has to keep track of, which results in fewer evening walks. Instead of using 512 4-KiB page translations you can use a single 2-MiB page that gets the job done just as well, if not better (because its translation only has to be retrieved once, not 511 more times).
In contrast to all of this, malloc lets you allocate at byte granularity instead. Big deal, right? The second thing it does for you that necessitates its userspace layer of bullshit is splicing lifetimes. And that it does very well - your 2-MiB page can host thousands of relatively small allocations that each have their individual lifetime; you can free some, reallocate some, free others again, allocate a new one - it's very generic. malloc is a general-purpose memory allocator, and if that's what you need, then malloc will make you happy.
The thing is that most often people, and by that I mean developers, and by that I mean developers who really should know what they need, don't know what they need at all. Whenever the need to allocate memory arises they go down a very small checklist with only two items: stack, and malloc. If it's too big: malloc. If it needs to survive the return: malloc. That's it. That's all that's going on in their heads. And that's bad if that memory is used for more than just a bucket of storage across function calls. What to do if it turns out that you need more memory than your initial allocation? Ah, that's what realloc is for. But how does that work?
Well, it works by checking whether malloc has placed another bucket right next to the end of your particular bucket in the meantime. If it hasn't, then realloc can just resize the thing in place, but if it has ... well, that other allocation probably won't take too kindly to realloc handing its memory over to you. Chances are, in general, pretty high that realloc won't be able to resize your bucket without moving it someplace else entirely, which is called "relocation". Relocations are expensive. They're usually done by first allocating a new bucket with the new size, then memcpying the old data from the old bucket into the new bucket, and then freeing the old bucket. Hopefully you didn't have any pointers still pointing into that old bucket, because that bucket's gone now and they're dangling. It should be noted that memcpy, even if it's a well optimized version (and it's not on Windows, I can tell you that), will still thrash your CPU caches, with the amount of damage done depending on how much data it copied (unless it's the kind of memcpy that does non-temporal shit. Non-temporal shit bypasses the cache. As in, even if the data is already in the cache it's still going to bypass the cache. Which is retarded.)
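If you want to see that failure mode in isolation, here's a minimal sketch (the sizes are arbitrary, and whether the relocation actually happens depends on your allocator and what else it has handed out in the meantime):
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
char *buf,*tmp;
uintptr_t old_addr;
if(NULL == (buf = malloc(64)))
return 1;
strcpy(buf,"hello");
old_addr = (uintptr_t)buf; /*Remember where the bucket used to live.*/
if(NULL == (tmp = realloc(buf,1 << 20))) /*Grow to 1 MiB; may or may not relocate.*/
{
free(buf);
return 1;
}
buf = tmp;
if((uintptr_t)buf != old_addr)
printf("relocated: every pointer still holding the old address now dangles\n");
free(buf);
return 0;
}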
OK, let's say you want to avoid relocations, so you allocate a big old chunk of memory that you're certain you'll never end up using completely, and then you slap a realloc at the end of it just in case. The payoff for this will come a little bit later, but for now you should be aware of three things:
malloc drives up the commit charge on both Windows and Linux. On Linux you can enable overcommit, but then the out-of-memory-killer is hovering over you.
Because malloc drives up the commit charge, and the second buffer has to be allocated before the copy can start, there's a chance that the reallocation can just fail while you still have 30% of physical memory free.
Even if you never run into your realloc the memory that you just malloced is going to be unavailable for other processes. Those 4-GiB that you allocated just to be sure? Yeah, they're gone until your process dies, even if you only ever end up using 912 bytes.
And then there's multithreading. Not everything malloc does requires atomicity or locking, but some of the things it has to keep track of in global structures would break if two or more threads accessed them at the same time - so malloc requires its own set of locks (on top of the kernel locks that protect your process' page table entries; just think about that for a minute). A call to malloc or free, even if the memory is already available in usermode, easily takes over 1000 cycles. A certain program of mine that shall go unnamed had its first multithreaded version complete in 16 minutes using malloc; by the end of it malloc was gone and it completed in 7-8 seconds.
Now, even if all you ever needed was a static bucket of memory to place your data into, and you never resize or move it, you probably still don't want to use malloc, because you have no control over whether or not it places your data right next to each other - which matters when you want to use bigger pages, or when spatial locality is required for proper CPU cache utilization. And even if you somehow did have that control (probably with minmalloc), you'd probably still not want to use malloc, because malloc uses some of the memory it requests from the kernel to store its own state. Not only is that bookkeeping a waste of space, but if it gets clobbered by a buffer overflow from the previous bucket it can lead to all sorts of fun (not limited to minmalloc either): https://daniel.haxx.se/blog/2016/10/14/a-single-byte-write-opened-a-root-execution-exploit/
Also keep in mind that allocation is just one side of the coin; release is the other. Every call to malloc requires an equal and opposite call to free; there's no way to tell malloc to place the next allocation(s) into a specific pool of objects which all share the same lifetime (because malloc doesn't expose these buffers/mappings/arenas to the outside world, although I've heard that minmalloc does ...? Haven't verified it though), which would make the release of all of these objects trivial. But no, you have to go through every single object and free it just as you allocated it - there's no free_all function. And even if there was, how would free_all know when to stop? There are allocations you don't want removed because your process depends on them implicitly, like locales or program parameters or exceptions (on Windows at least) ... now, granted, you can allocate an arena from malloc and then place all objects with the same lifetime in there; that way you'll be able to remove them all as simply as setting the stack pointer register (RSP) back to whatever value it was previously, but reallocations (and relocations) will still be a bitch if you don't know how many elements you'll end up putting in that buffer.
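To make the merged-lifetime idea concrete, here's a bare-bones arena sketch - the names (arena_push, arena_reset) and the struct layout are mine, not any standard API; the backing memory can come from wherever you like:
#include <stddef.h>
#include <stdint.h>
struct arena{uint8_t *base;size_t cap;size_t used;}; /*base can come from VirtualAlloc/mmap or a static buffer.*/
/*Hand out the next "size" bytes, or NULL if the arena is exhausted.*/
static void *arena_push(struct arena *a,size_t size)
{
void *ptr;
if(a->cap - a->used < size)
return NULL;
ptr = a->base + a->used;
a->used += size;
return ptr;
}
/*The free_all that malloc never gave you: every object from this arena dies at once.*/
static void arena_reset(struct arena *a)
{
a->used = 0;
}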
In short: there is absolutely no reason whatsoever to consider malloc the default allocator. Only use malloc if you absolutely NEED your objects to have an individual lifetime. Which happens, y'know. Like in higher programming languages, as is evident from the fact that, rather than merging lifetimes together, they all use reference counting, RAII, or garbage collection to make sure that your memory gets freed whenever it goes out of scope. The reason why they're all using reference counting, RAII, or garbage collection in the first place is that compilers/interpreters/virtual machines are really, really, really bad at merging objects with the same lifetimes. They can keep track of them no problem, but they can't merge them. If they could do that they wouldn't have to call malloc and free for every single object that they encounter and then add some additional bookkeeping on top; if they were able to recognize that 9001 objects have literally the same lifetime they could allocate and free them with a single call.
But guess who knows exactly what lifetimes certain objects have, which ones can be merged, and which ones can't? That's right. (You). And what's more, (You) also know what (You) don't know, which is how many bytes you actually might end up processing. You either know, in which case you can allocate the exact amount of bytes, or you don't know, in which case you have to go through the ridiculousness that is realloc, but you always know whether or not you know. Compilers/interpreters/VMs don't, and in fact might never know. They can guess, but they can't know, which is why higher-level languages will almost always be slower than an implementation in C that has its lifetimes straightened out. As far as I can tell most languages don't even support a proper mmap binding (they always focus on the file-mapping part, not the memory-allocation part); C++ does, and interfaces that support std::pmr::polymorphic_allocator are probably able to make use of such mappings (not that all of them do), but honestly ... why bother at this point?
At the beginning I wrote that malloc needs to trap into the kernel in order to allocate the memory it hands out to you. The way this is done is well documented for both Windows and Linux - by calling these functions:
LPVOID VirtualAlloc(LPVOID lpAddress,SIZE_T dwSize,DWORD flAllocationType,DWORD flProtect);
void *mmap(void *addr,size_t length,int prot,int flags,int fd,off_t offset);
There are others, for both systems, and I'm going to namedrop them in case you want to research them further, but I'm not going to discuss them in this primer - for Linux there's brk and sbrk, on Windows there's NtAllocateVirtualMemory and NtAllocateVirtualMemoryEx (and technically also Global/Local/HeapAlloc/Free, but you can ignore those - in fact malloc on Windows is implemented via kernelbase.HeapAlloc, which is just a reference to ntdll.RtlAllocateHeap).
Let's get back to VirtualAlloc and mmap - not only do they have considerably more parameters than malloc, they also look fairly similar despite coming from different kernels (since mmap is also used to create file mappings it has two additional parameters, which are ignored for memory allocation purposes). The advantage of these allocators is that you can allocate not just memory with them, but, much more importantly, virtual address space, of which we have plenty these days on our 64-bit processors with their 48 bits of virtual address space (256 TiB, to be precise, although half of it is not accessible to a process because the other half is reserved for and used by the kernel, on both Windows and Linux). A virtual address space reservation just tells the kernel that this range is yours now and that it must never return addresses in that range unless you explicitly ask it to (like during the actual commitment of memory). That is very easy for the kernel to keep track of, and as such a mere reservation doesn't require much memory at all.
On Windows this process is almost self-explanatory - assume that "a" is an object of an arena type which keeps track of the position of the mapping, the size of the reserved range, and the size of the committed range:
/*Get a reservation, doesn't drive up the commit charge.*/
size = TIB_2_B(16);
if(NULL == (ptr = VirtualAlloc(NULL,size,MEM_RESERVE,PAGE_READWRITE)))
goto LABEL_END;
a.a_ptr = ptr;
a.a_reserved = size;
/*Get a commitment, drives up the commit charge.*/
size = KIB_2_B(64);
if(NULL == (ptr = VirtualAlloc(a.a_ptr,size,MEM_COMMIT,PAGE_READWRITE)))
goto LABEL_END;
a.a_committed = size;
/*Decommit, that is, return memory to the kernel explicitly.*/
VirtualFree(a.a_ptr,a.a_committed,MEM_DECOMMIT);
a.a_committed = 0;
/*Release the entire mapping.*/
VirtualFree(a.a_ptr,0 /*0, not a.a_reserved*/,MEM_RELEASE);
The advantage of reserving an obscene amount of virtual address space upfront while committing only a fraction of actual memory to it is that your data will never move. There is never a need to copy old data from an old mapping into a new mapping, since the only reason that's ever done is realloc not being able to extend your bucket in place. But because you own the next tebibyte or whatever of virtual address space, which the kernel is mandated not to place anything into (unless you specifically ask it to), you are guaranteed to have that address space available. That means that pointers pointing into that mapping never need to be updated, since the underlying objects will never relocate, and your CPU caches remain unmolested - and if you can predict that you're not going to need that memory for a long time you might just as well decommit it and return it to the kernel. The only thing you have to keep track of is how much memory you committed, which can be done with a singular size_t, and how much more memory you need to commit if you happen to need it.
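Here's roughly what growing the commit on demand could look like, reusing the a_ptr/a_reserved/a_committed fields from the snippet above (the struct declaration, the helper name and the 64-KiB rounding are mine, and KIB_2_B is the size macro from the Linux prototype further down - treat this as a sketch, not the one true implementation):
struct win_arena{void *a_ptr;size_t a_reserved;size_t a_committed;};
/*Make sure at least "needed" bytes are committed; re-committing already-committed pages is allowed, so one size_t of bookkeeping is enough.*/
static int arena_ensure_committed(struct win_arena *a,size_t needed)
{
size_t grow;
if(needed <= a->a_committed)
return 1; /*Already covered.*/
if(needed > a->a_reserved)
return 0; /*Would run past the reservation.*/
grow = (needed + KIB_2_B(64) - 1) & ~(size_t)(KIB_2_B(64) - 1); /*Round up to 64 KiB.*/
if(NULL == VirtualAlloc(a->a_ptr,grow,MEM_COMMIT,PAGE_READWRITE))
return 0; /*Commit charge exhausted, most likely.*/
a->a_committed = grow;
return 1;
}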
One important thing of note for both systems is that, unless you're on an embedded system or whatever, the pages that those allocators return to you have already been zeroed out by the kernel (to make sure that critical information from the kernel or from another process isn't exposed to your program), so you don't need to memset anything. In that respect VirtualAlloc/mmap are like calloc, just without an additional memset done in userspace.
This primer is already long, and I still have the Linux section to cover, but there are three things I want you to keep in mind about this pattern. The first is that Windows, which once upon a time targeted RISC CPUs that couldn't load a 32-bit immediate in a single instruction, has an address space granularity of 64 KiB (https://devblogs.microsoft.com/oldnewthing/20031008-00/?p=42223), which means that if you reserve or commit memory you should do it in multiples of 64 KiB (otherwise you're just wasting virtual address space, and you never know what a couple of additional bytes of memory will come in handy for - even if it's just some padding bytes so that you can use vector instructions without worrying about running past the end, that's already good; that's a major constraint memcpy has to deal with). The second is that VirtualAlloc is not actually a kernel function. Just like malloc it has a non-trivial amount of userland code that performs parameter verification and additional parameter loading before it calls the wrapper that actually traps into the kernel (which is NtAllocateVirtualMemory, and which almost immediately performs a mode switch into the kernel). Unlike malloc, though, VirtualAlloc does not keep track of spare chunks of memory that it could hand back to you in order to avoid trapping into the kernel (an expensive operation that can easily cost more than a couple thousand cycles), so avoid multiple commitment calls to VirtualAlloc if one suffices (you'll still need a separate call to create the initial virtual address space reservation; you can't avoid that). The third is that 2-MiB and 1-GiB page allocations are such an extensive topic that they deserve a primer of their own, so don't try and allocate them with VirtualAlloc. It probably won't work. Trust me on that.
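If you don't want to hardcode the 64 KiB you can ask the system for it - GetSystemInfo and dwAllocationGranularity are the real Win32 API; the rounding helper is just my own convenience sketch:
#include <windows.h>
/*Round a size up to the system's allocation granularity (64 KiB on current Windows versions).*/
static size_t round_to_allocation_granularity(size_t size)
{
SYSTEM_INFO si;
size_t gran;
GetSystemInfo(&si); /*Fills in dwAllocationGranularity among other things.*/
gran = si.dwAllocationGranularity;
return (size + gran - 1) & ~(gran - 1);
}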
Now let's move on to Linux. Linux is weird. Linux in its default configuration is like a bank that hands out loans while its actual deposits are at a meager 10%; it gives out memory even if the commit charge has blown way past the amount of available memory in both RAM and swap, and just like with banks it works as long as enough processes do not access the pages that the kernel assigned to them. The moment they do it's bankrun time, and the kernel has to decide which processes get memory and which ones get killed to feed other processes' memory demands. What's even worse is that this can happen at any moment - it's not like mmap will suddenly crash your process or return MAP_FAILED to indicate that memory is running out. No, the reckoning happens when memory is actually committed to the address space, which, unlike on Windows, does NOT happen with another call to mmap or madvise, but during plain memory accesses. It's during these that the CPU notices that no memory is backing the address and raises a page fault, and the kernel, realizing it has made a promise it now has to keep, tries to commit that memory to your addresses - and if it can't, the OOM killer goes looking for a victim.
Overcommit can be disabled using the following two commands as root (the first takes effect immediately, the second makes the setting stick across reboots):
# echo "2" > /proc/sys/vm/overcommit_memory
# echo "vm.overcommit_memory=2" >> /etc/sysctl.conf
This makes mmap behave more like VirtualAlloc in the sense that it will now fail if the requested amount of memory isn't available. But if you now attempt to reserve 16 TiB of virtual address space mmap will also fail, because mmap doesn't cleanly differentiate between address space reservation and memory commitment ... unless you first create a reservation using PROT_NONE and then call mmap again with the amount of memory you want to commit and the MAP_FIXED flag (plus the MAP_LOCKED flag, because that prefaults as many pages as possible while the CPU is still in kernelmode, which avoids mode switches later on). Decommitting is done by overmapping the committed area with PROT_NONE and MAP_FIXED again, and releasing the reservation requires a munmap.
To verify this behavior I have personally written and tested the following prototype, and checked the process' mapping status in /proc/:
#include <sys/mman.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <errno.h>
#define FACTOR_IEC (1024ULL)
#define KIB_2_B(b) (FACTOR_IEC * ((uintptr_t)(b)))
#define MIB_2_B(b) (FACTOR_IEC * KIB_2_B(b))
#define GIB_2_B(b) (FACTOR_IEC * MIB_2_B(b))
#define TIB_2_B(b) (FACTOR_IEC * GIB_2_B(b))
#define PIB_2_B(b) (FACTOR_IEC * TIB_2_B(b))
#define EIB_2_B(b) (FACTOR_IEC * PIB_2_B(b))
#define ZIB_2_B(b) (FACTOR_IEC * EIB_2_B(b))
#define YIB_2_B(b) (FACTOR_IEC * ZIB_2_B(b))
#define mmapm(ptr,size,prot,flags) mmap((ptr),(size),(prot),(flags),-1,0)
#define mmapm_reserve(size) mmapm(NULL,(size),PROT_NONE,MAP_PRIVATE | MAP_ANONYMOUS)
#define mmapm_commit(ptr,size) mmapm((ptr),(size),PROT_READ | PROT_WRITE,MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_LOCKED)
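/*Note (my addition): MAP_LOCKED counts against RLIMIT_MEMLOCK, so large commits may need a raised limit or elevated privileges.*/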
#define mmapm_decommit(ptr,size) mmapm((ptr),(size),PROT_NONE,MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED)
#define mmapm_release(ptr,size) munmap((ptr),(size))
int main(void)
{
static const unsigned int sleep_secs = 30; /*I've added sleeps throughout the program in case you want to verify
the way mappings are committed and decommitted using /proc/.*/
static const size_t size_reserve = TIB_2_B(16); /*More than anyone can reasonably use.*/
static const size_t size_commit = GIB_2_B(1); /*Only to have the commit show in task manager: real commits would be smaller.*/
uint8_t*ptr_reserve;
uint8_t*ptr_commit;
int ret = 0;
/*Reservation*/
if(MAP_FAILED == (ptr_reserve = mmapm_reserve(size_reserve)))
{
ret = errno;
fprintf(stderr,"Reservation failed: %u\n",ret);
goto LABEL_END;
}
printf("PID: %u\n",getpid()); /*For /proc/.*/
printf("Reservation: %p - %p\n",ptr_reserve,ptr_reserve + size_reserve);
/*Commit*/
if(MAP_FAILED == (ptr_commit = mmapm_commit(ptr_reserve,size_commit)))
{
ret = errno;
fprintf(stderr,"Commit failed: %u\n",ret);
goto LABEL_RELEASE;
}
printf("Commit: %p - %p\n",ptr_commit,ptr_commit + size_commit);
sleep(sleep_secs);
/*Decommit*/
if(MAP_FAILED == (ptr_commit = mmapm_decommit(ptr_commit,size_commit)))
{
ret = errno;
fprintf(stderr,"Decommit failed: %u\n",ret);
goto LABEL_RELEASE;
}
printf("Commit decommitted\n");
sleep(sleep_secs);
LABEL_RELEASE:
/*Release*/
if(0 != mmapm_release(ptr_reserve,size_reserve))
{
ret = errno;
fprintf(stderr,"Release failed: %u\n",ret);
goto LABEL_END;
}
printf("Reservation released\n");
sleep(sleep_secs);
LABEL_END:
return ret;
}
Of note is that Linux has a mremap syscall whose semantics are a lot like realloc's - but unlike realloc it doesn't copy the old data into a new bucket; instead it moves the underlying mapping into another region of the virtual address space so that it can be extended, without actually copying anything. I would highly advise against using this interface, firstly because it's a syscall, secondly because altering the page table requires the kernel to take additional locks (remember what I said about locks protecting your page table?), thirdly because any relocation destroys the "absolute pointers that never change" pattern, fourthly because Windows doesn't have a VirtualReAlloc or anything similar to emulate the behavior, but fifthly and most importantly: it's really fucking useless, isn't it, what with that 16-TiB virtual address space reservation?
Now, can't we just plug these mappings under our currently existing malloc infrastructure and call it a day? Sure we can, but we wouldn't have gained anything. The very purpose of getting rid of individual lifetime tracking is that it's the programmer who implements the lifetime by deciding when to zero the field that keeps track of the current end. If you still want individual lifetimes there's allocators like dlmalloc or jemalloc that potentially provide better performance than your standard malloc, but if you want your programs to use arenas without lifetimes you have to design your code around them. And now you know why not just the malloc implementation, but the very malloc interface is so invasive and retarded.
However, that doesn't mean that you can't have ANY lifetimes in your arenas - they're just unified. I've previously mentioned that freeing an entire arena is like resetting the stack pointer register (RSP) to a known good value, in the sense that you can simply decide to overwrite your old data from the beginning of the arena without having to "free" it. In fact the idea of unifying lifetimes comes from the stack frames that are created whenever another function is called and destroyed whenever it returns. You can stack specific ranges within your arena just like the temporary lifetime of a function frame, with the added bonus that the data beyond your "end of data" marker is not in danger of being overwritten at any time due to preemption or signal delivery (which can dump register state onto the stack just beyond the stack pointer, invalidating most, if not all of what you may have stored past it). Keeping track of a particular lifetime is as easy as saving the end marker (the arena's RSP, if you will) into a variable on your actual thread stack (much like the stack base pointer register, RBP, is quite often used to point at the beginning of the current stack frame). Your arenas can be your little stack away from your actual stack, too - in fact that's a pattern I've used to store stack information for a Microsoft Windows syscall tracer. Why? Well, a lot of kernel functions are not well documented, and calling them with the wrong parameters can lead to failure or even crashes! I've seen it all!
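In code, with the toy arena from earlier, the little-stack-away-from-your-stack pattern is nothing more than saving and restoring the end marker (again, the names are mine):
size_t mark = a.used; /*"Push a frame": remember the current end of data.*/
uint8_t *scratch = arena_push(&a,KIB_2_B(4)); /*Temporary allocations on top of the mark (check for NULL in real code).*/
/*... use scratch for whatever this "frame" needs ...*/
a.used = mark; /*"Pop the frame": everything allocated past the mark is gone in one assignment.*/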
Which begs the question - if it's possible to create stack semantics within an arena, is it possible to create an arena on the stack as well? Well, yes, it is - and it's also faster than using malloc or VirtualAlloc or mmap - but there are caveats. It's true that kernels often allocate much more memory for a thread stack than is ever being used, but the stack doesn't just hold temporary variables - it holds the return addresses of all function calls the thread is currently in, and you need to be really careful not to accidentally use too much memory for your arena and cause a stack overflow, or, even worse, write beyond it. Your arena will also only ever be available in your current function and in functions that sit below yours in the call hierarchy, for the very simple reason that, whenever a CALL instruction is executed, the return address is placed onto the stack, whereas the RET instruction takes the value that RSP is pointing to and loads it into RIP again. That means that RSP needs to have the same value upon exit as it did upon entry; if that is not the case, or your return address got clobbered, then a lot of bad things can happen - ranging from your program crashing because RIP was loaded with a non-executable address (best case) to your program executing code that may have been prepared by an attacker who's now in control of your thread, and possibly your process. And since your arena will be gone by the time the function that created it returns (because everything beyond RSP is to be considered volatile; don't even try - the kernel, signal handlers on Linux, and APCs on Windows always win) you want to create your arena as early as possible, on top of being really careful not to overflow it!! Don't say I didn't warn you!
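A stack-backed arena is then as simple as this (same toy arena struct and arena_push as before; the function and the 16-KiB figure are made up, so size it against your platform's default stack size):
#include <stdint.h>
static void do_work(void)
{
uint8_t backing[16 * 1024]; /*16 KiB carved out of the current stack frame; keep it small.*/
struct arena a = {backing,sizeof(backing),0}; /*Reuses the arena struct from the earlier sketch.*/
/*... arena_push to your heart's content; everything dies when do_work returns ...*/
(void)a;
}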
Let me also address some arguments that detractors have levied against this primer:
malloc is fast enough.
See my comment about the multithreaded application. Of 24 threads only 6 were running because they were the only ones that would ever get a hold of the lock, and the other 18 were just stalling and waiting for memory.
arenas are only for performance-critical programs.
So I don't know if you've noticed, but the hardware industry has handed the software industry orders of magnitude of performance increases over the last thirty years; our mass storage devices are bigger and faster than they've ever been, our processors have more threads and can execute more instructions per clock than they ever could, and RAM has gotten to the point where HDDs of 30 years ago wouldn't have been able to store 32 GiB, let alone 64. Do you feel that increase? Do our machines boot in two seconds? How long does your favorite vidya take to load? Did you know that back in 2021 someone did a deep-dive on the loading times of GTA V, in particular the online mode (which took 6 minutes on his machine), and when he actually looked at what took so long it turned out that the JSON parser loading the shop catalog information was so shoddy that the game spent most of its time in strlen? I'm not making this shit up: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ GTA V came out in, what, 2013? And in 8 years no one at Rockstar was competent enough to profile the game and figure out what the CPU was actually doing. Most people don't know how to write programs anymore. Don't be like most people.
you can just use a better implementation of malloc, like dlmalloc or jemalloc, and then it'll be all good again!
If you need lifetimes, sure. If you don't, then they might be adding a slightly more efficient layer of bullshit between you and the kernel, but the layer of bullshit is still present. The fastest code is code that doesn't run.
your argument about realloc is invalid, people don't use it all that often.
Explicitly, sure. Once upon a time I was working for a company with a sizable Perl online codebase. One afternoon one of our junior devs came to me and told me that one of our bigger customers had called: the upload of about ten thousand records would take over an hour. The junior dev had already looked at the code but didn't see anything out of the ordinary and asked me if I could take a look, and what I noticed was that the code was appending the value of every field of every record to a single buffer, which, by the end of it, was several dozen MiB in size. Every time another field was appended the interpreter would call realloc internally, and, failing to resize the buffer in place, create a new bucket for the old data plus the new field, copy the old data into the new bucket, copy the field to the end of it, and free the old bucket. Then it appended the next field, repeating the process. And then the next one. And the next one. I don't know if Convert::Scalar existed back then, which would've allowed us to preallocate the entire thing at the beginning of the process, but that's what I would do today. What I did back then was to create a second variable into which all the fields of a record were appended first, before appending the whole record to the rest of the file. Execution time went from 70 minutes to 10 minutes, and the customer thanked us for the optimization. I stand by my statements.
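The same trap, and the same fix, in C terms (a toy sketch of mine, not the Perl code): sum the lengths first, allocate once, and no append ever relocates anything:
#include <stdlib.h>
#include <string.h>
/*Join nfields byte strings with exactly one allocation instead of one realloc per field.*/
static char *join_fields(char **field,size_t *field_len,size_t nfields,size_t *out_len)
{
size_t total = 0,pos = 0,i;
char *out;
for(i = 0;i < nfields;i++)
total += field_len[i]; /*Know the final size up front ...*/
if(NULL == (out = malloc(total)))
return NULL;
for(i = 0;i < nfields;i++)
{
memcpy(out + pos,field[i],field_len[i]); /*... so appending is just a copy, never a relocation.*/
pos += field_len[i];
}
*out_len = total;
return out;
}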
malloc doesn't place buckets right next to each other (anymore), but creates individual mappings for bigger allocations
... and? You still have the issue of individual lifetimes per allocation, resizing is still 72 bitches and a pain in the arse (because it involves a lookup to see whether the mapping can grow in place, and if it can't it requires a relocation - and even if you avoid the userspace copy because you happen to be on a system with mremap or an equivalent, that's still a bunch of locks the kernel has to acquire for you), and to top it off memory fragmentation has gone up while cache utilization has gone down. And this really is a better topic for the CPU cache primer, but (modern) caches are indexed by virtual address, meaning that the larger the gaps of unused virtual address space, the more likely it becomes that a cache line holding hot data will be evicted while the cold one right next to it remains unused. Why? Because the addresses that THAT particular cache line covers mostly point into the void of unmapped addresses - and even if that weren't the case (because you're on a machine that uses physical indices for whatever reason), hardware prefetchers operate on virtual addresses and don't cross page boundaries. With the reserve-high/commit-low pattern you make sure that there are as few "holes" in the virtual address space as possible, thus giving you at least a chance at proper cache utilization. Relocations also don't happen.
my code simply doesn't work with just one lifetime!
Reserve-high/commit-low doesn't mean you have one lifetime, it means you have unified lifetimes. You can still decide which sections are used for what, and you can implement stack semantics at very little cost (keep the end of the current lifetime in a variable on the thread stack). And if everything else fails you can still use separate arenas with special lifetimes.