Table of contents
- The Arena’s Edge Case
- Act 1: The Naive Pool and the Fragmentation Problem
- Act 2: Solving Fragmentation with a Free List
- Act 3: The Hidden Flaw - The Iteration Problem
- Act 4: The Packed Array - A Quest for Perfect Iteration
- Act 5: The Unstable Pointer Crisis
- Act 6: The Final Architecture - Generational Handles
- The Performance Showdown: By the Numbers
- Set 1: High Churn (95% Free) - 10,000 Allocs, 9,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
- Set 2: Low Churn (5% Free) - 10,000 Allocs, 500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
- Set 3: High Alloc/Free Cost - 100,000 Allocs, 99,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
- Set 4: High Iteration Cost - 10,000 Allocs, 9,500 Frees, 100,000 Iterations, 500 Churn/1,000 Frames
- Conclusion: Mastering Memory
The Arena’s Edge Case
In Part 2, we forged the Arena: a tool of immense power built on a simple philosophy. Allocate with lightning speed by bumping a pointer, and free everything all at once. This is perfect for short-lived data or for data that shares a collective lifetime. But the Arena, like any specialist, has its edge cases. What if your system’s lifetime is long, yet its contents churn constantly, like particles spawning and dying by the thousands every second?
This calls for a new philosophy: specialization. We will engineer a Pool Allocator, designed to manage the rapid allocation and deallocation of objects that are all the same size. Our journey will take us through distinct designs, from simple to sophisticated. But remember: “better” is relative. Each step solves specific problems, and the right choice depends on your exact circumstances, such as how often you iterate versus how much you need stable references.
Act 1: The Naive Pool and the Fragmentation Problem
Let’s start with the most basic possible implementation. A “pool” is just a collection of fixed-size slots. We can create this by taking a large chunk of memory from our arena and simply deciding to treat it as an array of `Particle` objects.
struct Particle {
    bool is_active;
    float position[2];
    float velocity[2];
    float lifetime_remaining;
};
// For now, the pool just tracks how many slots we've ever used.
struct ParticlePool {
    Arena* backing_arena;
    size_t capacity; // How many slots we can iterate over
};
void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->capacity = 0;
}
// Allocation is just a bump in the arena.
Particle* PoolAlloc(ParticlePool* pool) {
    Particle* p = (Particle*)ArenaAlloc(pool->backing_arena, sizeof(Particle));
    if (p) {
        p->is_active = true; // Hand out live slots
        pool->capacity++;    // One more slot to iterate over
    }
    return p;
}
This works for spawning our initial burst of particles. But the moment a particle’s `lifetime_remaining` hits zero, we have a problem. What do we do? We can set its `is_active` flag to `false`, but its memory slot is still occupied. If we just call `PoolAlloc` again, it will give us a brand-new slot from the arena, leaving the old one empty and unusable.
This creates “holes” in our memory block. This problem is a form of internal fragmentation: the memory is allocated to our pool, but it’s not being used for any active objects.
We’ve got a basic pool, but fragmentation is wasting memory. Time to recycle…
Act 2: Solving Fragmentation with a Free List
The solution to our fragmentation problem is obvious: we need to recycle the memory. When a particle is freed, we should add it to a list of available slots. The next time `PoolAlloc` is called, it should check this free list first.
Step 1: The External List Idea
Our first thought might be to create a separate, external list to track the free slots.
// A thought experiment - don't do this!
std::vector<Particle*> free_particle_pointers;
When we free a particle, we `push_back` its pointer onto this vector. When we allocate, we check whether the vector is empty. If not, we pop a pointer off the back and use it. This works, but it feels wrong. We’re trying to build a lean, screaming-fast allocator, and we’ve just introduced the overhead of a complex data structure that does its own heap allocations just to track our free slots. We are defeating the purpose.
We can do better. We can do it with zero extra memory. But how? Where could we possibly store the pointers for our linked list without `malloc`ing a single extra byte?
Step 2: The Zero-Overhead Trick
Let’s look closely at a “dead” particle. We’ve set its `is_active` flag to `false`. What about the rest of its data? Its `position`, `velocity`, and `lifetime_remaining` are now completely useless. They are just bytes occupying space, waiting to be overwritten. This is wasted space.
What if we could repurpose it?
What if, the moment a particle becomes inactive, we could treat the memory that used to hold its float data as a place to store a single pointer, a pointer to the next free slot in our list?
This is not just a hypothetical idea. The perfect tool for this job is the `union`. A `union` is a special kind of `struct` where all members occupy the same memory location. By placing our active data and our `next_free` pointer inside a union, we are explicitly telling the compiler that these two things are mutually exclusive: a particle’s memory is either used for game data or it’s used as a link in the free list, but never both at the same time.
This is the magic trick. We will build our free list inside the memory of the dead objects themselves.
// The new, improved Particle struct
struct Particle {
    bool is_active; // Always accessible
    union {
        // State when is_active = true
        struct {
            float position[2];
            float velocity[2];
            float lifetime_remaining;
        } data;
        // State when is_active = false
        Particle* next_free;
    };
};
Now, the memory of a dead particle can be repurposed to become a node in a singly-linked list. Our pool only needs to store a single pointer: the head of this list. This is a true zero-overhead free list.
Step 3: Implementation
With this clever structure in place, the implementation becomes beautifully simple.
struct ParticlePool {
    Arena* backing_arena;
    Particle* free_list_head;
    size_t capacity;
};

// InitPool remains mostly the same, just initialize the head pointer.
void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->free_list_head = nullptr;
    pool->capacity = 0;
}

void PoolFree(ParticlePool* pool, Particle* p) {
    if (!p) return;
    p->is_active = false;
    // Prepend the freed particle to the free list.
    p->next_free = pool->free_list_head;
    pool->free_list_head = p;
}

Particle* PoolAlloc(ParticlePool* pool) {
    // First, try to recycle from the free list.
    if (pool->free_list_head) {
        Particle* recycled = pool->free_list_head;
        pool->free_list_head = recycled->next_free; // Pop from list
        recycled->is_active = true;
        return recycled;
    }
    // If the free list is empty, get fresh memory from the arena.
    pool->capacity++;
    Particle* new_particle = (Particle*)ArenaAlloc(pool->backing_arena,
                                                   sizeof(Particle));
    if (new_particle) {
        new_particle->is_active = true;
    }
    return new_particle;
}
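To see the recycling in action, here is a quick usage sketch. The `Demo` function and its `assert` checks are illustrative, assuming the backing arena has room for at least two particles:

```cpp
#include <cassert>

void Demo(Arena* arena) {
    ParticlePool pool;
    InitPool(&pool, arena);

    Particle* a = PoolAlloc(&pool);  // Fresh slot from the arena
    Particle* b = PoolAlloc(&pool);  // Second fresh slot
    (void)b;

    PoolFree(&pool, a);              // a becomes the free list head

    Particle* c = PoolAlloc(&pool);  // Recycles a's memory, no arena bump
    assert(c == a);                  // Same address: the slot was reused
    assert(pool.capacity == 2);      // The pool never grew past two slots
}
```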
This design is a huge improvement. Allocation and deallocation are incredibly fast, and it provides pointers that are stable as long as the object isn’t freed: when `PoolAlloc` returns a `Particle*`, that pointer’s value is fixed and will remain valid until `PoolFree` is called on it. Other systems can safely store this pointer, but beware: if you hold onto the pointer after freeing and the memory is reused for a new particle, it will point to a different object. This is a potential source of bugs if not managed carefully.
We’ve solved fragmentation, but is this always “better”? Not necessarily. If your system doesn’t have high churn, or if iteration isn’t a bottleneck, the naive pool might be simpler and sufficient. But for many cases, this free list is a solid step up… until we hit the next hurdle.
Act 3: The Hidden Flaw - The Iteration Problem
We have solved the recycling problem, but we have created a new, subtle performance issue, one that may or may not matter depending on your workload. For a particle system, the single most common operation is iterating over all active particles every frame to update their position and check their lifetime.
With our current design, how do we find all the active particles? They are scattered throughout the memory block allocated by the arena. We have no choice but to walk the entire block from start to finish and check each one.
Particle* all_particles = (Particle*)pool->backing_arena->base_ptr;
for (size_t i = 0; i < pool->capacity; ++i) {
    Particle* p = &all_particles[i];
    if (p->is_active) { // <-- This check is the performance killer
        // ... update and render ...
    }
}
If our pool has a capacity of 10,000 but only 500 particles are active, we are wasting 95% of our loop’s time just checking a flag. This is death by a thousand cuts. The unpredictable `if` statement can also cause CPU branch mispredictions, stalling the processor’s pipeline and making the problem even worse.
We have traded good iteration performance for easy allocation. This is a bad trade if iteration is your hot path, but if your system prioritizes fast alloc/free over frequent loops (e.g., sparse access patterns), the free list might still be ideal. In data-oriented design, we must optimize for the most common access pattern, and if that pattern is iteration, we need a different approach.
Act 4: The Packed Array - A Quest for Perfect Iteration
Our new goal is to design a data structure where all active objects are packed tightly together in one contiguous block. This would allow for a simple, cache-perfect loop with no wasted checks, if iteration is indeed your bottleneck.
This technique is called a Packed Array, and it’s maintained with a clever removal strategy often called “swap-and-pop”.
The rule is simple and absolute: all active particles live in a contiguous block at the beginning of the array, from index `0` to `active_count - 1`.
- Allocation: trivial. We just hand out the particle at `active_count` and then increment the counter.
- Deallocation: this is the clever part. To free a particle at index `i`, we don’t create a hole. We take the particle from the very end of the active block (at `active_count - 1`) and move its data into the slot at index `i`. Then we simply decrement `active_count` (see the sketch below).
The hole is instantly filled, and the contiguous block is preserved. The update loop becomes a thing of beauty:
// Perfect iteration. No branches, perfect cache locality.
for (size_t i = 0; i < pool.active_count; ++i) {
    Particle* p = &pool.particles[i];
    // ... update and render ...
}
However, this deallocation step involves copying the entire particle data from the end to the hole. If your particles are small, this is fine, but if they’re large or if you have extremely high churn (e.g., particles with very short lifetimes, dying and spawning thousands per frame), the repeated copying can add up to significant overhead. In such cases, the performance gain in iteration might be offset by slower deallocations, making this less ideal for ultra-short-lived objects compared to the free list’s near-zero-cost frees.
This is a performance win for iteration-heavy systems, but it’s not universally “better”: it complicates deallocation and introduces a new issue we’ll see next. If your use case involves infrequent iteration or needs stable pointers more than speed, stick with the free list.
Act 5: The Unstable Pointer Crisis
We have achieved the perfect loop. We have eliminated wasted work and maximized cache performance. But in doing so, we have introduced a new, far more dangerous problem, one that could make this design unsuitable for systems requiring reliable external references.
Consider this scenario:
- The UI system wants to draw a special icon over a specific particle, so it asks for and stores a pointer: `Particle* p_special = &pool.particles[5];`
- A moment later, the physics system determines that particle `5` has hit a wall and calls `PoolFree(pool, 5)`.
- The pool, following the swap-and-pop rule, copies the data from the last particle (say, at index `99`) into slot `5`.
- Now, when the UI system dereferences its `p_special` pointer, it is no longer pointing to the particle it was given. It’s pointing to a completely different particle that just happens to occupy the same memory address.
This is a stale pointer: it still points at valid memory, but no longer at the object we were given. We have solved the performance problem but created a correctness nightmare that can lead to silent data corruption, impossible-to-debug visual glitches, or outright crashes. We cannot safely pass raw pointers to objects in a packed array.
This is the ultimate architect’s dilemma: we have a conflict between iteration performance and reference stability. If your system doesn’t need external pointers (e.g., all logic is internal), the packed array is fantastic. But if stability is key, we need one more layer.
Act 6: The Final Architecture - Generational Handles
The answer is to introduce a layer of indirection. Instead of giving other systems a direct pointer (an object’s home address), we will give them a Handle (an object’s unique ID card). This handle can then be used to safely look up the object’s current address: the right tool when your problem demands both fast iteration and stable references.
Let’s build this step by step, starting with a simple example to illustrate the concepts.
Step 1: The Indirection Table
First, let’s understand why we need indirection. In the packed array, particles move around during deallocations: their positions (indices) in the array change due to swap-and-pop. A direct pointer like `&pool.particles[5]` becomes invalid if something else gets swapped into index 5.
To fix this, we decouple the “identity” of a particle from its physical location. When we allocate a particle, we assign it a stable ID, a unique number that never changes, even if the particle’s data moves in the array. Think of the ID as a permanent name tag.
Now, to find a particle’s current location, we need a way to look up “Where is ID 42 right now?” That’s where the indirection table comes in. We’ll use an array called `sparse_to_dense`: the index of this array is the ID, and the value stored there is the current index in the packed array.
For example, suppose we have a small pool with capacity 4:

- We allocate Particle A, assign it ID 0, and place it at packed index 0: `sparse_to_dense[0] = 0`.
- Allocate Particle B, ID 1, at index 1: `sparse_to_dense[1] = 1`.
- Allocate Particle C, ID 2, at index 2: `sparse_to_dense[2] = 2`.

If we free Particle B (ID 1, at packed index 1), swap-and-pop moves Particle C into index 1 and reduces `active_count` to 2. We update the table: `sparse_to_dense[2] = 1` (C is now at index 1).
But wait: during the swap, how do we know which ID was at the end, so that we can update its entry? We need a reverse mapping: another array called `dense_to_sparse`, where the index is the packed position and the value is the ID.
Before the free: `dense_to_sparse[0] = 0` (A), `dense_to_sparse[1] = 1` (B), `dense_to_sparse[2] = 2` (C).
After the swap: move C’s data to index 1, then set `dense_to_sparse[1] = 2` and update `sparse_to_dense[2] = 1`.
This two-way mapping ensures we can quickly find and update locations without slow searches. It’s like a phone book (`sparse_to_dense`, for ID to address) and a reverse directory (`dense_to_sparse`, for address to ID).
This indirection adds a small cost (array lookups) but allows handles to remain valid even as indices change.
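In code, the whole lookup is just one extra array index. A minimal sketch, using the array names from the explanation above (illustrative only; the final design in Step 3 merges these tables with generations):

```cpp
// Look up a particle by its stable ID via the indirection table.
// WARNING: without a validity check, a reused ID will silently
// return the wrong particle - Step 2 fixes exactly this.
Particle* GetById(ParticlePool* pool, uint32_t id) {
    uint32_t dense_index = pool->sparse_to_dense[id]; // ID -> packed slot
    return &pool->particle_data[dense_index];         // Current address
}
```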
Step 2: The Stale Handle Problem - Using Generations
With just IDs and tables, there’s still a risk. Suppose we free particle A (ID 0), making ID 0 available. Then we allocate a new particle C, which reuses ID 0. An old handle with ID 0 now incorrectly points to C instead of the original A!
This is the “stale handle” issue: reused IDs can fool old handles.
To solve this, we add a generation count to each ID, stored directly in the sparse array alongside the dense index. Think of it as a version number for each slot.
- When we free a slot (sparse index 0, current generation 0), we increment its generation to 1 in the sparse array.
- A handle stores both the sparse index and the generation at creation: `{index: 0, generation: 0}`.
- When the slot is reused for a new particle C, it keeps the updated generation 1, so handles for C are `{index: 0, generation: 1}`.
- To use a handle, check whether its generation matches the current generation in the sparse array for that index. If not (e.g., the old handle has generation 0, but the sparse array now has generation 1), it’s stale and invalid.
This prevents reuse bugs. For example, the old handle for A fails the check, while new ones for C succeed.
Step 3: The Final Implementation
Putting it all together, this is the complete solution, often called a Slot Map. It gives us the best of all worlds: cache-perfect iteration and completely safe, stable external references… for problems that need both. If your system is simpler, earlier designs might be lighter and sufficient.
Here’s a simplified overview:
// Handle stores sparse index and generation
struct ParticleHandle {
    uint32_t sparse_index;
    uint32_t generation;
};

// Sparse entry holds current dense index and generation
struct SparseEntry {
    uint32_t dense_index; // Current position in packed array
    uint32_t generation;  // Version number
};

struct ParticlePool {
    // ... packed particle_data array, active_count, capacity, etc.
    uint32_t* dense_to_sparse; // Packed index -> sparse index
    SparseEntry* sparse;       // Sparse index -> {dense_index, generation}

    ParticleHandle Alloc() {
        // Find a free sparse index (e.g., next available or from a free list)
        uint32_t sparse_id = get_free_sparse_id();
        // Place in packed array
        uint32_t dense_id = active_count++;
        sparse[sparse_id].dense_index = dense_id;
        dense_to_sparse[dense_id] = sparse_id;
        // Use the slot's current generation (it was incremented on free)
        uint32_t gen = sparse[sparse_id].generation;
        // Initialize particle data at dense_id
        // ...
        return {sparse_id, gen};
    }

    void Free(ParticleHandle handle) {
        if (!IsValid(handle)) return;
        uint32_t dense_id = sparse[handle.sparse_index].dense_index;
        uint32_t last_dense = --active_count;
        // Swap-and-pop: copy the last particle's data into the freed spot
        particle_data[dense_id] = particle_data[last_dense];
        // Update mappings for the moved particle
        uint32_t moved_sparse = dense_to_sparse[last_dense];
        sparse[moved_sparse].dense_index = dense_id;
        dense_to_sparse[dense_id] = moved_sparse;
        // Invalidate: increment generation and mark as free
        // (the sparse index would also be returned to the free-ID list here)
        sparse[handle.sparse_index].generation++;
        sparse[handle.sparse_index].dense_index = INVALID; // e.g., UINT32_MAX
    }

    bool IsValid(ParticleHandle handle) {
        if (handle.sparse_index >= capacity) return false;
        return sparse[handle.sparse_index].generation == handle.generation &&
               sparse[handle.sparse_index].dense_index != INVALID;
    }

    Particle* Get(ParticleHandle handle) {
        if (!IsValid(handle)) return nullptr;
        uint32_t dense_id = sparse[handle.sparse_index].dense_index;
        return &particle_data[dense_id];
    }
};
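To see why this is safe, here is a short usage sketch of the handle lifecycle, assuming the pool above (the `assert`s are purely for illustration):

```cpp
#include <cassert>

void Example(ParticlePool* pool) {
    ParticleHandle h = pool->Alloc();   // Handle captures {index, generation}

    Particle* p = pool->Get(h);         // Generations match: valid pointer
    // ... use p this frame; re-Get() next frame instead of caching it ...

    pool->Free(h);                      // Slot's generation is incremented

    assert(pool->Get(h) == nullptr);    // Stale handle fails the check
    assert(!pool->IsValid(h));          // Even if the slot gets reused later
}
```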
The Performance Showdown: By the Numbers
Now that we’ve built our arsenal of pool designs, let’s put them to the test. We’ll simulate a particle system under stress, measuring three key phases:
- Alloc/Free Time: The cost of initial allocations and deallocations.
- Churn Time: Simulating dynamic gameplay by replacing 500 particles over 1,000 “frames” (frees followed by allocs); this tests real-world object turnover. (Note: times here are purely the cumulative cost of alloc/free calls, excluding setup like shuffling.)
- Iteration Time: Updating all active particles a set number of times, mimicking per-frame logic.
These benchmarks were run with varying total allocations, free rates, and iteration counts to expose each design’s strengths and weaknesses. We’ll walk through each test set, highlighting what the numbers reveal about trade-offs like cache efficiency, allocation speed, and iteration bottlenecks.
All tests ran on an M4 Pro 12-core MacBook.
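For reference, the churn phase of each benchmark boils down to a loop like this minimal sketch. The timing scaffolding and the `FreeOne()`/`AllocOne()` wrappers are hypothetical stand-ins for each design’s actual free/alloc calls, not the exact harness used:

```cpp
#include <chrono>

// Churn phase: each "frame" frees churn_per_frame particles and then
// allocates the same number back, accumulating only the alloc/free cost.
template <typename PoolT>
double MeasureChurnMs(PoolT& pool, int frames, int churn_per_frame) {
    auto start = std::chrono::steady_clock::now();
    for (int frame = 0; frame < frames; ++frame) {
        for (int i = 0; i < churn_per_frame; ++i) pool.FreeOne();
        for (int i = 0; i < churn_per_frame; ++i) pool.AllocOne();
    }
    std::chrono::duration<double, std::milli> ms =
        std::chrono::steady_clock::now() - start;
    return ms.count();  // Total milliseconds across all frames
}
```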
Set 1: High Churn (95% Free) - 10,000 Allocs, 9,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
This set models a sparse system with lots of frees (e.g., short-lived particles exploding on impact).
Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
---|---|---|---|
Naive Pool | 12.99 | 35.91 | 2.79 |
Free List Pool | 0.17 | 0.44 | 5.48 |
Packed Array | 0.27 | 0.39 | 0.31 |
Slot Map | 0.15 | 1.19 | 0.31 |
Insights:
- The Naive Pool struggles with alloc/free (12.99 ms) due to its O(N) scan for free slots. It’s repeatedly combing through nearly 10,000 elements.
- Yet, its iteration shines (2.79 ms, faster than Free List) thanks to front-filling, which clusters actives at the array’s start for better cache hits despite the `if` branches.
- Free List dominates churn (0.44 ms total for 500,000 operations) with pure O(1) efficiency, but iteration lags (5.48 ms) from scattered actives causing cache misses.
- Packed Array’s churn is equally impressive (0.39 ms), showing swap-and-pop copies are negligible for small structs.
- Slot Map adds a small churn overhead (1.19 ms) from indirection, but matches Packed’s blazing iteration (0.31 ms) while providing safe handles.
Set 2: Low Churn (5% Free) - 10,000 Allocs, 500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
Here, we have a dense system with few frees (e.g., long-lived entities like NPCs).
Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
---|---|---|---|
Naive Pool | 12.94 | 613.37 | 6.51 |
Free List Pool | 0.13 | 0.51 | 7.63 |
Packed Array | 0.15 | 0.80 | 6.92 |
Slot Map | 0.15 | 1.69 | 7.00 |
Insights:
- Naive’s churn balloons to 613 ms in this dense setup: each alloc must scan almost the full array to find rare free slots, making it painfully inefficient.
- Its iteration holds up well (6.51 ms) due to clustering, though with 9,500 actives the `if` checks are less of a bottleneck anyway.
- Free List and Packed Array breeze through churn (under 1 ms total), as frees quickly create reusable slots without heavy costs.
- Slot Map’s churn is a tad higher (1.69 ms) from handle and generation management.
- Iteration sees packed designs pull ahead slightly, benefiting from high active density, while Free List trails due to minor scattering from churn.
Set 3: High Alloc/Free Cost - 100,000 Allocs, 99,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
Scaling up to test raw alloc/free scalability (e.g., massive simulations).
Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
---|---|---|---|
Naive Pool | 1244.46 | 34.84 | 25.13 |
Free List Pool | 1.41 | 0.46 | 32.69 |
Packed Array | 2.73 | 0.39 | 0.31 |
Slot Map | 1.62 | 1.19 | 0.31 |
Insights:
- Naive buckles under the scale (1,244 ms alloc/free) with its O(N²)-esque scanning, completely unusable for massive pools.
- Free List and Slot Map scale effortlessly with O(1) operations, keeping times low.
- Packed’s initial frees take a hit (2.73 ms) from numerous swaps, but its churn stays minimal (0.39 ms).
- Iteration amplifies the packed advantage (0.31 ms vs. 25-33 ms), as they process only 500 actives while others slog through 100,000 slots.
- With abundant free slots post-initial frees, churn remains tiny for all, but Free List and Packed edge out with sub-0.5 ms totals.
Set 4: High Iteration Cost - 10,000 Allocs, 9,500 Frees, 100,000 Iterations, 500 Churn/1,000 Frames
Cranking iterations to extremes (e.g., stress-testing update loops).
Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
---|---|---|---|
Naive Pool | 12.78 | 35.44 | 273.18 |
Free List Pool | 0.15 | 0.43 | 542.18 |
Packed Array | 0.26 | 0.38 | 30.79 |
Slot Map | 0.15 | 1.17 | 30.62 |
Insights:
- Iteration dominates, magnifying gaps: Packed and Slot Map wrap up in ~30 ms (handling 500 actives 100,000 times), while Naive (273 ms) and Free List (542 ms) burn time on empty slots.
- Naive’s clustering pays off again, outpacing Free List in iteration: its naive strategy accidentally optimizes for density.
- Churn aligns with other sets (all under 2 ms except Naive’s 35 ms), underscoring that for iteration-heavy workloads, packed designs deliver 9-18x speedups.
- Free List’s churn is stellar (0.43 ms), but if looping is your hot path, the cache costs make it lag.
What We’ve Learned from the Numbers
The big lesson? There’s no universal “best”: it hinges on your priorities. High churn? Lean toward Free List for raw efficiency. Iteration-heavy? Packed or Slot Map will transform your frame times. Need safety too? Slot Map reigns supreme as the balanced powerhouse. Always benchmark in your real app; these tests are a starting point, but your data patterns might flip the script!
Conclusion: Mastering Memory
We have journeyed from the raw, untamed power of the arena to the precision of the specialist’s toolkit. In doing so, we have treated memory not as a commodity to be consumed, but as a medium to be sculpted. The free list, the packed array, and the slot map are not merely clever algorithms; they are expressions of a fundamental principle: the structure of your data and the structure of your problem must be in harmony.
This is the core of data-oriented design. It demands that we look past the convenient abstractions of general-purpose allocators and engage directly with the realities of the hardware: with cache lines, pointer indirection, and the relentless cost of a branch misprediction. The resulting systems are not just faster; their performance is more predictable, more scalable, and more robust, because their efficiency is an intrinsic property of their architecture, not an optimization applied as an afterthought. The true magic lies not in any single trick, but in the mastery of these underlying mechanics, allowing us to build solutions that are perfectly tailored to the task at hand.