The End of the Line
A common performance problem lives in the heart of systems that stream data: audio engines, network stacks, logging systems, and inter-thread message queues. It’s the problem of the circular buffer, also known as a ring buffer. And its problem is the “end of the line.”
The Classic Ring Buffer and its Two-Copy Problem
A ring buffer is a brilliant data structure for producer-consumer scenarios. One thread, the producer, writes data into the buffer, and another thread, the consumer, reads it. It’s a fixed-size block of memory that cleverly pretends to be infinite by wrapping around on itself.
This works beautifully until the `head` pointer reaches the end of the allocated memory block. At that point, it must wrap back around to the beginning.
This wrap-around requirement complicates what should be a simple write operation. If a write crosses the buffer's boundary, it must be split: the programmer calculates the size of the chunk that fits before the end of the buffer, performs one `memcpy`, then calculates the size of the remaining data and performs a second `memcpy` to the start of the buffer.
```c
void WriteToBuffer(RingBuffer* rb, const void* data, size_t size) {
    // ... space check omitted ...
    size_t current_head = rb->head;
    if (current_head + size <= rb->capacity) {
        // Case 1: The write fits in one contiguous block.
        memcpy(rb->buffer + current_head, data, size);
    } else {
        // Case 2: The write wraps around. It must be split.
        size_t first_chunk_size = rb->capacity - current_head;
        memcpy(rb->buffer + current_head, data, first_chunk_size);
        size_t second_chunk_size = size - first_chunk_size;
        memcpy(rb->buffer, (const char*)data + first_chunk_size, second_chunk_size);
    }
    // Finally, update the head pointer, wrapping it around with modulo.
    rb->head = (current_head + size) % rb->capacity;
}
```
This conditional logic is a performance killer. The `if` statement is a prime candidate for branch misprediction, which stalls the CPU. The API is also messy: you can no longer simply hand out a pointer to a contiguous block of memory to write into; the logic must handle the possibility of two disjoint blocks for a single logical write.
What if we could eliminate the cliff? What if we could create a buffer that was, for all intents and purposes, truly linear?
The Virtual Memory Trick: A Mirrored Reality
The solution lies in using our deep knowledge of virtual memory to create a new reality. The trick is elegant:
We will ask the OS to map the same block of physical RAM to two contiguous regions of virtual address space.
The result is a virtual buffer where the second half is a perfect, byte-for-byte mirror of the first half. Writing to `buffer[0]` also writes to `buffer[capacity]`. They are the same memory.
This technique is most practical in 64-bit applications where virtual address space is plentiful. The alignment requirements for the buffer size are specific to each operating system, as we’ll see below.
Seeing the Magic in Action
With this buffer, the two-copy problem vanishes. We can perform a single `memcpy` that seamlessly wraps around the buffer boundary, with no special logic.
The following code demonstrates this. We write a string that is intentionally placed to cross the boundary of the physical memory.
```c
const size_t requested_size = 65536;
char* buffer = CreateMirroredRingBuffer(requested_size);

const char message[] = "HELLO!";
size_t message_len = sizeof(message) - 1;
size_t start_index = requested_size - 3;  // Position the write to force a wrap-around.

// This single memcpy will cross the boundary.
memcpy(&buffer[start_index], message, message_len);

// We can now read the entire, contiguous message by pointing
// to the start of the write in our "virtual" buffer.
printf("Contiguous message after wrap-around: %.*s\n",
       (int)message_len, &buffer[start_index]);
```
The output proves that the single `memcpy` worked as if the buffer were truly linear. The memory region starting at `start_index` now contains the full, unbroken message, with the wrap-around handled transparently by the hardware.

```
Contiguous message after wrap-around: HELLO!
```
Now that we’ve seen what this buffer can do, let’s look at how to build it on different platforms.
Building the Mirrored Buffer on Windows
The modern Windows implementation revolves around `VirtualAlloc2` and `MapViewOfFile3`, which allow us to create and map into memory “placeholders” in a race-free way. The process is more nuanced than it first appears.
Step 1: Align Size to Granularity
On Windows, the size of our buffer must be a multiple of the system’s allocation granularity (typically 64KB). We get this value from `GetSystemInfo` and round our requested size up to the next valid multiple.
```c
SYSTEM_INFO sysInfo;
GetSystemInfo(&sysInfo);
const size_t granularity = sysInfo.dwAllocationGranularity;

// Round the requested size up to the nearest multiple of granularity.
size_t aligned_size = (size + granularity - 1) & ~(granularity - 1);
```
Step 2: Create a Page-File-Backed Memory Section
We use `CreateFileMapping` to ask the OS for a chunk of committable memory. Because this is an older Win32 API, we must manually split our 64-bit `aligned_size` into high and low 32-bit parts for the function arguments.
```c
// Split the 64-bit size for the CreateFileMapping function.
DWORD size_high = (DWORD)(aligned_size >> 32);
DWORD size_low = (DWORD)(aligned_size & 0xFFFFFFFF);

HANDLE mapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                   PAGE_READWRITE, size_high, size_low, NULL);
```
Step 3: Reserve a Combined Virtual Address Placeholder
We reserve a single, contiguous virtual address range that is twice the size of our aligned buffer. This is the canvas we will work on.
```c
void* placeholder = VirtualAlloc2(GetCurrentProcess(), NULL, 2 * aligned_size,
                                  MEM_RESERVE | MEM_RESERVE_PLACEHOLDER,
                                  PAGE_NOACCESS, NULL, 0);
```
Step 4: Split the Placeholder
This is a crucial and non-obvious step. `MapViewOfFile3` requires the placeholder it replaces to be exactly the same size as the view being mapped. We cannot map a `size`-byte view into a `2*size`-byte placeholder.
To solve this, we perform a split. By calling `VirtualFree` on the second half of the reservation with the special `MEM_PRESERVE_PLACEHOLDER` flag, we instruct the OS to redefine the single large placeholder as two smaller, independent placeholders. This single call is all that’s needed to prepare our address space.
```c
// Split the 2*size placeholder into two separate 'size' placeholders.
VirtualFree((char*)placeholder + aligned_size, aligned_size,
            MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER);
```
Step 5: Map the Memory Section into Each Placeholder
Now that we have two correctly sized placeholders, we can map our memory section into each one using `MapViewOfFile3` with the `MEM_REPLACE_PLACEHOLDER` flag. This atomically consumes the placeholder and replaces it with our mapping.
```c
// Map the first half into the first placeholder.
void* view1 = MapViewOfFile3(mapping, GetCurrentProcess(), placeholder,
                             0, aligned_size, MEM_REPLACE_PLACEHOLDER,
                             PAGE_READWRITE, NULL, 0);

// Map the second half into the second placeholder.
void* view2 = MapViewOfFile3(mapping, GetCurrentProcess(),
                             (char*)placeholder + aligned_size, 0, aligned_size,
                             MEM_REPLACE_PLACEHOLDER,
                             PAGE_READWRITE, NULL, 0);
```
If both calls succeed, we have our mirrored buffer. The `mapping` handle can now be closed, as the views will keep the memory alive.
A Note on Older Windows Versions: `VirtualAlloc2` and `MapViewOfFile3` are not available on older systems. On those platforms, you must use a fallback method that involves reserving a `2*size` block with `VirtualAlloc`, immediately freeing it with `VirtualFree`, and then attempting to map both halves into the now-vacant address range with `MapViewOfFileEx`. This older method has an inherent race condition: another thread could allocate memory in that address range between the `VirtualFree` and `MapViewOfFileEx` calls. Robust code for older systems must put this logic in a retry loop.
Building the Mirrored Buffer on POSIX (Linux/macOS)
The POSIX implementation uses `mmap` for everything, but first requires a sharable file descriptor.
Step 1: Round Up to Page Size
On POSIX systems, the `mmap` function requires that the size of a mapping be a multiple of the system’s page size (typically 4KB). We get this value from `sysconf` and round our requested size up to the next valid multiple.
```cpp
long page_size = sysconf(_SC_PAGESIZE);

// Round the requested size up to the nearest multiple of the page size.
size_t aligned_size = (size + page_size - 1) & ~((size_t)page_size - 1);
```
Step 2: Get a Sharable File Descriptor
We need a file descriptor (`fd`) that represents the physical memory we want to map. The best way to get one is with `memfd_create` on Linux, as it creates a truly anonymous in-memory file that won’t conflict with anything on the filesystem. For portability with systems like macOS, we must fall back to `shm_open`.
`shm_open` requires a unique name. We will generate one using the process ID and a counter, and retry if a name collision occurs.
```cpp
int fd = -1;
#if defined(__linux__)
// memfd_create is preferred: no name collisions and no filesystem presence.
fd = syscall(SYS_memfd_create, "ring-buffer", MFD_CLOEXEC);
#endif
if (fd == -1) {
    // shm_open is the POSIX fallback for macOS and older Linux.
    char path[256];
    int retries = 100;
    do {
        // Generate a unique name.
        snprintf(path, sizeof(path), "/ring-buffer-%d-%d", getpid(), retries);
        fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL | O_CLOEXEC, 0600);
        retries--;
    } while (fd < 0 && errno == EEXIST && retries > 0);
    if (fd < 0) return nullptr;
    // Immediately unlink the path. The memory object will persist
    // until the last fd is closed, ensuring automatic cleanup.
    shm_unlink(path);
}
// Set the size of the memory object.
ftruncate(fd, (off_t)aligned_size);
```
Step 3: Reserve a Contiguous Virtual Address Range
The core of the trick is getting two contiguous virtual memory blocks. The safest way to do this is to ask the OS to reserve a single, larger block that is twice the size of our desired buffer. We use `mmap` with `PROT_NONE` to reserve the address range without allocating any physical memory for it yet. This prevents other threads from stealing our address space before we can map into it.
```cpp
void* placeholder = mmap(nullptr, 2 * aligned_size, PROT_NONE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
```
Step 4: Map the File Descriptor Twice
Finally, we use `mmap` again to map our file descriptor from Step 2 into the two halves of the placeholder we reserved in Step 3. The `MAP_FIXED` flag tells `mmap` to use the exact address we provide. It is crucial to also use `MAP_SHARED`, which ensures that writes to one mapping are visible in the other.
```cpp
// Map the first half.
void* view1 = mmap(placeholder, aligned_size, PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_SHARED, fd, 0);

// Map the second half.
void* view2 = mmap((char*)placeholder + aligned_size, aligned_size,
                   PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_SHARED, fd, 0);
```
If both calls succeed, the file descriptor `fd` can be safely closed. The kernel maintains a reference to the underlying memory object for as long as the mappings exist.
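Putting the POSIX steps together, a minimal Linux-only sketch might look like this. It uses `memfd_create` directly and skips the `shm_open` fallback for brevity, and its error handling is deliberately simple.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

// Returns a pointer to 'size' usable bytes (rounded up to the page size),
// backed by 2*size bytes of virtual address space where the second half
// mirrors the first. Returns NULL on failure.
static char* CreateMirroredRingBuffer(size_t size) {
    long page = sysconf(_SC_PAGESIZE);
    size_t aligned = (size + page - 1) & ~((size_t)page - 1);

    int fd = memfd_create("ring-buffer", MFD_CLOEXEC);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)aligned) != 0) { close(fd); return NULL; }

    // Reserve 2*aligned bytes of address space, then map the fd twice.
    char* base = (char*)mmap(NULL, 2 * aligned, PROT_NONE,
                             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (base == MAP_FAILED) { close(fd); return NULL; }

    if (mmap(base, aligned, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_SHARED, fd, 0) == MAP_FAILED ||
        mmap(base + aligned, aligned, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_SHARED, fd, 0) == MAP_FAILED) {
        munmap(base, 2 * aligned);
        close(fd);
        return NULL;
    }
    close(fd);  // The mappings keep the memory object alive.
    return base;
}
```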
A Note on POSIX Gotchas:
- `shm_open` Naming: The `shm_open` fallback uses filesystem-like names that can collide between processes. Our retry loop with a unique name helps prevent this, but production code might use an even more robust random name generation strategy.
- `MAP_SHARED` is Mandatory: The final two `mmap` calls must use `MAP_SHARED` to ensure writes to one view are visible in the other, creating the mirror. Using `MAP_PRIVATE` would break the illusion due to copy-on-write semantics.
Taking it Further
This virtual memory trick is a powerful building block for a variety of high-performance systems. With the core problem of wrap-around logic eliminated, you can focus on other challenges. Here are a few things you could build:
- Lock-Free SPSC/MPSC Queues: This is the perfect foundation for single-producer or multi-producer queues, where the producer(s) can write data of any size without complex logic or locking.
- Real-Time Audio Buffers: In audio processing, you often need to apply a DSP filter to a sliding window of sample data. A mirrored buffer allows you to get a single pointer to this window, even if it spans the “end” of the buffer, avoiding branches in your tight audio callback loop.
- Network Packet Assembly: When receiving network data, you can write incoming fragments directly into the buffer. A single `send` or processing call can then operate on a complete, contiguous packet once all its bytes have arrived, without any intermediate copying.
- Game Engine Command Buffers: A renderer’s command buffer can be implemented as a ring buffer. With this trick, the CPU can generate rendering commands in a tight, branch-free loop, writing them into the buffer for the GPU to consume.
Conclusion: From Trick to Tool
With this technique, we have transformed our ring buffer. The code that uses it becomes dramatically simpler. A `Write` operation is no longer a conditional dance of two `memcpy` calls; it’s a single, clean, branch-free operation.
Of course, there is a trade-off. We are trading virtual address space and a few extra TLB entries for a branch-free hot path. For many streaming applications, where the buffer write/read is the most frequent operation, this is a fantastic bargain.
This is the essence of memory mastery. We looked at a common problem, the wrap-around check, and refused to accept it as a necessary evil. By understanding the abstraction layer between our program and the physical hardware, we were able to manipulate the mapping to create a data structure that is perfectly suited to our problem.