The End of the Line
A common performance problem lives in the heart of systems that stream data: audio engines, network stacks, logging systems, and inter-thread message queues. It’s the problem of the circular buffer, also known as a ring buffer. And its problem is the “end of the line.”
The Classic Ring Buffer and its Two-Copy Problem
A ring buffer is a brilliant data structure for producer-consumer scenarios. One thread, the producer, writes data into the buffer, and another thread, the consumer, reads it. It’s a fixed-size block of memory that cleverly pretends to be infinite by wrapping around on itself.
This works beautifully until the `head` pointer reaches the end of the allocated memory block. At that point, it must wrap back around to the beginning.
This wrap-around requirement complicates what should be a simple write operation. If a write crosses the buffer's boundary, it must be split: the programmer calculates the size of the chunk that fits before the end of the buffer, performs one `memcpy`, then calculates the size of the remaining data and performs a second `memcpy` to the start of the buffer.
```c
void WriteToBuffer(RingBuffer* rb, const void* data, size_t size) {
    // ... space check omitted ...
    size_t current_head = rb->head;
    if (current_head + size <= rb->capacity) {
        // Case 1: The write fits in one contiguous block.
        memcpy(rb->buffer + current_head, data, size);
    } else {
        // Case 2: The write wraps around. It must be split.
        size_t first_chunk_size = rb->capacity - current_head;
        memcpy(rb->buffer + current_head, data, first_chunk_size);
        size_t second_chunk_size = size - first_chunk_size;
        memcpy(rb->buffer, (const char*)data + first_chunk_size, second_chunk_size);
    }
    // Finally, update the head pointer, wrapping it around with modulo.
    rb->head = (current_head + size) % rb->capacity;
}
```
This conditional logic is a performance killer. The `if` statement is a prime candidate for branch misprediction, which stalls the CPU. The API is also messy: you can no longer simply hand out a pointer to a contiguous block of memory to write into; the logic must handle the possibility of two disjoint blocks for a single logical write.
What if we could eliminate the cliff? What if we could create a buffer that was, for all intents and purposes, truly linear?
The Virtual Memory Trick: A Mirrored Reality
The solution lies in using our deep knowledge of virtual memory to create a new reality. The trick is elegant:
We will ask the OS to map the same block of physical RAM to two contiguous regions of virtual address space.
The result is a virtual buffer where the second half is a perfect, byte-for-byte mirror of the first half. Writing to `buffer[0]` also writes to `buffer[capacity]`. They are the same memory.
This technique is most practical in 64-bit applications where virtual address space is plentiful. The alignment requirements for the buffer size are specific to each operating system, as we’ll see below.
Seeing the Magic in Action
With this buffer, the two-copy problem vanishes. We can perform a single `memcpy` that seamlessly wraps around the buffer boundary, with no special logic.
The following code demonstrates this. We write a string that is intentionally placed to cross the boundary of the physical memory.
```c
const size_t requested_size = 65536;
char* buffer = CreateMirroredRingBuffer(requested_size);

const char message[] = "HELLO!";
size_t message_len = sizeof(message) - 1;
size_t start_index = requested_size - 3;  // Position the write to force a wrap-around.

// This single memcpy will cross the boundary.
memcpy(&buffer[start_index], message, message_len);

// We can now read the entire, contiguous message by pointing
// to the start of the write in our "virtual" buffer.
printf("Contiguous message after wrap-around: %.*s\n",
       (int)message_len, &buffer[start_index]);
```
The output proves that the single `memcpy` worked as if the buffer were truly linear. The memory region starting at `start_index` now contains the full, unbroken message, with the wrap-around handled transparently by the hardware.

```
Contiguous message after wrap-around: HELLO!
```
Now that we’ve seen what this buffer can do, let’s look at how to build it on different platforms.
Building the Mirrored Buffer on Windows
The modern Windows implementation revolves around `VirtualAlloc2` and `MapViewOfFile3`, which allow us to create and map into memory “placeholders” in a race-free way. The process is more nuanced than it first appears.
Step 1: Align Size to Granularity
On Windows, the size of our buffer must be a multiple of the system’s allocation granularity (typically 64KB). We get this value from `GetSystemInfo` and round our requested size up to the next valid multiple.
```c
SYSTEM_INFO sysInfo;
GetSystemInfo(&sysInfo);
const size_t granularity = sysInfo.dwAllocationGranularity;

// Round the requested size up to the nearest multiple of granularity.
size_t aligned_size = (size + granularity - 1) & ~(granularity - 1);
```
Step 2: Create a Page-File-Backed Memory Section
We use `CreateFileMapping` to ask the OS for a chunk of committable memory. Because this is an older Win32 API, we must manually split our 64-bit `aligned_size` into high and low 32-bit parts for the function arguments.
```c
// Split the 64-bit size for the CreateFileMapping function.
DWORD size_high = (DWORD)(aligned_size >> 32);
DWORD size_low = (DWORD)(aligned_size & 0xFFFFFFFF);

HANDLE mapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                   PAGE_READWRITE, size_high, size_low, NULL);
```
Step 3: Reserve a Combined Virtual Address Placeholder
We reserve a single, contiguous virtual address range that is twice the size of our aligned buffer. This is the canvas we will work on.
```c
void* placeholder = VirtualAlloc2(GetCurrentProcess(), NULL, 2 * aligned_size,
                                  MEM_RESERVE | MEM_RESERVE_PLACEHOLDER,
                                  PAGE_NOACCESS, NULL, 0);
```
Step 4: Split the Placeholder
This is a crucial and non-obvious step. `MapViewOfFile3` requires the placeholder it replaces to be exactly the same size as the view being mapped. We cannot map a `size`-byte view into a `2*size`-byte placeholder.
To solve this, we perform a split. By calling `VirtualFree` on the second half of the reservation with the special `MEM_PRESERVE_PLACEHOLDER` flag, we instruct the OS to redefine the single large placeholder as two smaller, independent placeholders. This single call is all that’s needed to prepare our address space.
```c
// Split the 2*size placeholder into two separate 'size' placeholders.
VirtualFree((char*)placeholder + aligned_size, aligned_size,
            MEM_RELEASE | MEM_PRESERVE_PLACEHOLDER);
```
Step 5: Map the Memory Section into Each Placeholder
Now that we have two correctly sized placeholders, we can map our memory section into each one using `MapViewOfFile3` with the `MEM_REPLACE_PLACEHOLDER` flag. This atomically consumes the placeholder and replaces it with our mapping.
```c
// Map the first half into the first placeholder.
void* view1 = MapViewOfFile3(mapping, GetCurrentProcess(), placeholder,
                             0, aligned_size, MEM_REPLACE_PLACEHOLDER,
                             PAGE_READWRITE, NULL, 0);

// Map the second half into the second placeholder.
void* view2 = MapViewOfFile3(mapping, GetCurrentProcess(),
                             (char*)placeholder + aligned_size, 0, aligned_size,
                             MEM_REPLACE_PLACEHOLDER,
                             PAGE_READWRITE, NULL, 0);
```
If both calls succeed, we have our mirrored buffer. The `mapping` handle can now be closed, as the views will keep the memory alive.
A Note on Older Windows Versions: `VirtualAlloc2` and `MapViewOfFile3` are not available on older systems. On those platforms, you must use a fallback method that involves reserving a `2*size` block with `VirtualAlloc`, immediately freeing it with `VirtualFree`, and then attempting to map both halves into the now-vacant address range with `MapViewOfFileEx`. This older method has an inherent race condition: another thread could allocate memory in that address range between the `VirtualFree` and `MapViewOfFileEx` calls. Robust code for older systems must put this logic in a retry loop.
Building the Mirrored Buffer on POSIX (Linux/macOS)
The POSIX implementation uses `mmap` for everything, but first requires a sharable file descriptor.
Step 1: Round Up to Page Size
On POSIX systems, the `mmap` function requires that the size of a mapping be a multiple of the system’s page size (typically 4KB). We get this value from `sysconf` and round our requested size up to the next valid multiple.
```cpp
long page_size = sysconf(_SC_PAGESIZE);

// Round the requested size up to the nearest multiple of the page size.
size_t aligned_size = (size + page_size - 1) & ~((size_t)page_size - 1);
```
Step 2: Get a Sharable File Descriptor
We need a file descriptor (`fd`) that represents the physical memory we want to map. The best way to get one is with `memfd_create` on Linux, as it creates a truly anonymous in-memory file that won’t conflict with anything on the filesystem. For portability with systems like macOS, we must fall back to `shm_open`.
`shm_open` requires a unique name. We will generate one using the process ID and a counter, and retry if a name collision occurs.
```cpp
int fd = -1;
#if defined(__linux__)
// memfd_create is preferred: no name collisions and no filesystem presence.
fd = syscall(SYS_memfd_create, "ring-buffer", MFD_CLOEXEC);
#endif
if (fd == -1) {
    // shm_open is the POSIX fallback for macOS and older Linux.
    char path[256];
    int retries = 100;
    do {
        // Generate a unique name.
        snprintf(path, sizeof(path), "/ring-buffer-%d-%d", getpid(), retries);
        fd = shm_open(path, O_RDWR | O_CREAT | O_EXCL | O_CLOEXEC, 0600);
        retries--;
    } while (fd < 0 && errno == EEXIST && retries > 0);
    if (fd < 0) return nullptr;
    // Immediately unlink the path. The memory object will persist
    // until the last fd is closed, ensuring automatic cleanup.
    shm_unlink(path);
}
// Set the size of the memory object.
ftruncate(fd, (off_t)aligned_size);
```
Step 3: Reserve a Contiguous Virtual Address Range
The core of the trick is getting two contiguous virtual memory blocks. The safest way to do this is to ask the OS to reserve a single, larger block that is twice the size of our desired buffer. We use `mmap` with `PROT_NONE` to reserve the address range without allocating any physical memory for it yet. This prevents other threads from stealing our address space before we can map into it.
```cpp
void* placeholder = mmap(nullptr, 2 * aligned_size, PROT_NONE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
```
Step 4: Map the File Descriptor Twice
Finally, we use `mmap` again to map our file descriptor from Step 2 into the two halves of the placeholder we reserved in Step 3. The `MAP_FIXED` flag tells `mmap` to use the exact address we provide. It is crucial to also use `MAP_SHARED`, which ensures that writes to one mapping are visible in the other.
```cpp
// Map the first half.
void* view1 = mmap(placeholder, aligned_size, PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_SHARED, fd, 0);

// Map the second half.
void* view2 = mmap((char*)placeholder + aligned_size, aligned_size,
                   PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_SHARED, fd, 0);
```
If both calls succeed, the file descriptor `fd` can be safely closed. The kernel maintains a reference to the underlying memory object for as long as the mappings exist.
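Putting the POSIX steps together, a minimal Linux-only sketch might look like this. It uses `memfd_create` directly and skips the `shm_open` fallback for brevity, and its error handling is deliberately simple.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

// Returns a pointer to 'size' usable bytes (rounded up to the page size),
// backed by 2*size bytes of virtual address space where the second half
// mirrors the first. Returns NULL on failure.
static char* CreateMirroredRingBuffer(size_t size) {
    long page = sysconf(_SC_PAGESIZE);
    size_t aligned = (size + page - 1) & ~((size_t)page - 1);

    int fd = memfd_create("ring-buffer", MFD_CLOEXEC);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)aligned) != 0) { close(fd); return NULL; }

    // Reserve 2*aligned bytes of address space, then map the fd twice.
    char* base = (char*)mmap(NULL, 2 * aligned, PROT_NONE,
                             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (base == MAP_FAILED) { close(fd); return NULL; }

    if (mmap(base, aligned, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_SHARED, fd, 0) == MAP_FAILED ||
        mmap(base + aligned, aligned, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_SHARED, fd, 0) == MAP_FAILED) {
        munmap(base, 2 * aligned);
        close(fd);
        return NULL;
    }
    close(fd);  // The mappings keep the memory object alive.
    return base;
}
```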
A Note on POSIX Gotchas:
- `shm_open` Naming: The `shm_open` fallback uses filesystem-like names that can collide between processes. Our retry loop with a unique name helps prevent this, but production code might use an even more robust random name generation strategy.
- `MAP_SHARED` is Mandatory: The final two `mmap` calls must use `MAP_SHARED` to ensure writes to one view are visible in the other, creating the mirror. Using `MAP_PRIVATE` would break the illusion due to copy-on-write semantics.
Taking it Further
This virtual memory trick is a powerful building block for a variety of high-performance systems. With the core problem of wrap-around logic eliminated, you can focus on other challenges. Here are a few things you could build:
- Lock-Free SPSC/MPSC Queues: This is the perfect foundation for single-producer or multi-producer queues, where the producer(s) can write data of any size without complex logic or locking.
- Real-Time Audio Buffers: In audio processing, you often need to apply a DSP filter to a sliding window of sample data. A mirrored buffer allows you to get a single pointer to this window, even if it spans the “end” of the buffer, avoiding branches in your tight audio callback loop.
- Network Packet Assembly: When receiving network data, you can write incoming fragments directly into the buffer. A single `send` or processing call can then operate on a complete, contiguous packet once all its bytes have arrived, without any intermediate copying.
- Game Engine Command Buffers: A renderer’s command buffer can be implemented as a ring buffer. With this trick, the CPU can generate rendering commands in a tight, branch-free loop, writing them into the buffer for the GPU to consume.
Conclusion: From Trick to Tool
With this technique, we have transformed our ring buffer. The code that uses it becomes dramatically simpler. A `Write` operation is no longer a conditional dance of two `memcpy` calls; it’s a single, clean, branch-free operation.
Of course, there is a trade-off. We are trading virtual address space and a few extra TLB entries for a branch-free hot path. For many streaming applications, where the buffer write/read is the most frequent operation, this is a fantastic bargain.
This is the essence of memory mastery. We looked at a common problem, the wrap-around check, and refused to accept it as a necessary evil. By understanding the abstraction layer between our program and the physical hardware, we were able to manipulate the mapping to create a data structure that is perfectly suited to our problem.