Making the Most of Your Memory with mmap

20 March 2019


Sometimes it seems that we have nearly infinite memory, especially compared to the tiny 48K of RAM in yesteryear’s 8-bit computers. But today’s complex applications can soak up megabytes before you know it. While it would be great if developers planned memory management for every application, thinking through a memory management strategy is crucial for applications with especially RAM-intensive features such as image and video processing, massive databases, and machine learning.

How do you plan a memory management strategy? It depends heavily on your application and its requirements, but a good start is to work with your operating system instead of against it. That’s where memory mapping comes in. mmap can improve both your application’s performance and its memory profile by letting you leverage the same virtual memory paging machinery that the OS itself relies on. Smart use of the memory mapping API (Qt, UNIX, Windows) allows you to transparently handle massive data sets, letting the OS page them out of memory as needed, and it’s much better than anything you’re likely to manage with a roll-your-own memory management scheme.
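As a rough illustration of the UNIX flavor of that API, here is a minimal POSIX sketch (not taken from QiTissue or the demo repository) that maps a whole file read-only and walks its bytes through a plain pointer. The function name `sumFileBytes` and its "-1 on error" convention are hypothetical; the calls it uses (`open`, `fstat`, `mmap`, `munmap`, `close`) are standard POSIX.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Hypothetical helper: sums all bytes of a file by mapping it read-only
// instead of read()-ing it into a heap buffer. Returns -1 on error.
long long sumFileBytes(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) { // mmap() rejects zero-length mappings
        close(fd);
        return 0;
    }

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED)
        return -1;

    // Pages are faulted in lazily as the loop touches them. They are never
    // written, so they stay clean and the kernel may discard them at any time.
    const auto* bytes = static_cast<const unsigned char*>(addr);
    long long sum = 0;
    for (std::size_t i = 0; i < size; ++i)
        sum += bytes[i];

    munmap(addr, size);
    return sum;
}
```

No read loop or heap buffer is involved: the file contents are reached through the page cache, and the OS decides how much of the file is resident at any moment.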

Here’s a real-life example of how we used mmap to optimize RAM use in QiTissue, a medical imaging application. It loads, merges, manipulates, and displays highly detailed microscope images that can be gigabytes in size. It needs to be efficient, or it risks running out of memory even on desktops loaded with RAM.

QiTissue highlighting tumor cell division on a digital microscopic image

The image above is stitched together from many individual microscope images, and the stitching algorithm needs access to many large bitmaps to do that, which makes it a memory-intensive process. Recording the application’s memory use while it runs shows that memory use grows as the application pulls in the images required for the stitching algorithm. The top purple line, Resident Set Size (RSS), is the amount of RAM that belongs to the application and physically resides in memory. That curve shows that memory use fluctuates but tops out at over 6GB.
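If you want to watch the RSS figure from inside a process rather than with an external profiler, one minimal option on Linux is to parse the `VmRSS:` line of `/proc/self/status`. This sketch is Linux-only (it does not work on Windows or macOS), and the helper name `readRssKb` is hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns this process's current Resident Set Size in
// kilobytes, or -1 if it cannot be determined. Linux-only: the "VmRSS:" line
// of /proc/self/status looks like "VmRSS:     12345 kB".
long readRssKb()
{
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) { // line starts with "VmRSS:"
            std::istringstream fields(line.substr(6));
            long kb = 0;
            if (fields >> kb)
                return kb;
            return -1; // line present but not parseable
        }
    }
    return -1;
}
```

Calling it periodically (for example, once per processed image) gives a coarse version of the RSS curve shown in the graphs.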

Memory use before mmap optimization

We used mmap extensively in our rewrite of QiTissue to improve its overall memory profile. In this case, we decompress the microscope images into files and then memory-map those files. Note that using memory-mapped files won’t eliminate the RAM needed to load, manipulate, or display the images. However, it does allow the OS to do what it does best: intelligently manage memory so that our application operates effectively with the RAM that’s available.
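The decompress-then-map step can be sketched roughly as follows. This is not QiTissue’s actual code: the `MappedPixels` struct and the `spillToMappedFile` and `releaseMappedPixels` helpers are hypothetical, the decoder that fills `pixels` is assumed to exist elsewhere, and only POSIX calls are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical read-only view of pixel data backed by a file.
struct MappedPixels {
    const std::uint8_t* data = nullptr;
    std::size_t size = 0;
};

// Hypothetical helper: writes already-decompressed pixels to `path`, maps the
// file read-only, and releases the heap copy. Returns an empty view on failure.
MappedPixels spillToMappedFile(std::vector<std::uint8_t>& pixels, const char* path)
{
    MappedPixels result;
    if (pixels.empty()) // mmap() rejects zero-length mappings
        return result;

    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return result;

    // write() may write less than requested, so loop until everything is out.
    std::size_t written = 0;
    while (written < pixels.size()) {
        ssize_t n = write(fd, pixels.data() + written, pixels.size() - written);
        if (n <= 0) {
            close(fd);
            return result;
        }
        written += static_cast<std::size_t>(n);
    }

    void* addr = mmap(nullptr, pixels.size(), PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED)
        return result;

    result.data = static_cast<const std::uint8_t*>(addr);
    result.size = pixels.size();

    // Swap with an empty vector so the heap capacity is actually released.
    std::vector<std::uint8_t>().swap(pixels);
    return result;
}

// Hypothetical counterpart: unmaps the view and resets it.
void releaseMappedPixels(MappedPixels& m)
{
    if (m.data)
        munmap(const_cast<std::uint8_t*>(m.data), m.size);
    m = MappedPixels{};
}
```

Once the heap vector is released, the pixel data lives in the kernel’s page cache. Because the mapping is read-only, the kernel can drop those pages under memory pressure and re-read them from the file later instead of writing them to swap. On POSIX systems you could also `unlink()` the file right after mapping it; the mapping stays valid, and the disk space is reclaimed once it is unmapped.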

Memory use after mmap optimization

The post-mmap optimization memory diagram looks similar, so what are the practical differences?

Heap consumption drops. For our stitching algorithm, the heap size drops from around 500MB to around 35MB, a small fraction of its original size. The memory hasn’t disappeared; it’s still being used and counts toward the application’s RSS total. However, because that memory isn’t part of the heap, it doesn’t come out of the OS’s virtual memory paging file: its pages are backed by the mapped files themselves. That’s important because there’s a cap on the total amount of memory that can be allocated. The paging file is usually sized at around 1.5 to 2 times the physical RAM, and this limit keeps the system from spending all its time swapping memory in and out from disk. So even with virtual memory, the total amount of memory a program can allocate is limited. By mmapping memory, you can intelligently exceed that cap if you need to.
Less dirty memory. Dirty memory is memory that has been written to and, as a result, no longer matches the copy that was paged in from disk. If a memory block is dirty, the OS has to write it back out to disk when it’s paged out, and that write introduces a huge performance hit. Why does our dirty memory drop? The heap is actually a living data structure that manages all the new/delete/malloc/free calls your application makes. As normal C++ code lives and breathes, the C++ malloc library must maintain its linked lists of memory blocks by updating heap structures. Once touched, that memory is dirty and must be written out to disk before it can be paged out. Moving our big data out of the heap and into mmapped files avoids dirty flushes of those large memory structures, saving a lot of unnecessary CPU cycles.
Better RSS profile. Moving the majority of application data out of the heap and into mmapped files also trims down the structures needed to maintain that data. The operating system’s raw memory paging, which mmap leverages, works in page-sized blocks (commonly 4096 bytes), but it’s very efficient at managing those blocks. That is reflected in the memory required once we convert to an mmap-based architecture. Not only do we need less resident RAM (5.8GB instead of 6.2GB), but our overall RAM consumption also peaks less and recovers faster when it does peak. That leaves more memory for all the other OS tasks and more RAM that can be used before the system has to start swapping memory to and from disk.
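To make the clean/dirty distinction concrete, the sketch below maps an existing, non-empty file read-write with `MAP_SHARED`, writes a single byte, and forces the resulting dirty page back to the file with `msync()`. The helper `patchFirstByte` is hypothetical. A read-only mapping never takes this path, whereas heap pages are anonymous memory: once written, the only place the kernel can evict them to is swap.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: maps `size` bytes of an existing, non-empty file
// read-write, overwrites its first byte, and synchronously writes the dirty
// page back to the file. Returns true on success.
bool patchFirstByte(const char* path, std::size_t size, unsigned char value)
{
    assert(size > 0);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return false;

    void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED)
        return false;

    // Before this store the page is clean: the kernel could simply discard it
    // and re-read it from the file. After the store it is dirty.
    static_cast<unsigned char*>(addr)[0] = value;

    // MS_SYNC blocks until the dirty page has been written back to the file.
    const bool ok = msync(addr, size, MS_SYNC) == 0;
    munmap(addr, size);
    return ok;
}
```

Every such write-back costs I/O, which is why keeping large data in read-only mappings, where pages never become dirty, avoids that work entirely.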

Best of all, incorporating mmap isn’t too difficult. Less memory, faster, and easy: what’s not to like? I’ve put together a small example on GitHub if you want to experiment with mmap yourself: https://github.com/milianw/mmap_demo/.
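In C++ the raw calls are easy to wrap so that a mapping cannot leak. Here is a minimal, move-only RAII sketch for read-only POSIX mappings; the `MappedFile` class is hypothetical and is not the code from the demo repository. Qt applications can get similar behavior from `QFileDevice::map()` and `QFileDevice::unmap()` instead.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <utility>

// Hypothetical move-only RAII wrapper around a read-only POSIX file mapping.
// The mapping is released automatically when the object goes out of scope.
class MappedFile {
public:
    explicit MappedFile(const char* path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return;
        struct stat st;
        // Empty files stay invalid: mmap() rejects zero-length mappings.
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            const auto len = static_cast<std::size_t>(st.st_size);
            void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (addr != MAP_FAILED) {
                data_ = static_cast<const unsigned char*>(addr);
                size_ = len;
            }
        }
        close(fd); // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() { reset(); }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    MappedFile(MappedFile&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}

    MappedFile& operator=(MappedFile&& other) noexcept
    {
        if (this != &other) {
            reset();
            data_ = std::exchange(other.data_, nullptr);
            size_ = std::exchange(other.size_, 0);
        }
        return *this;
    }

    bool isValid() const { return data_ != nullptr; }
    std::size_t size() const { return size_; }

    unsigned char operator[](std::size_t i) const
    {
        assert(i < size_);
        return data_[i];
    }

private:
    void reset()
    {
        if (data_)
            munmap(const_cast<unsigned char*>(data_), size_);
        data_ = nullptr;
        size_ = 0;
    }

    const unsigned char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

Closing the descriptor inside the constructor is deliberate: POSIX keeps the mapping alive after `close()`, so the object only has to track the address and length. Moving transfers ownership, and the moved-from object reports `isValid() == false`.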

Tags: c++, performance, tools

2 Comments

Hi,

Interesting idea, thanks. Actually, memory mapping can be used for any data that already exists before the application starts.

But the diagrams show a small performance degradation in the mmap case: image loading finished before the 60-second mark with the heap, but clearly after it with mmap.

I guess it's because of the additional I/O for writing the decompressed files to disk. It would be interesting to compare heap and mmap performance with uncompressed files.

The performance degradation comes from the pmap sampling. If you compare https://github.com/milianw/mmap_demo/blob/master/mmap.txt with https://github.com/milianw/mmap_demo/blob/master/no_mmap.txt, you'll see that using mmap is significantly faster for this toy example. But when you compare these timings to the ones in the graphs, e.g. https://github.com/milianw/mmap_demo/blob/master/no_mmap.png, you'll notice that they are very different. So don't compare the time values in the graphs; concentrate only on the Y-axis values.

Milian Wolff

Senior Software Engineer

Milian Wolff has a long history of creating tools for C++ developers. He’s the main author of Massif-Visualizer, heaptrack, hotspot, and ctf2ctf, tools now widely used to improve the performance of C++ applications. He’s a Senior Software Engineer at KDAB, where he enjoys solving hard performance problems and teaching developers about debugging and profiling tools. Milian has a Master’s degree in Physics.