
Little Trouble in Big Data – Part 1

A few months ago, we received a phone call from a bioinformatics group at a European university. The problem they were having appeared very simple: they wanted to know how to use mmap() to be able to load a large data set into RAM at once. OK, I thought, no problem, I can handle that one. It turns out this grew into a complex and interesting exercise in profiling and threading.

The background is that they are performing Markov chain Monte Carlo simulations by sampling at random from data sets containing SNP (pronounced “snips”) genetic markers for a selection of people. It boils down to a large 2D matrix of floats where each column corresponds to a SNP and each row to a person. They provided some small and medium-sized data sets for me to test with, but their full data set consists of 500,000 people with 38 million SNP genetic markers!

The analysis involves selecting a column (SNP) at random in the data set, performing some computations on the data for all of the individuals, and collecting some summary statistics. Do that for all of the columns in the data set, and then repeat for a large number of iterations. This allows you to approximate the underlying true distribution from the discrete data that has been collected.
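To make that access pattern concrete, here is a minimal sketch of the loop in plain C++. All of the names (runSampling, numInds, numSnps) are illustrative rather than from the real code base, and the per-column computation is reduced to a simple sum:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Illustrative sketch of the access pattern: for many iterations, visit every
// column (SNP) in a random order and compute summary statistics over all
// individuals. The data is assumed to be column-major, one column per SNP.
void runSampling(const float *data, std::size_t numInds, std::size_t numSnps,
                 std::size_t numIterations)
{
    std::mt19937 rng{42};
    std::vector<std::size_t> columnOrder(numSnps);
    std::iota(columnOrder.begin(), columnOrder.end(), std::size_t{0});

    for (std::size_t it = 0; it < numIterations; ++it) {
        std::shuffle(columnOrder.begin(), columnOrder.end(), rng);
        for (std::size_t j : columnOrder) {
            const float *column = data + j * numInds;
            double sum = 0.0;
            for (std::size_t i = 0; i < numInds; ++i)
                sum += column[i];
            // ... update the Markov chain state from the summary statistics ...
        }
    }
}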

That’s the 10,000 ft view of the problem, so what was actually involved? Well, we undertook a bit of an adventure and learned some interesting stuff along the way, hence this blog series.

The stages we went through were:

  1. Preprocessing
  2. Loading the Data
  3. Fine-grained Threading
  4. Preprocessing Reprise
  5. Coarse Threading

In this blog, I’ll detail stages 1 and 2. The rest of the process will be revealed as the blog series unfolds, and I’ll include a final summary at the end.

1. Preprocessing

The first thing we noticed when looking at the code they already had is that quite a lot of work is done when reading in the data for each column. They compute some summary statistics on the column, then scale and bias all the data points in that column such that the mean is zero. Bearing in mind that each column will be processed many times (typically 10,000 to 1 million iterations), this is wasteful to repeat every time the column is used.

So, reusing some general advice from 3D graphics, we moved this work further up the pipeline into a preprocessing step. The SNP data is actually stored in a compressed form, which quantizes 4 SNP values into a single byte and which we decompress when loading. So the preprocessing step decompresses the SNP data, calculates the summary statistics, adjusts the data, and then writes the floats out to disk as a ppbed file (preprocessed bed, where bed is a standard format used for this kind of data).
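As a rough illustration, the per-column part of that preprocessing might look like the following. This is a sketch under assumptions rather than the project's actual code: the decompression of the compressed .bed encoding is omitted, and preprocessColumn and ppBedOut are hypothetical names.

#include <cstddef>
#include <fstream>
#include <numeric>
#include <vector>

// Centre one decompressed column (one float per individual) so its mean is
// zero, then append the raw floats to the preprocessed "ppbed" file.
void preprocessColumn(std::vector<float> &column, std::ofstream &ppBedOut)
{
    const double mean = std::accumulate(column.begin(), column.end(), 0.0) /
                        static_cast<double>(column.size());
    for (float &value : column)
        value = static_cast<float>(value - mean);

    ppBedOut.write(reinterpret_cast<const char *>(column.data()),
                   static_cast<std::streamsize>(column.size() * sizeof(float)));
}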

The upside is that we avoid all of this work on every iteration of the Monte Carlo simulation at runtime. The downside is that 1 float per SNP per person adds up to a hell of a lot of data for the larger data sets! In fact, for the full data set that's 500,000 × 38,000,000 × 4 bytes ≈ 76 TB (about 69 TiB) of floating point data! But to get things going, we were just worrying about smaller subsets. We will return to this later.

2. Loading the data

Even on moderately sized data sets, loading the entire data set into physical RAM at once is a no-go, as it will soon exhaust even the beefiest of machines. They have a 40-core, many-many-GB-of-RAM machine which was still being exhausted. This is where the original enquiry was aimed – how to use mmap(). It turns out it’s pretty easy, as you’d expect. It’s just a case of setting the correct flags so that the kernel doesn’t actually take a copy of the data in the file, namely PROT_READ and MAP_SHARED:

void Data::mapPreprocessBedFile(const string &preprocessedBedFile)
{
    // Calculate the expected file sizes - cast to size_t so that we don't overflow the unsigned int's
    // that we would otherwise get as intermediate variables!
    const size_t ppBedSize = size_t(numInds) * size_t(numIncdSnps) * sizeof(float);
 
    // Open and mmap the preprocessed bed file
    ppBedFd = open(preprocessedBedFile.c_str(), O_RDONLY);
    if (ppBedFd == -1)
        throw("Error: Failed to open preprocessed bed file [" + preprocessedBedFile + "]");
 
    ppBedMap = reinterpret_cast<float *>(mmap(nullptr, ppBedSize, PROT_READ, MAP_SHARED, ppBedFd, 0));
    if (ppBedMap == MAP_FAILED)
        throw("Error: Failed to mmap preprocessed bed file");
 
    ...
}

When dealing with such large amounts of data, be careful of overflows in temporaries! We had a bug where ppBedSize was overflowing and later causing a segfault.
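For illustration, assuming the counts are held in 32-bit unsigned integers (the values below are just the full data set's dimensions), the difference looks like this:

// The left-hand multiplication happens in 32 bits and wraps long before the
// widening assignment to size_t; casting the operands first avoids that.
const unsigned int numInds = 500000;
const unsigned int numIncdSnps = 38000000;

const size_t wrong = numInds * numIncdSnps * sizeof(float);                   // wraps around in 32 bits
const size_t right = size_t(numInds) * size_t(numIncdSnps) * sizeof(float);   // promoted before multiplying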

So, at this point we have float *ppBedMap pointing at the start of the huge 2D matrix of floats. That’s all well and good, but not very convenient to work with. The code base already made use of Eigen for vector and matrix operations, so it would be nice if we could interface with the underlying data using that.

It turns out we can (otherwise I wouldn’t have mentioned it). Eigen provides VectorXf and MatrixXf types for vectors and matrices, but these own the underlying data. Luckily, Eigen also provides a wrapper around these in the form of Map. Given our pointer to the raw float data which is mmap()’d, we can use the placement new operator to wrap it up for Eigen like so:

class Data
{
public:
    Data();
 
    // mmap related data
    int ppBedFd;
    float *ppBedMap;
    Map<MatrixXf> mappedZ; // Map has no default constructor, so the Data()
                           // constructor must initialise it, e.g. mappedZ(nullptr, 1, 1)
};
 
 
void Data::mapPreprocessBedFile(const string &preprocessedBedFile)
{
    ...
 
    ppBedMap = reinterpret_cast<float *>(mmap(nullptr, ppBedSize, PROT_READ, MAP_SHARED, ppBedFd, 0));
    if (ppBedMap == MAP_FAILED)
        throw("Error: Failed to mmap preprocessed bed file");
 
    // Wrap the raw mmap()'d pointer for Eigen: rows are individuals, columns are SNPs
    new (&mappedZ) Map<MatrixXf>(ppBedMap, numInds, numIncdSnps);
}

At this point we can now do operations on the mappedZ matrix and they will operate on the huge data file which will be paged in by the kernel as needed. We never need to write back to this data so we didn’t need the PROT_WRITE flag for mmap.
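For example, something along these lines (a hypothetical member function, not from the actual code base) reads one SNP column straight out of the mapped file:

// Any Eigen expression on mappedZ reads directly from the mmap()'d file;
// the kernel pages the relevant columns in on demand.
float Data::columnSumOfSquares(long snpIndex) const
{
    return mappedZ.col(snpIndex).squaredNorm();
}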

Yay! Original problem solved and we’ve saved a bunch of work at runtime by preprocessing. But there’s a catch! It’s still slow. See the next blog in the series for how we solved this.

Sean Harmer is a senior software engineer at KDAB where he heads up our UK office and also leads the 3D R&D team. He has been developing with C++ and Qt since 1998 and is Qt 3D Maintainer and lead developer in the Qt Project. Sean has broad experience and a keen interest in scientific visualization and animation in OpenGL and Qt. He holds a PhD in Astrophysics along with a Masters in Mathematics and Astrophysics.

1 thought on “Little Trouble in Big Data – Part 1”

  1. I would have done things slightly differently at a few points.

    First, your casting is inconsistent:

    const size_t ppBedSize = size_t(numInds) * size_t(numIncdSnps) * sizeof(float);
    ppBedMap = reinterpret_cast<float *>(mmap(…

    Since you already had problems with overflows, you may agree that finding casts easily is a good thing, which makes searching for “_cast<” handy. So this boils down to:

    const size_t ppBedSize = static_cast<size_t>(numInds) * static_cast<size_t>(numIncdSnps) * sizeof(float);
    ppBedMap = static_cast<float *>(mmap(…

    Yes, the latter one can also be a static_cast, as casting from void* is permitted.

    Next is, given that you opened the file with O_RDONLY and mapped it with PROT_READ I would expect ppBedMap to be “const float *” for all the good reasons that “const” has.

    And then there is ppBedFd, which I suspect you don’t need anymore. Once you have mmap()’d a file, you can close the file descriptor, unless you need it for other fancy things.
