Sean Harmer
27 June 2019
In the first blog in this series, I showed how we solved the original problem of using mmap() to load a large data set into RAM all at once, in response to a request for help from a bioinformatics group that deals with massive data sets on a regular basis. The catch in our solution, however, was that the process still took too long. In this blog, I describe how we solved that, starting with Step 3 of the Process I introduced in Blog 1:
The original code we inherited was written on the premise that:
Of course, there were some slight flaws in this plan.
const std::vector<unsigned int> colIndices = {0, 1, 2, 3, ... };
const std::vector<unsigned int> markerIndices = randomise(colIndices);

for (unsigned int i = 0; i < maxIterations; ++i) {
    for (unsigned int j = 0; j < numCols; ++j) {
        // Visit the columns (SNPs) in the randomised order.
        const unsigned int marker = markerIndices[j];
        const auto col = data.mappedZ.col(marker);
        output += doStuff(col);
    }

    if (i % numIterations == 0)
        writeOutput(output);
}
So, what can we do to make better use of the available cores? For technical reasons related to how Markov Chain Monte Carlo works, we can parallelize neither the outer loop over iterations nor the inner loop over the columns (SNPs). What else can we do?

Well, recall that we are dealing with large numbers of individuals - 500,000 of them, in fact. So we could split the operations on these 500k elements into smaller chunks, give each chunk to a core to process, and then recombine the results at the end. If we use Eigen for each chunk, we still get to keep the SIMD vectorization mentioned earlier. Now, we could do that ourselves, but why should we worry about chunking and synchronization when somebody else has already done it and tested it for us?
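For illustration, a hand-rolled version might look something like the sketch below (chunkedSum is just an illustrative stand-in for the real per-column work): split the data into per-core chunks, sum each chunk on its own std::thread, and recombine the partial results at the end.

#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Hand-rolled chunking: each thread sums one contiguous slice of the data,
// then the partial sums are combined on the calling thread.
double chunkedSum(const std::vector<double> &data)
{
    const size_t numThreads = std::max(1u, std::thread::hardware_concurrency());
    const size_t chunkSize = (data.size() + numThreads - 1) / numThreads;

    std::vector<double> partialSums(numThreads, 0.0);
    std::vector<std::thread> workers;

    for (size_t t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            const size_t begin = std::min(t * chunkSize, data.size());
            const size_t end = std::min(begin + chunkSize, data.size());
            partialSums[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }

    for (auto &worker : workers)
        worker.join();

    return std::accumulate(partialSums.begin(), partialSums.end(), 0.0);
}

It works, but now we own the chunk sizing, the thread lifetimes and the recombination step ourselves - exactly the kind of boilerplate someone else has already written and tested.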
This was an ideal chance for me to try out Intel's Threading Building Blocks library, TBB for short. As of 2017, it is available under the Apache 2.0 license and so is suitable for most uses.
TBB has just the feature for this kind of quick win in the form of its parallel_for and parallel_reduce template helpers. The former performs the map operation (it applies a function to each element of a collection, where each element is independent of the others). The latter performs the reduce operation, which is essentially a map operation followed by a series of combiner calls that boil the result down to a single value.

These are very easy to use, so you can trivially convert a serial piece of code into a threaded one just by passing in the collection and a lambda representing the map function (and also a combiner function in the case of parallel_reduce).
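To give a feel for the shape of the API, here is a minimal, purely illustrative parallel_for sketch (scaleInPlace is not from the real code base). TBB splits the index range into chunks and runs the lambda over each chunk on its worker threads:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

// Scale every element of a vector in parallel.
void scaleInPlace(std::vector<double> &values, double factor)
{
    tbb::parallel_for(tbb::blocked_range<size_t>(0, values.size()),
                      [&](const tbb::blocked_range<size_t> &r) {
        for (size_t i = r.begin(); i != r.end(); ++i)
            values[i] *= factor;
    });
}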
Let's take the case of a dot (or scalar) product as an example. Given two vectors of equal length, we multiply them together component-wise then sum the results to get the final value. To write a wrapper function that does this in parallel across many cores we can do something like this:
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <Eigen/Dense>

using namespace Eigen;
using namespace tbb;

const size_t grainSize = 10000;

double parallelDotProduct(const VectorXf &Cx, const VectorXd &y_tilde)
{
    const unsigned long startIndex = 0;
    const unsigned long endIndex = static_cast<unsigned long>(y_tilde.size());

    // Map: compute the partial dot product for the chunk of indices in r.
    auto apply = [&](const blocked_range<unsigned long> &r, double initialValue) {
        const long start = static_cast<long>(r.begin());
        const long count = static_cast<long>(r.end() - r.begin());
        const auto sum = initialValue + (Cx.segment(start, count).cast<double>().array() *
                                         y_tilde.segment(start, count).array()).sum();
        return sum;
    };

    // Reduce: combine the partial sums from two chunks.
    auto combine = [](double a, double b) { return a + b; };

    return parallel_reduce(blocked_range<unsigned long>(startIndex, endIndex, grainSize),
                           0.0, apply, combine);
}
Here, we pass in the two vectors whose scalar product we want, and store the start and end indices. We then define two lambda functions.

The apply lambda simply uses the component-wise operator * overload on the Eigen array expressions and the sum() function to calculate the dot product of the vectors for the subset of contiguous indices passed in via the blocked_range argument. The initialValue argument must be added on. It is just zero in this case, but it allows you to pass in data from other operations if your algorithm needs it.

The combine lambda then just adds up the outputs of the apply lambda.

When we then call parallel_reduce with these two functions and the range of indices over which they should be called, TBB splits the range behind the scenes into chunks based on the minimum size given by the grainSize we pass in. It then creates a lightweight task object for each chunk and queues these up onto TBB's work-stealing threadpool. We don't have to worry about synchronization, locking or threadpools at all. Just call this one helper template and it does what we need!
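As a quick sanity check, we can compare the wrapper against Eigen's own dot product on some random data. This little harness is just illustrative and assumes the parallelDotProduct function from above is in scope:

#include <Eigen/Dense>
#include <iostream>

int main()
{
    // Random test vectors standing in for the real data.
    const Eigen::VectorXf Cx = Eigen::VectorXf::Random(500000);
    const Eigen::VectorXd y_tilde = Eigen::VectorXd::Random(500000);

    const double parallel = parallelDotProduct(Cx, y_tilde);
    const double serial = Cx.cast<double>().dot(y_tilde);

    // The two results should agree to within floating-point rounding.
    std::cout << "parallel = " << parallel << ", serial = " << serial << "\n";
    return 0;
}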
The grain size may need some tuning to get optimal CPU usage, based upon how much work the lambdas are performing. As a general rule of thumb, it should be chosen so that more chunks (tasks) are generated than you have CPU cores; that way, the threadpool is less likely to leave some cores starved of work. Generate too many chunks, though, and too much time is spent on the overhead of scheduling and synchronizing the work and results between threads/cores.
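As a rough first guess (the initialGrainSize helper below is purely illustrative), you could aim for a handful of chunks per core and then tune from there:

#include <algorithm>
#include <cstddef>
#include <thread>

// Aim for roughly chunksPerCore tasks per core as a starting point.
size_t initialGrainSize(size_t elementCount, size_t chunksPerCore = 4)
{
    const size_t cores = std::max(1u, std::thread::hardware_concurrency());
    return std::max<size_t>(1, elementCount / (cores * chunksPerCore));
}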
I did this for all of the operations in the inner loop's doStuff() function, and for some others in the outer loop that do more work across the large (100,000+ element) vectors. This yielded a very nice improvement in the CPU utilization across cores.
So far so good. In the next blog, I’ll show you how we proceed from here, as it turns out this is not the end of the story.