# A Speed-Up for Charting on Embedded Introducing Min–Max Trees to Summarize Large Amounts of Data

I’d like to talk about a common problem that we have seen come up in several projects, namely how to plot large datasets over time in an efficient manner using Qt. This seems to be a general issue, especially in embedded situations and is affecting many people. Because of this, I thought we’d share one approach that we have found to work well in some situations, maybe it helps you out in your project.

## Problem: Waveform Graphs of Large Data Sets

We have a situation where we are accumulating more than 500.000 samples of sensor data. The data is appended over time to a huge array. We want to visualize this growing amount of sensor information as a waveform graph: To intelligently render this on low-profile hardware, we cannot visit all the data points in a single render pass, but we need to cheat. Instead of drawing lines between all the hundreds of thousands of data points, we draw each vertical pixel column as one line reaching from the minimum sample to the maximum sample. With this method, we only need to render a few hundred lines instead of hundreds of thousands to reach the same visual result. To pull off this trick, however, we need to query minmax(beginIndex, endIndex) to obtain the range of values for which to draw a line very efficiently and very often. For this, I developed a MinMaxTree data structure which can offer high speeds for this query. The effective API of such a data structure can look like this:

template <class T> class MinMaxTree { public: explicit MinMaxTree(); void append(T value); MinMax<T> getMinMaxInRange(size_t rangeBegin, size_t rangeEnd) const; // ... };

This API allows you to insert a single data sample to store new sensor data. It also lets you retrieve the minimum and maximum given *begin* and *end* indices, which we can then use to render a line in our waveform.

## Crafting an Append-Only Min–Max Tree as a Search Index

Remembering computer science classes, I figured that reusing prior calculations can save a lot of work (*dynamic programming*). In addition, given very large data sets, trees can cut your query times tremendously by intelligently arranging data or—to put it differently—by allowing us to skip a vast amount of data quickly.
The solution I explain in this blog post uses both approaches at their best.
Here is the tree I modeled on top of the data:
The tree summarizes ranges of data points in the underlying array. I designed the tree nodes to contain the minimum and maximum values of their range. Since parent nodes represent the data of their left and right children, they contain the minimum and maximum of their combined ranges. Finally, the root node contains the minimum and maximum values of the entire data set.
Because I want to keep the tree small and profit from caching effects, each node represents a sub-range or “bucket” of the array. The bucket size can be adjusted to match the cache sizes of the hardware best. This keeps the tree flat while still enabling fast linear scans inside the bucket.

## Appending a Value, Updating the tree

There are two tasks that come with insertion: *updating the tree* and, optionally, *extending the tree*.
When appending a value, it needs to *update* the overlying tree structure. When inserted, the value needs to find and update its respective tree leaf, which, in turn, must inform its parent node and so on.
I hope that it’s easy to see that if a node’s minimum or maximum do not change, it does not need to inform its parent node. Using this optimization, the average-case complexity of insertion is very low.
The code snippet below illustrates this recursive “bubbling up” of a value trough the tree:

template<T>; void MinMaxTree<T>::bubbleUpMinMax(T value, size_t treeIndex) { //updates and returns true, if we altered minmax in the node auto minmax = [&](T value) -> bool { auto &node = m_treeOntopOfBuckets.at(treeIndex); const auto oldmin = node.min; const auto oldmax = node.max; node.min = std::min(value, node.min); node.max = std::max(value, node.max); return (value < oldmin || value > oldmax); }; //we are at the root, break recursion if (treeIndex == m_treeCapacity/2) { minmax(value); return; } //update node and optionally bubble up further else { if (minmax(value)) bubbleUpMinMax(value, parent(treeIndex)); } }

The second problem when inserting a value is that our tree structure might need to grow, because the new value breaches the capacity of the existing tree.
Here, the tree needs to *extend* “sideways” to leave its complete old structure intact and form a new node on top.
For this, I

- double the trees size and mirror its current structure to extend its reach,
- make a new root,
- copy the data from the old root into the new root.

Now that we have doubled the tree size, we can again insert data until we need to expand again. A note on the default values inside our nodes: I initialize all nodes with the highest possible value as minimum and lowest possible value as maximum.

template<T> MinMax{ T min = std::numeric_limits<T>::max(); T max = std::numeric_limits<T>::lowest(); //it’s not ::min(), this costed me hours };

So when actual real values enter the array, they will change `min`

and `max`

, because they are different to these default extremes.
**BE WARNED!** `std::numeric_limits::min()` represents the smallest positive representation of a value, not the lowest possible number. I learned it the hard way.

## Indexing: Squeezing the Tree into an Array

In our case, I wanted to optimize the tree accesses and its growth without using a pointer implementation that would link a node to its children using pointers. Instead, I adopted a commonly used trick to squeeze heaps into arrays, by letting the left child of an element be at 2 × index and the right child at 2 × index + 1. While it seems that I could have used this approach for this project as well, growing the tree would require me to insert and make space at many places. Instead, I went with an infix indexing method, putting the tree into an array in a “left node–root node–right node” order like this: This is nice for several reasons:

- It eliminates the need for the use of pointer chasing.
- Nodes and their parents are fairly close (in the lower tree).
- Expanding the tree is just doubling the array in size.
- The root node will be in the middle of the array.

To have convenient ways of accessing *rightChild* and *leftChild* as well as the *parentNode* from a certain index, we have to look at the binary representation of our indices.
Note how the parent index, for instance, always is the next higher “pure” power of 2. It would be a bit cumbersome to explain all the magic bit in text form, instead the code tells it best:
**leftChild**
Find lowest set bit, unset it, and set the one-lower bit

template<T> size_t MinMaxTree<T>::leftChild(size_t index) const { int lsb = ffs(index); index &= ~(1UL << (lsb-1)); index |= 1UL << (lsb-2); return index; }

**rightChild**
Find lowest set bit and set the one-lower bit

template<T> size_t MinMaxTree<T>::leftChild(size_t index) const { int lsb = ffs(index); index |= 1UL << (lsb-2); return index; }

**parentNode**
Find lowest set bit, unset it, and set the one-higher bit

template<T> size_t MinMaxTree<T>::parent(size_t index) const { int lsb = ffs(index); index &= ~(1UL << (lsb-1)); index |= 1UL << lsb; return index; }

All Functions use the glibc-provided intrinsic ffs, which can be found in `<strings.h>`

(`FindFirstSet`

to identify the lowest-significant set bit). On Windows, the intrinsic `BitScanForward`

can be used to accomplish this.
Similarly, I wrote a function to return the actual range in our data array a particular node covers.

## Querying a Range

Now that we have all tools in hand, we can finally implement the range query itself.
We recursively query the tree nodes, starting from root downward. We check how a node’s range relates to our query range and query its children for the minimum and maximum, according to the following few cases:
**Case 1:** The node’s range is outside of the query range → Return the uninitialized min/max.
**Case 2:** The node’s range is fully enclosed in the query range → That’s the best case, we just return the node’s min/max.
**Case 3:** The node’s range is partly inside, partly outside the query range or the node covers more than the range and we can split → Split, query left and right, and combine the results.
**Case 4: **The node’s range overlaps and we are at a leaf node → We need to scan through the bucket.

template<T> MinMax<T>::queryNode(size_t nodeIndex, size_t rangeBegin, size_t rangeEnd) const { // […] calculate Node Range, store in NodeBegin, NodeEnd // node outside range, return empty MinMax() // |------range----| // |---node---| or |---node---| // this should never happen, but safe is safe if ((nodeEnd < rangeBegin) || (nodeBegin > rangeEnd)) return MinMax<T> // range spans node fully, return nodes' MinMax // |---------range-----------| // |------node------| if ((rangeBegin <= nodeBegin) && (rangeEnd >= nodeEnd)) return m_treeOntopOfBuckets[nodeIndex]; // node cuts range left or right or both // |------range----| // |---node---| or |---node---| or // |--------node-------| if ((nodeBegin <= rangeBegin ) || (nodeEnd >= rangeEnd)) { const MinMax<T> leftMinMax = queryNode(leftChild(nodeIndex), rangeBegin, rangeEnd); const MinMax<T> rightMinMax = queryNode(rightChild(nodeIndex), rangeBegin, rangeEnd); MinMax<T> resultMinMax; resultMinMax.min = std::min(rightMinMax.min, leftMinMax.min); resultMinMax.max = std::max(rightMinMax.max, leftMinMax.max); return resultMinMax; } // Node is leaf, re-scan its bucket for subrange // |---------range-----------| // |----node(leaf)----| or |----node(leaf)----| if( nodeIndex % 2 == 1) { MinMax<T> scanResult; for(size_t i = std::max(nodeBegin, rangeBegin); i <= std::min(nodeEnd, rangeEnd); i++) { scanResult.min = std::min(m_dataInBuckets[i], scanResult.min); scanResult.max = std::max(m_dataInBuckets[i], scanResult.max); } return scanResult; } }

## Results

Visualizing **300.000 data points took more than 2 seconds** with the naïve approach (drawing lines from point to point) on the embedded test hardware, while **this approach brought down the time for queries to about 3 ms**, such that the pure line rendering now even accounts for the bigger part of the cost (6 ms). We again have reached levels below the juicy target of 16 ms.

## Further Optimization: Find Lowest Common Parent

In a similar setup, my colleague James Turner found that many of our node visits actually were searching down the tree for the lowest common node covering the full range. In these cases, we often ended up splitting the node in its two children, one of which containing the range fully and the other not containing the range at all. This initial search down the tree took most of the node visits, while the actual desired min max scan (collecting all the nodes containing the range) was less than that. So instead of searching from the top, beginning at the root node, I now calculate the left and right leaves, then determine the lowest common node spanning the full range, and perform the query from there (which is a simple loop on bit operations). I hope, you found this useful. I am happy to hear your comments about this solution. At KDAB, we do these kind of optimizations specific to low-end hardware all the time, ask us if we can help you or check out our Profiling Workshop to learn how to find and tackle performance bottlenecks in your own projects.

Great post! This looks like you “(re-)invented” one of the polyline simplification algorithms. We implemented couple of such algorithms in LabPlot last year (https://labplot.kde.org/2017/04/09/labplot-2-4-released/). Those are documented in https://phabricator.kde.org/T3507 – in case you’re interested, you can maybe check this collection of useful links here. I think KST hat also something similar in place.

Independent of this, your results with 300k points visualized within 2 seconds on the embedded hardware are still very impressive. Can you provide a small code snippet for how you benchmarked this?

Hi, Alexander.

Thank you for your comment!

It is interesting and in the same time a bit sad, how often engineers need to reinvent concepts although solutions exist.

In the same time, each approach can find new tricks and might be custom-tailored to the needs of the individual problem.

I think with “within 2 seconds” you might mean the actual end-result (3ms for the queries)?

(2 seconds was a single renderpass on the original implementation, that was bad.)

I just used

`QElapsedTimer t; t.start(); […renderloop…] auto time = t.nsecs();`

around the data-render-pass which consistet of about 150 pixel columns in one loop.Because I was curious, I did the measurement twice: once with the regular QPainter::drawLine() and one without the QPainter part.

This is how I came to the results which I describe in the Text. Of course QElapsedTimer has some overhead here, but in the realm of miliseconds, I considered it less of an issue.

There is one more aspect, that I did not describe in the Blog: Imagine an initial state with not-so-much sensor data. In the process when more and more data accumulates, the graph might need to switch from point-to-point-rendering to this optimized rendering. I can imagine, that in programs like LabPlot, handling this transition, such that it does not appear as a visual glitch is quite challenging.

PS: In the meantime some people on the cpp-reddit found a bit more about the theoretical concepts behind all of this:

WP: Range Miniumum Query and WP: Cartesian tree

Actually Qwt does something similar, when enabling QwtPlotCurve::FilterPointsAggressive ( since Qwt 6.2 ).

Maybe interesting in this context: the Qt raster paint engine had a nasty bug for a couple of versions, when drawing a vertical line of one 1 pixel width, that affected this optimization.

Interesting approach that I will have to remember for future use.

I’ve recently run into a similar issue of storing and displaying a huge number of data points (10 days of data with two sets of data produced 4 times a second). However I had to use C# and Live-Charts.

The approach I took was to calculate the start and end index into the circular buffer for the segment that I needed to display. The data was then thinned based upon the length of the time segment being displayed.

Needless to say WPF made it hard to get the required performance.