Fast Duplicate Tracking DuplicateTracker Merged to KDToolBox

07.09.2021

Marc Mutz

No comments

In 2019, I optimized QStringList::removeDuplicates() by using std::pmp::unordered_set with a std::pmr::monotonic_buffer_resource, when available. The class that I wrote to encapsulate this optimization has since been re-implemented three times. The latest iteration has recently landed in KDToolBox.

If you have code that looks a bit like this:

SomeLocalContainer seen;
~~~~
    if (seen.contains(x))
        continue;
    seen.insert(x);

then you should read on.

Before my change, the QStringList::removeDuplicates() function used a QSet to keep track of “seen” strings, so it could remove already-seen strings from the list the second time they are encountered, thereby removing all duplicate strings from the list.

There are quite a few problems with this approach.

Double Lookup

The contains() + insert() anti-pattern performs two look-ups for the same key, because the position that contains() internally has calculated is lost by the time contains() returns and thus has to be re-calculated in insert() again. We know that the position will be the same, but the compiler doesn’t. The STL associative containers return a pair (boooooo!) of iterator and bool, with the latter indicating whether the item was inserted (true) or was found to pre-exist (false). Since we want to insert anyway, we can just attempt the insert() and check whether it changed something:

std::unordered_set<~~~> seen;
~~~~
    if (auto [it, inserted] = seen.insert(x); !inserted)
        continue; // was already present
    // no need for seen.insert(x): already done

While this is still quite a mouthful, we’ve managed to reduce the look-ups by half.

The Qt containers lack this convenient API, of course, instead forcing the user to check whether the size of the container changed, like this:

QSet<~~~~> seen;
~~~~
    const auto oldSize = seen.size();
    seen.insert(x);
    if (seen.size() == oldSize)
        continue; // was already present
    // no need for seen.insert(x): already inserted

It’s odd, but works, too.

Allocations, Allocations

While QSet or std::unordered_set are fast once constructed, they both allocate a new node per element inserted. So in the usual case where you don’t have (many) duplicates, you’re performing O(N) memory allocations, N being the number of candidates to check.

How nice it would be if we could combine the speed of a hash table with the cache-friendliness of, say, a QVarLengthArray. Oh, wait. We can!

In C++17, we finally got some of the Bloomberg Allocators merged and, while quite a few implementations are still behind implementing them, that shouldn’t deter us, thanks to feature test macros.

Taking a look at the std::pmr namespace, we find a few template aliases that appear to implant a non-standard allocator, std::pmr::polymorphic_allocator, into the usual range of STL containers, including vector and unordered_set.

If you want to know all the gory details of that class, book one of KDAB’s Modern C++ trainings. Suffice to say that std::pmr::polymorphic_allocator implements an allocator that gets memory from a std::pmr::memory_resource, which is an interface you can inherit to implement your own custom allocation strategies.

Enter `monotonic_buffer_resource`

But we don’t even need to do that. We just need to plug one of the existing memory_resources, std::pmr::monotonic_buffer_resource, into std::pmr::unordered_set.

The monotonic_buffer_resource class implements a grow-only memory pool over a static char buffer where deallocation is a no-op and allocation just ups a pointer. In case the initial buffer is exhausted, monotonic_buffer_resource allocates a new block and continues to allocate from there. If that’s exhausted too, it allocates an even larger block, etc.

This allocator is as fast as it gets, but it isn’t thread-safe and it never frees memory (until destroyed itself). But that’s perfect for building temporary containers whose lifetimes are bound by a function.

The setup is a bit verbose, due to the flexibility of the std::pmr framework. But it’s always the same:

char buffer[1024]; // or whatever size fits
std::pmr::monotonic_buffer_resource res{buffer, sizeof buffer};
std::unordered_set<~~~> seen(N, &res); // N: expected number of elements
~~~~
    if (auto [it, ins] = seen.insert(x); !ins)
        continue;
    // already inserted

Now the first 1 KiB of memory is allocated on the stack, and only then do we go to the heap. Passing the expected number of elements to the unordered_set constructor (that’s like a reserve(), not a resize()!) is very important here because, while a re-hash wouldn’t take the nodes out of buffer, it would waste memory for the new bucket list because monotonic_buffer_resource doesn’t re-use freed-up memory again, thus creating gaps in our precious, buffer.

Tying It All Together

So, we have three things that we need to keep in mind:

Avoid double lookups,
Use std::pmr::monotonic_buffer_resource to avoid memory allocations,
…but fall back to QSet or std::unordered_set when std::pmr isn’t available.

Then, it’s just a question of writing a good hash function (which has recently become much simpler).

QDuplicateTracker (private API) bound these together for the first time, with a much nicer API:

QDuplicateTracker<~~~> seen;
seen.reserve(N);
~~~~
    if (seen.hasSeen(x))
        continue;
    // already inserted

Once you know which patterns to look for, the use-cases seem endless: Qt changes in topic “qduplicatetracker”.

I don’t recommend using QDuplicateTracker from Qt 5 anymore though, since, as I mentioned, I have taken this class through another two iterations in customer projects, finally culminating in the addition to KDToolBox. That’s the version you should be using.

Alternatively, it should be possible, even in Qt projects, to use the Qt 6 version, which I have recently improved for the Qt 6.3 release next year, simply by copying its header file into your project and replacing the QHasher internal struct (which depends on implementation details of Qt 6’s QHash container) with a template argument.

I invite you to study the source-code and post any questions that you may have in the comments below.

Happy duplicate tracking!

About KDAB

If you like this article and want to read similar material, consider subscribing via our RSS feed.

Subscribe to KDAB TV for similar informative short video content.

KDAB provides market leading software consulting and development services and training in Qt, C++ and 3D/OpenGL. Contact us.

Categories: C++ / KDAB Blogs / KDAB on Qt / Optimization / QtDevelopment / Technical

Fast Duplicate Tracking DuplicateTracker Merged to KDToolBox

Double Lookup

Allocations, Allocations

Enter monotonic_buffer_resource

Tying It All Together

Enter `monotonic_buffer_resource`