A little hidden gem: QStringIterator

27 January 2020

A few days ago Marc Mutz, a colleague of mine at KDAB and also an author on this blog, spotted this function in Qt's source code (documentation):

/*!
    Returns \c true if the string only contains uppercase letters,
    otherwise returns \c false.
*/
bool QString::isUpper() const
{
    if (isEmpty())
        return false;

    const QChar *d = data();

    for (int i = 0, max = size(); i < max; ++i) {
        if (!d[i].isUpper())
            return false;
    }

    return true;
}

Apart from the mistake of considering empty strings not uppercase, which can be easily fixed, the loop in the body looks innocent enough. How would we figure out if a string only contains uppercase letters (as per the documentation in the snippet), anyhow?

Look at the string character by character;
If we see a non-uppercase character, the string is not uppercase;
Otherwise, it is uppercase.

That's exactly what the for loop in the code above is doing, right?

Well, no.

The code above is broken.

It falls into the same trap as countless other pieces of similar code: it doesn't take into account that QString does not contain characters/code points, but rather UTF-16 code units.

All operations on a QString (getting the length, splitting, iterating, etc.) always work in terms of UTF-16 code units, not code points. The reality is: QString is Unicode-aware only in some of its algorithms; certainly not in its storage.

For instance, if a string contains simply the character "𝐀" -- that is, MATHEMATICAL BOLD CAPITAL A (U+1D400) -- then its QString storage would actually contain 2 "characters" reported by size() (again, really, not characters in the sense of code points but two UTF-16 code units): 0xD835 and 0xDC00.

The naïve iteration done above would then check whether those two code units are uppercase, and guess what, they're not; and therefore conclude that the string is not uppercase, while instead it is. (Those two code units are "special" and used to encode a character outside the BMP; they're called a surrogate pair. When taken alone, they're invalid.)
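To make the arithmetic behind that surrogate pair concrete, here is a minimal sketch in plain C++ (no Qt involved). The function name encodeUtf16 is made up for this illustration, and the code assumes it is handed a valid code point.

```cpp
#include <cassert>
#include <string>

// Illustrative helper (hypothetical name, not Qt API): encodes one Unicode
// code point as UTF-16, producing one or two code units.
std::u16string encodeUtf16(char32_t cp)
{
    // Assumption: cp is a valid scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        // Inside the BMP: the code point is its own single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the 20-bit offset into two 10-bit halves.
        const char32_t offset = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
    }
    return out;
}
```

Calling encodeUtf16(0x1D400) yields two code units, 0xD835 followed by 0xDC00, which is exactly what size() counts and what the naïve loop ends up inspecting one at a time.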

Wherefore art thou, Unicode?

If you want to know more about what all of this Unicode story is about, please take a few minutes and read this and this. The resources linked are also good reads.

The problem of Unicode-aware iteration over string data is so common and frequent that back in 2014 I contributed a new class to Qt to solve it. The class is called, unsurprisingly, QStringIterator.

From its own documentation:

QStringIterator is a Java-like, bidirectional, const iterator over the contents of a QString. Unlike QString's own iterators, which manage the individual UTF-16 code units, QStringIterator is Unicode-aware: it will transparently handle the surrogate pairs that may be present in a QString, and return the individual Unicode code points.

Any code that walks over the contents of a QString should consider using QStringIterator, thereby preventing all such mistakes and leaving the burden of decoding UTF-16 into a series of code points to Qt. Indeed, QStringIterator is now used in many critical places inside Qt (text encoding, font handling, text classes, etc.).

How do I use it?

For various reasons (see below) QStringIterator is private API at the moment. Code that wants to use it has to include its header and enable the usage of private Qt APIs, for instance with qmake:

QT += core-private

Or similarly with CMake:

target_link_libraries(my_target Qt5::CorePrivate)

Then we can include it, and use it to properly implement isUpper():

#include <private/qstringiterator_p.h>

bool QString::isUpper() const
{
    QStringIterator it(*this);

    while (it.hasNext()) {
        uint c = it.next();
        if (!QChar::isUpper(c))
            return false;
    }

    return true;
}

The call to next() will read as many code units as are necessary to fully decode the next code point, and it will also do error checking.

(In case of a decoding error, it will return U+FFFD (REPLACEMENT CHARACTER), which has the nice property of not being uppercase, therefore making the function return false. But this is an implementation detail; calling QString algorithms on a string that contains illegal UTF-16 encoded data is unspecified behavior already, so don't do it.)
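For readers curious about what "decoding the next code point, with error checking" involves, here is a rough sketch in plain C++. It is not Qt's implementation: the class name CodePointReader is hypothetical, it works on a std::u16string_view rather than a QString, and it simply assumes that U+FFFD is the desired result for malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical class for illustration only; it mimics the hasNext()/next()
// style of iteration over a UTF-16 buffer.
class CodePointReader
{
public:
    explicit CodePointReader(std::u16string_view s) : m_s(s) {}

    bool hasNext() const { return m_pos < m_s.size(); }

    // Decodes one code point and advances past it.
    char32_t next()
    {
        assert(hasNext());
        const char16_t u = m_s[m_pos++];

        // Not a surrogate: a single code unit is the whole code point.
        if (u < 0xD800 || u > 0xDFFF)
            return u;

        // High surrogate: it must be followed by a low surrogate.
        if (u <= 0xDBFF && m_pos < m_s.size()) {
            const char16_t low = m_s[m_pos];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                ++m_pos;
                return 0x10000 + ((char32_t(u) - 0xD800) << 10)
                               + (char32_t(low) - 0xDC00);
            }
        }

        // Lone or misplaced surrogate: report REPLACEMENT CHARACTER.
        return 0xFFFD;
    }

private:
    std::u16string_view m_s;
    std::size_t m_pos = 0;
};
```

Fed the code units 0xD835 0xDC00, next() returns U+1D400 in one step; fed a high surrogate with nothing after it, it returns U+FFFD.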

QStringIterator's API is quite rich; it supports bidirectional iteration, some customization of what should happen in case of decoding failure, as well as unchecked iteration (iteration that assumes that the QString contents are valid UTF-16; this allows skipping some checks).
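As an illustration of what going backwards with a configurable failure value can mean, here is a small sketch in plain C++. It is not QStringIterator's code; the function previousCodePoint and its invalidAs parameter are names chosen for this example, and the string is a std::u16string_view instead of a QString.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: decodes the code point that ends right before `pos`,
// moves `pos` back past it, and returns it. Malformed data yields `invalidAs`.
char32_t previousCodePoint(std::u16string_view s, std::size_t &pos,
                           char32_t invalidAs = 0xFFFD)
{
    assert(pos > 0 && pos <= s.size());
    const char16_t u = s[--pos];

    // Not a surrogate: done after one code unit.
    if (u < 0xD800 || u > 0xDFFF)
        return u;

    // Walking backwards we meet the low surrogate first; a high surrogate
    // must sit right before it.
    if (u >= 0xDC00 && pos > 0) {
        const char16_t high = s[pos - 1];
        if (high >= 0xD800 && high <= 0xDBFF) {
            --pos;
            return 0x10000 + ((char32_t(high) - 0xD800) << 10)
                           + (char32_t(u) - 0xDC00);
        }
    }

    // Unpaired surrogate: let the caller decide what that should look like.
    return invalidAs;
}
```

An "unchecked" variant would drop the two range checks on the neighbouring code unit and trust the input, which is only safe when the string is known to be valid UTF-16.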

That's it, no more excuses, start using QStringIterator today!

Regarding the QString::isUpper() function that we started this journey with: trying to fix it caused quite a discussion during code review, as you can see here and here.

Why isn't QStringIterator public API?

There are a few reasons why I am keeping QStringIterator as private API. It's not because its API is in constant evolution -- actually, it has not changed significantly in the past 6 years. QStringIterator even has complete documentation, tests and examples (the documentation is readable here).

From my personal point of view:

The API would benefit from a serious uplifting, becoming more C++ oriented, and way less Java oriented.

Rather than writing this:

QStringIterator i(str);
while (i.hasNext())
  use(i.next());

one should also be able to write something like this:

// C++11
for (auto cp : QStringIterator(str))
  use(cp);
 
// C++20
auto stringLenInCodePoints = std::ranges::distance(QStringIterator(str));
bool stringIsUpperCase = std::ranges::all_of(QStringIterator(str), &QChar::isUpper);
 
// C++20 + P1206
auto decodedString = QStringIterator(str) | std::ranges::to<QVector<uint>>;

None of the required APIs to make this possible exist at the moment -- QStringIterator is neither a range nor an iterable type.
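To give a flavour of the gap, this is roughly the minimum needed for a range-based for loop over code points. It is a sketch in plain C++17 under my own assumptions, not a proposal for Qt: CodePointView is a hypothetical name, it wraps a std::u16string_view, and it hard-codes U+FFFD for malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical view type: presents UTF-16 code units as code points.
class CodePointView
{
public:
    class iterator
    {
    public:
        iterator(std::u16string_view s, std::size_t pos) : m_s(s), m_pos(pos) {}

        char32_t operator*() const { return decode().codePoint; }
        iterator &operator++() { m_pos += decode().length; return *this; }
        bool operator==(const iterator &o) const { return m_pos == o.m_pos; }
        bool operator!=(const iterator &o) const { return m_pos != o.m_pos; }

    private:
        struct Decoded { char32_t codePoint; std::size_t length; };

        // Decodes the code point starting at m_pos without moving.
        Decoded decode() const
        {
            assert(m_pos < m_s.size());
            const char16_t u = m_s[m_pos];
            if (u < 0xD800 || u > 0xDFFF)
                return {u, 1};
            if (u <= 0xDBFF && m_pos + 1 < m_s.size()) {
                const char16_t low = m_s[m_pos + 1];
                if (low >= 0xDC00 && low <= 0xDFFF)
                    return {static_cast<char32_t>(0x10000
                                + ((char32_t(u) - 0xD800) << 10)
                                + (char32_t(low) - 0xDC00)), 2};
            }
            return {0xFFFD, 1}; // unpaired surrogate
        }

        std::u16string_view m_s;
        std::size_t m_pos;
    };

    explicit CodePointView(std::u16string_view s) : m_s(s) {}
    iterator begin() const { return iterator(m_s, 0); }
    iterator end() const { return iterator(m_s, m_s.size()); }

private:
    std::u16string_view m_s;
};
```

With this in place, for (char32_t cp : CodePointView(text)) use(cp); compiles and visits one code point per iteration. It is still far from a real solution: the iterator declares none of the member types that standard algorithms and C++20 ranges expect, and it decodes each code point twice.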

Making it so opens up many, many API problems: from minor things, like whether QStringIterator is still a good name given that it would hand out iterators, to huge design problems, like how to add customization points to decide how to handle strings containing malformed UTF-16 data (skip? replace? stop? throw an exception?).
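Just to show what those four options could mean in practice, here is one possible shape for such a customization point, sketched in plain C++. Everything in it is an assumption made for illustration: the OnMalformed enum, the decodeUtf16 function and the choice of a runtime parameter rather than, say, a template policy.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>
#include <vector>

// Hypothetical policy: what to do with an unpaired surrogate.
enum class OnMalformed { Skip, Replace, Stop, Throw };

// Hypothetical function: decodes a whole UTF-16 buffer into code points.
std::vector<char32_t> decodeUtf16(std::u16string_view s, OnMalformed policy)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u < 0xD800 || u > 0xDFFF) { // not a surrogate
            out.push_back(u);
            continue;
        }
        const bool paired = u <= 0xDBFF && i + 1 < s.size()
                            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (paired) {
            out.push_back(0x10000 + ((char32_t(u) - 0xD800) << 10)
                                  + (char32_t(s[i + 1]) - 0xDC00));
            ++i; // the low surrogate has been consumed too
            continue;
        }
        switch (policy) {
        case OnMalformed::Skip:    break;                      // drop it silently
        case OnMalformed::Replace: out.push_back(0xFFFD); break;
        case OnMalformed::Stop:    return out;                 // keep what we have
        case OnMalformed::Throw:   throw std::invalid_argument("malformed UTF-16");
        }
    }
    return out;
}
```

Even this toy version raises questions a public API would have to answer: whether the policy belongs in the type or in each call, and whether "skip" should be offered at all, since silently dropping data can hide bugs.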

The implementation is optimized for clarity, not raw speed.

At the moment, it doesn't use SIMD or any similar intrinsics. I strongly feel that it may benefit from such improvements, if we redesign its API (e.g. making the failure mode a customization point).

There is other, similar, more general purpose work happening elsewhere.

For instance, in the glorious ICU libraries, in the work happening in the SG16 WG21 study group, in the proposed Boost.Text, and so on. We may just decide to use the results of some of that work, rather than coming up with a Qt-specific way of using a particular algorithm (UTF-16 decoding).

Unicode is complicated, and we may have forgotten to handle some corner case properly.

If we set QStringIterator's API/ABI in stone (by making it public), we risk ending up with our hands tied for future necessary expansion.

Most of Qt assumes valid UTF-16 content in QStrings (see the comment above).

We need a project-wide decision on how to actually detect and tackle invalid UTF-16 content, and enforce it consistently. QStringIterator should then follow such a decision, and that becomes very hard if we're again constrained by the public API promise.

With all of this in mind, I am not comfortable with committing QStringIterator as public API at the moment. But again, it doesn't mean that you can't use it in your code today, and maybe submit some feedback.

Happy hacking!

Tags:

c++qml qt

2 Comments

QStringIterator - private, undocumented, unstable. Using it is bad or even worse.

The fact is that nobody has a good solution. Does Qt6 solve this? I'm 99% sure it doesn't.

Hi,

Why the FUD about this? Yes, QStringIterator is a private class; the whole point of this blog post is to raise awareness about it.

"Undocumented" means no public documentation, because... it's a private class. It does not mean that comprehensive documentation about it does not exist; I wrote it with the idea that the class could become public some day: https://github.com/qt/qtbase/blob/dev/src/corelib/text/qstringiterator.qdoc

"Unstable", "using it is bad": where does that assertion come from? If anything, it's one of the most stable classes in Qt, having had maybe just one significant API change in the last 8 years. These are all the commits on it:

So while I perfectly understand the frustration at not having ready-made classes for Unicode iteration (... which is why I wrote QStringIterator in the first place, and why I then wrote this blog post), these aren't substantive criticisms :)

Giuseppe D’Angelo

Senior Software Engineer

Senior Software Engineer at KDAB. Giuseppe is a long-time contributor to Qt, having used Qt and C++ since 2000, and is an Approver in the Qt Project. His contributions in Qt range from containers and regular expressions to GUI, Widgets, and OpenGL. A free software passionate and UNIX specialist, before joining KDAB, he organized conferences on opensource around Italy. He holds a BSc in Computer Science.