A little hidden gem: QStringIterator

A few days ago Marc Mutz, a colleague of mine at KDAB and also an author on this blog, spotted this function in Qt’s source code (documentation):

/*!
    Returns \c true if the string only contains uppercase letters,
    otherwise returns \c false.
*/
bool QString::isUpper() const
{
    if (isEmpty())
        return false;

    const QChar *d = data();

    for (int i = 0, max = size(); i < max; ++i) {
        if (!d[i].isUpper())
            return false;
    }

    return true;
}

Apart from the mistake of considering empty strings not uppercase, which can be easily fixed, the loop in the body looks innocent enough. How would we figure out if a string only contains uppercase letters (as per the documentation in the snippet), anyhow?

  • Look at the string character by character;
  • If we see a non-uppercase character, the string is not uppercase;
  • Otherwise, it is uppercase.

That’s exactly what the for loop in the code above is doing, right?

Well, no.

The code above is broken.

It falls into the same trap as countless other pieces of similar code: it doesn’t take into account that QString does not contain characters/code points, but rather UTF-16 code units.

All operations on a QString (getting the length, splitting, iterating, etc.) always work in terms of UTF-16 code units, not code points. The reality is: QString is Unicode-aware only in some of its algorithms; certainly not in its storage.

For instance, if a string contains simply the character “𝐀” — that is, MATHEMATICAL BOLD CAPITAL A (U+1D400) — then its QString storage would actually contain 2 “characters” reported by size() (again, really, not characters in the sense of code points but two UTF-16 code units): 0xD835 and 0xDC00.

The naïve iteration done above would then check whether those two code units are uppercase, and guess what, they’re not; it would therefore conclude that the string is not uppercase, even though it is. (Those two code units are “special” and used to encode a character outside the BMP; together they’re called a surrogate pair. When taken alone, they’re invalid.)
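To make the arithmetic concrete, here is a minimal standalone sketch. It uses std::u16string instead of QString so that it builds without Qt, and the helper names are made up for illustration (Qt offers equivalents as QChar::isHighSurrogate(), QChar::isLowSurrogate() and QChar::surrogateToUcs4()).

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers, for illustration only.
inline bool isHighSurrogate(char16_t u) { return (u & 0xFC00) == 0xD800; }
inline bool isLowSurrogate(char16_t u)  { return (u & 0xFC00) == 0xDC00; }

// Combine a surrogate pair into the code point it encodes.
inline char32_t combineSurrogates(char16_t high, char16_t low)
{
    assert(isHighSurrogate(high) && isLowSurrogate(low));
    return 0x10000 + ((char32_t(high) - 0xD800) << 10) + (char32_t(low) - 0xDC00);
}

// The string from the example above: one code point, U+1D400,
// stored as two UTF-16 code units.
inline std::u16string boldCapitalA() { return u"\U0001D400"; }
```

With these, boldCapitalA().size() is 2, the two code units are 0xD835 and 0xDC00, and combineSurrogates(0xD835, 0xDC00) gives back 0x1D400. Neither code unit on its own is a letter, which is why a per-unit isUpper() check fails.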

Wherefore art thou, Unicode?

If you want to know more about what all of this Unicode story is about, please take a few minutes and read this and this. The resources linked are also good reads.

The problem of Unicode-aware iteration over string data is so common and frequent that back in 2014 I contributed a new class to Qt to solve it. The class is called, unsurprisingly, QStringIterator.

From its own documentation:

QStringIterator is a Java-like, bidirectional, const iterator over the contents of a QString. Unlike QString’s own iterators, which manage the individual UTF-16 code units, QStringIterator is Unicode-aware: it will transparently handle the surrogate pairs that may be present in a QString, and return the individual Unicode code points.

Any code that walks over the contents of a QString should consider using QStringIterator, which prevents all such mistakes and leaves the burden of decoding UTF-16 into a series of code points to Qt. Indeed, QStringIterator is now used in many critical places inside Qt (text encoding, font handling, text classes, etc.).

How do I use it?

For various reasons (see below) QStringIterator is private API at the moment. Code that wants to use it has to include its header and enable the use of Qt’s private APIs, for instance with qmake:

QT += core-private

Or similarly with CMake:

target_link_libraries(my_target Qt5::CorePrivate)

Then we can include it, and use it to properly implement isUpper():

#include <private/qstringiterator_p.h>

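bool QString::isUpper() const
{
    // Iterate over code points, not over UTF-16 code units.
    QStringIterator it(*this);

    while (it.hasNext()) {
        // next() decodes a full code point, consuming a surrogate pair if needed.
        uint c = it.next();
        if (!QChar::isUpper(c))
            return false;
    }

    // Also reached for an empty string.
    return true;
}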

The call to next() will read as many code units as are necessary to fully decode the next code point, and it will also do error checking.

(If decoding fails, next() will return U+FFFD (REPLACEMENT CHARACTER), which has the nice property of not being uppercase, therefore making the function return false. But this is an implementation detail; calling QString algorithms on a string that contains illegal UTF-16 encoded data is unspecified behavior already, so don’t do it.)
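For readers curious about what such a decoding step involves, here is a rough sketch of a hasNext()/next() style reader. It is not Qt’s implementation: CodePointReader and decode() are hypothetical names, the class works on std::u16string so that it builds without Qt, and it copies its input for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for illustration only; NOT Qt's implementation.
class CodePointReader
{
public:
    explicit CodePointReader(std::u16string s) : m_s(std::move(s)) {}

    bool hasNext() const { return m_pos < m_s.size(); }

    // Decode the code point at the current position and advance past it.
    char32_t next()
    {
        assert(hasNext());
        const char16_t u = m_s[m_pos++];
        if ((u & 0xF800) != 0xD800)
            return u; // not a surrogate: a BMP code point in one code unit

        // A high surrogate must be followed by a low surrogate.
        if ((u & 0xFC00) == 0xD800 && hasNext() && (m_s[m_pos] & 0xFC00) == 0xDC00) {
            const char16_t low = m_s[m_pos++];
            return 0x10000 + ((char32_t(u) - 0xD800) << 10) + (char32_t(low) - 0xDC00);
        }

        return 0xFFFD; // lone surrogate: REPLACEMENT CHARACTER
    }

private:
    std::u16string m_s;
    std::size_t m_pos = 0;
};

// Decode a whole UTF-16 string into its code points.
inline std::u32string decode(const std::u16string &s)
{
    CodePointReader reader(s);
    std::u32string result;
    while (reader.hasNext())
        result.push_back(reader.next());
    return result;
}
```

In this sketch a surrogate pair comes out as a single code point, while a surrogate without its partner comes out as U+FFFD, mirroring the behavior described above.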

QStringIterator’s API is quite rich; it supports bidirectional iteration, some customization of what should happen in case of decoding failure, as well as unchecked iteration (iteration that assumes that the QString contents are valid UTF-16; this allows skipping some checks).

That’s it, no more excuses, start using QStringIterator today!

Regarding the QString::isUpper() function that we started this journey with: trying to fix it caused quite a discussion during code review, as you can see here and here.

Why isn’t QStringIterator public API?

There are a few reasons why I am keeping QStringIterator as private API. It’s not because its API is in constant evolution — actually, it has not changed significantly in the past 6 years. QStringIterator even has complete documentation, tests and examples (the documentation is readable here).

From my personal point of view:

  • The API would benefit from a serious facelift, becoming more C++ oriented, and way less Java oriented. Rather than writing this:
    QStringIterator i(str);
    while (i.hasNext())
      use(i.next());
    

    one should also be able to write something like this:

    // C++11
    for (auto cp : QStringIterator(str))
      use(cp);
    
    // C++20
    auto stringLenInCodePoints = std::ranges::distance(QStringIterator(str));
    bool stringIsUpperCase = std::ranges::all_of(QStringIterator(str), &QChar::isUpper);
    
    // C++20 + P1206
    auto decodedString = QStringIterator(str) | std::ranges::to<QVector<uint>>();
    

    None of the required APIs to make this possible exist at the moment — QStringIterator is neither a range nor an iterable type.

    Making it so opens up many, many API problems: from minor things, like whether QStringIterator is a good name given that it would hand out iterators, to huge design problems, like how to add customization points to decide how to handle strings containing malformed UTF-16 data (skip? replace? stop? throw an exception?).

  • The implementation is optimized for clarity, not raw speed. At the moment, it doesn’t use SIMD or any similar intrinsics. I strongly feel that it may benefit from such improvements, if we redesign its API (e.g. making the failure mode a customization point).
  • There is other, similar, more general purpose work happening elsewhere. For instance, in the glorious ICU libraries, in the work happening in the SG16 WG21 study group, in the proposed Boost.Text, and so on. We may just decide to use the results of some of that work, rather than coming up with a Qt-specific way of using a particular algorithm (UTF-16 decoding).
  • Unicode is complicated, and we may have forgotten to handle some corner case properly. If we set QStringIterator’s API/ABI in stone (by making it public), we risk ending up with our hands tied for future necessary expansion.
  • Most of Qt assumes valid UTF-16 content in QStrings (see the comment above). We need a project-wide decision on how to actually detect and tackle invalid UTF-16 content, and enforce it consistently. QStringIterator should therefore follow such decision, and that becomes very hard if we’re again constrained by the public API promise.
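To give an idea of what the range-based for loop wished for in the first point might require, here is a sketch of a type with begin() and end(). It is purely hypothetical: CodePoints and countCodePoints() do not exist in Qt, the code works on std::u16string so that it builds without Qt, and it hard-codes “replace with U+FFFD” for malformed input, which is exactly the kind of decision a real design would have to make customizable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical range over the code points of a UTF-16 string.
// Illustration only: no such type exists in Qt.
class CodePoints
{
public:
    explicit CodePoints(std::u16string s) : m_s(std::move(s)) {}

    class iterator
    {
    public:
        iterator(const std::u16string *s, std::size_t pos) : m_s(s), m_pos(pos) {}

        char32_t operator*() const
        {
            std::size_t length = 0;
            return decode(length);
        }

        iterator &operator++()
        {
            std::size_t length = 0;
            decode(length);
            m_pos += length; // skip one or two code units
            return *this;
        }

        bool operator==(const iterator &other) const { return m_pos == other.m_pos; }
        bool operator!=(const iterator &other) const { return m_pos != other.m_pos; }

    private:
        // Decode the code point at m_pos and report how many code units it spans.
        char32_t decode(std::size_t &length) const
        {
            const std::u16string &s = *m_s;
            assert(m_pos < s.size());
            const char16_t u = s[m_pos];
            length = 1;
            if ((u & 0xF800) != 0xD800)
                return u; // not a surrogate
            if ((u & 0xFC00) == 0xD800 && m_pos + 1 < s.size()
                && (s[m_pos + 1] & 0xFC00) == 0xDC00) {
                length = 2;
                return 0x10000 + ((char32_t(u) - 0xD800) << 10)
                       + (char32_t(s[m_pos + 1]) - 0xDC00);
            }
            return 0xFFFD; // malformed input: hard-coded replacement policy
        }

        const std::u16string *m_s;
        std::size_t m_pos;
    };

    iterator begin() const { return iterator(&m_s, 0); }
    iterator end() const { return iterator(&m_s, m_s.size()); }

private:
    std::u16string m_s;
};

// Usage: count code points (not code units) with a range-based for loop.
inline std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t count = 0;
    for (char32_t cp : CodePoints(s)) {
        (void)cp;
        ++count;
    }
    return count;
}
```

Even this toy version only provides the bare minimum for a range-based for loop; making it a proper C++20 range would additionally need the iterator traits and a sentinel or common-range design, which is part of the API work described above.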

With all of this in mind, I am not comfortable with committing to QStringIterator as public API at the moment. But again, that doesn’t mean that you can’t use it in your code today, and maybe submit some feedback.

Happy hacking!

