A few days ago Marc Mutz, a colleague of mine at KDAB and also an author on this blog, spotted this function in Qt's source code (documentation):
/*!
    Returns \c true if the string only contains uppercase letters,
    otherwise returns \c false.
*/
bool QString::isUpper() const
{
    if (isEmpty())
        return false;
    const QChar *d = data();
    for (int i = 0, max = size(); i < max; ++i) {
        if (!d[i].isUpper())
            return false;
    }
    return true;
}
Apart from the mistake of considering empty strings not uppercase, which can be easily fixed, the loop in the body looks innocent enough. How would we figure out if a string only contains uppercase letters (as per the documentation in the snippet), anyhow?
That's exactly what the for loop in the code above is doing, right?
Well, no.
The code above is broken.
It falls into the same trap as countless other pieces of similar code: it doesn't take into account that QString does not contain characters/code points, but rather UTF-16 code units.
All operations on a QString (getting the length, splitting, iterating, etc.) always work in terms of UTF-16 code units, not code points. The reality is: QString is Unicode-aware only in some of its algorithms; certainly not in its storage.
For instance, if a string contains simply the character "𝐀" -- that is, MATHEMATICAL BOLD CAPITAL A (U+1D400) -- then its QString storage would actually contain 2 "characters" as reported by size() (again, really, not characters in the sense of code points but two UTF-16 code units): 0xD835 and 0xDC00.
The naïve iteration done above would then check whether those two code units are uppercase, and guess what, they're not; and therefore conclude that the string is not uppercase, while instead it is. (Those two code units are "special" and used to encode a character outside the BMP; they're called a surrogate pair. When taken alone, they're invalid.)
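To see where 0xD835 and 0xDC00 come from, here is a minimal sketch of the UTF-16 encoding arithmetic in standard C++. It uses std::u16string as a stand-in for QString's storage, and encodeUtf16 is a hypothetical helper written for this illustration, not a Qt function.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
std::u16string encodeUtf16(char32_t cp)
{
    // Precondition: a valid scalar value (in range, and not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        // Inside the BMP: one code unit, same value as the code point.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the 20 bits of (cp - 0x10000) in two halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

For U+1D400 the subtraction gives 0xD400; its top 10 bits are 0x35 and its bottom 10 bits are 0, hence the pair 0xD835, 0xDC00. Neither of those two values is an uppercase letter on its own, which is why the per-code-unit loop gets the wrong answer.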
If you want to know more about what all of this Unicode story is about, please take a few minutes and read this and this. The resources linked are also good reads.
The problem of Unicode-aware iteration over string data is so common and frequent that back in 2014 I contributed a new class to Qt to solve it. The class is called, unsurprisingly, QStringIterator.
From its own documentation:
QStringIterator is a Java-like, bidirectional, const iterator over the contents of a QString. Unlike QString's own iterators, which manage the individual UTF-16 code units, QStringIterator is Unicode-aware: it will transparently handle the surrogate pairs that may be present in a QString, and return the individual Unicode code points.
Any code that walks over the contents of a QString should consider using QStringIterator, therefore preventing all such possible mistakes as well as leaving the burden of decoding UTF-16 into a series of code points to Qt. Indeed, QStringIterator is now used in many critical places inside Qt (text encoding, font handling, text classes, etc.).
For various reasons (see below) QStringIterator is private API at the moment. Code that wants to use it has to include its header and enable the usage of private Qt APIs, for instance like this when using qmake:
QT += core-private
Or similarly with CMake:
target_link_libraries(my_target Qt5::CorePrivate)
Then we can include it, and use it to properly implement isUpper():
#include <private/qstringiterator_p.h>

bool QString::isUpper() const
{
    QStringIterator it(*this);
    while (it.hasNext()) {
        uint c = it.next();
        if (!QChar::isUpper(c))
            return false;
    }
    return true;
}
The call to next() will read as many code units as are necessary to fully decode the next code point, and it will also do error checking.
(If decoding fails, it will return U+FFFD (REPLACEMENT CHARACTER), which has the nice property of not being uppercase, therefore making the function return false. But this is an implementation detail; calling QString algorithms on a string that contains illegal UTF-16 encoded data is unspecified behavior already, so don't do it.)
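To illustrate what "reading as many code units as necessary, with error checking" means, here is a minimal sketch in standard C++. It is not QStringIterator's actual implementation: decodeNext and the surrogate helpers are hypothetical names, std::u16string stands in for QString's storage, and replacing bad input with U+FFFD is just the policy described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr char32_t ReplacementCharacter = 0xFFFD;

inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: decodes the code point starting at s[pos] and
// advances pos past it. An unpaired surrogate consumes one code unit
// and is reported as U+FFFD.
char32_t decodeNext(const std::u16string &s, std::size_t &pos)
{
    assert(pos < s.size());
    const char16_t first = s[pos++];

    if (isHighSurrogate(first)) {
        if (pos < s.size() && isLowSurrogate(s[pos])) {
            const char16_t second = s[pos++];
            return 0x10000
                 + ((static_cast<char32_t>(first) - 0xD800) << 10)
                 + (static_cast<char32_t>(second) - 0xDC00);
        }
        return ReplacementCharacter; // high surrogate without its partner
    }

    if (isLowSurrogate(first))
        return ReplacementCharacter; // stray low surrogate

    return first; // ordinary BMP code unit
}
```

A loop of the form `while (pos < s.size()) use(decodeNext(s, pos));` then visits code points rather than code units, which is the behavior isUpper() needs.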
QStringIterator's API is quite rich; it supports bidirectional iteration, some customization of what should happen in case of decoding failure, as well as unchecked iteration (iteration that assumes that the QString contents are valid UTF-16; this allows skipping some checks).
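Bidirectional iteration is slightly less obvious than the forward case, because walking backwards meets the low surrogate first. The following is a rough sketch in standard C++, not Qt code: decodePrevious is a hypothetical helper, std::u16string stands in for QString's storage, and U+FFFD is an assumed failure policy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the code point that ends just before s[pos]
// and moves pos back to the start of that code point.
char32_t decodePrevious(const std::u16string &s, std::size_t &pos)
{
    assert(pos > 0 && pos <= s.size());
    const char16_t last = s[--pos];

    const bool lastIsLow  = last >= 0xDC00 && last <= 0xDFFF;
    const bool lastIsHigh = last >= 0xD800 && last <= 0xDBFF;

    if (lastIsLow) {
        // A low surrogate is only valid if a high surrogate precedes it.
        if (pos > 0) {
            const char16_t before = s[pos - 1];
            if (before >= 0xD800 && before <= 0xDBFF) {
                --pos;
                return 0x10000
                     + ((static_cast<char32_t>(before) - 0xD800) << 10)
                     + (static_cast<char32_t>(last) - 0xDC00);
            }
        }
        return 0xFFFD; // stray low surrogate
    }

    if (lastIsHigh)
        return 0xFFFD; // high surrogate with nothing valid after it

    return last; // ordinary BMP code unit
}
```

An unchecked variant would drop the range tests and simply assume that every low surrogate is preceded by a high one, which is the kind of saving the paragraph above refers to.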
That's it, no more excuses, start using QStringIterator today!
Regarding the QString::isUpper() function that we started this journey with: trying to fix it caused quite a discussion during code review, as you can see here and here.
There are a few reasons why I am keeping QStringIterator as private API. It's not because its API is in constant evolution -- actually, it has not changed significantly in the past 6 years. QStringIterator even has complete documentation, tests and examples (the documentation is readable here).
From my personal point of view:
Rather than writing this:
QStringIterator i(str);
while (i.hasNext())
    use(i.next());
one should also be able to write something like this:
// C++11
for (auto cp : QStringIterator(str))
    use(cp);

// C++20
auto stringLenInCodePoints = std::ranges::distance(QStringIterator(str));
bool stringIsUpperCase = std::ranges::all_of(QStringIterator(str), &QChar::isUpper);

// C++20 + P1206
auto decodedString = QStringIterator(str) | std::ranges::to<QVector<uint>>;
None of the required APIs to make this possible exist at the moment -- QStringIterator is neither a range nor an iterable type.
Making it so opens up many, many API problems: from minor things, like whether QStringIterator is still a good name given that it would then yield iterators, to huge design problems, like how to add customization points to decide how to handle strings containing malformed UTF-16 data (skip? replace? stop? throw an exception?).
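To make the gap concrete, here is a minimal sketch of the kind of wrapper a range-based for loop would need. Everything in it is hypothetical: CodePointRange and lengthInCodePoints do not exist in Qt, std::u16string stands in for QString, and the sketch hard-codes one failure policy (replace with U+FFFD), which is exactly the decision that a real API would have to expose.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical range over the code points of a UTF-16 string.
// The string must outlive the range.
class CodePointRange
{
public:
    explicit CodePointRange(const std::u16string &s)
        : m_begin(s.data()), m_end(s.data() + s.size()) {}

    class iterator
    {
    public:
        iterator(const char16_t *pos, const char16_t *end) : m_pos(pos), m_end(end) {}

        // Decode without moving: work on a copy of the position.
        char32_t operator*() const { const char16_t *p = m_pos; return decode(p); }
        // Advance by one code point (one or two code units).
        iterator &operator++() { decode(m_pos); return *this; }
        bool operator!=(const iterator &other) const { return m_pos != other.m_pos; }

    private:
        char32_t decode(const char16_t *&pos) const
        {
            assert(pos != m_end);
            const char32_t first = *pos++;
            if (first >= 0xD800 && first <= 0xDBFF) {
                if (pos != m_end && *pos >= 0xDC00 && *pos <= 0xDFFF)
                    return 0x10000 + ((first - 0xD800) << 10)
                         + (static_cast<char32_t>(*pos++) - 0xDC00);
                return 0xFFFD; // unpaired high surrogate
            }
            if (first >= 0xDC00 && first <= 0xDFFF)
                return 0xFFFD; // stray low surrogate
            return first;
        }

        const char16_t *m_pos;
        const char16_t *m_end;
    };

    iterator begin() const { return iterator(m_begin, m_end); }
    iterator end() const { return iterator(m_end, m_end); }

private:
    const char16_t *m_begin;
    const char16_t *m_end;
};

// Usage: count code points rather than code units.
std::size_t lengthInCodePoints(const std::u16string &s)
{
    std::size_t n = 0;
    for (char32_t cp : CodePointRange(s)) {
        (void)cp;
        ++n;
    }
    return n;
}
```

This is only an input-style iterator, enough for a range-based for loop. Satisfying the C++20 range concepts would additionally need the usual iterator typedefs, post-increment and default construction, and the sketch says nothing about how a caller would choose a different error policy.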
At the moment, it doesn't use SIMD or any similar intrinsics. I strongly feel that it may benefit from such improvements, if we redesign its API (e.g. making the failure mode a customization point).
For instance, in the glorious ICU libraries, in the work happening in the SG16 WG21 study group, in the proposed Boost.Text, and so on. We may just decide to use the results of some of that work, rather than coming up with a Qt-specific way of using a particular algorithm (UTF-16 decoding).
If we set QStringIterator's API/ABI in stone (by making it public), we risk ending up with our hands tied for future necessary expansion.
QStrings (see the comment above). We need a project-wide decision on how to actually detect and tackle invalid UTF-16 content, and enforce it consistently. QStringIterator should therefore follow such a decision, and that becomes very hard if we're again constrained by the public API promise.
With all of this in mind, I am not comfortable with committing QStringIterator as public API at the moment. But again, it doesn't mean that you can't use it in your code today, and maybe submit some feedback.
Happy hacking!
2 Comments
15 - Feb - 2022
Brian Warner
QStringIterator - private, undocumented, unstable. Using it is bad or even worse.
The fact is that nobody has a good solution. Does Qt6 solve this? I'm 99% sure it doesn't.
15 - Feb - 2022
Giuseppe D'Angelo
Hi,
Why the FUD about this? Yes, QStringIterator is a private class; the whole point of this blog post is to raise awareness about it.
"Undocumented" means no public documentation, because... it's a private class. It does not mean that comprehensive documentation about it does not exist; I wrote it with the idea that the class could become public some day: https://github.com/qt/qtbase/blob/dev/src/corelib/text/qstringiterator.qdoc
"Unstable", "using it is bad": where does that assertion come from? If anything, it's one of the most stable classes in Qt, having had maybe just one significant API change in the last 8 years. These are all the commits on it:
So while I perfectly understand the frustration at not having ready-made classes for Unicode iteration (... which is why I wrote QStringIterator in the first place, and why I then wrote this blog post), these aren't substantive criticisms :)