QStringView Diaries: The Eagle Has Landed

QStringView merged for Qt 5.10

6 April 2017

QStringView: A std::string_view for QString

If you have never heard of std::string_view, you may want to learn about it from Marshall Clow's CppCon 2015 presentation.

TL;DR: String-views reduce temporary allocations.

Yours truly is not generally known to support reimplementing std facilities in Qt. So you might legitimately ask: "Why QStringView? Why not just use std::basic_string_view<QChar>?". The answer is the same as for QString itself. QString simply has a lot going for it that std::string is lacking. First and foremost, it has excellent Unicode support. So reimplementing std::string_view for QString/QChar is really a no-brainer.

QStringView tries to solve the problem that functions outside the very core of QString only take QString. There are usually not even QLatin1String overloads, even though most users pass just US-ASCII string literals to these functions. Sure, if you compile without QT_NO_CAST_FROM_ASCII, then just passing "foo" to a function taking QString works just fine.

But the use of QString has a cost: it allocates dynamic memory, and that is comparatively slow. For a string class, it has also fallen a bit behind the state of the art. It uses copy-on-write/implicit sharing, which developers outside Qt no longer consider an optimisation. It also does not use the small-string optimisation, which stores small strings in the object itself instead of in dynamic memory. That makes QString("OK") or QString("Cancel") much more expensive than it should be.

Enter QStringView

This is where string-views come in. QStringView is designed as a read-only view on QStrings and QString-like objects. QString-like objects include QStringRef, std::u16string, and char16_t literals (u"Hello"). This is useful, since a lot of functions that take QString do not need an actual QString. That is, they do not need an owning reference to the characters. They only need a weak reference: a non-owning pointer and a size, say. Or a pointer pair acting as iterators. And indeed, a lot of low-level functions take (const QChar* data, int length). In doing so, they do not require the construction of a QString just to iterate over its characters.

bool isValidIdentifier(const QChar *data, int len) {
    if (!data || len <= 0)
        return false;
    if (!data->isLetter())
        return false;
    --len;
    ++data;
    while (len) {
        if (!data->isLetterOrNumber())
            return false;
        ++data;
        --len;
    }
    return true;
}

Using pointer-and-length APIs has a cost, too, though.

Towards wide contracts in low-level string APIs

Such functions have preconditions. We say they have a narrow contract. Only certain combinations of the two parameters are allowed: The length must be non-negative, and the pointer mustn't be nullptr unless the length is zero, too.

If a function takes a QString instead, it has no preconditions. We say it has a wide contract: any QString is generally acceptable, and valid.

QStringView combines the efficiency and QString-independence of pointer-and-length APIs with the conceptual clarity of QString APIs. By passing an object of class type, we can (and do) enforce invariants between these parameters. Constructing a string-view with a negative length is undefined behaviour. And that is caught at string-view construction time (with an assertion in debug mode). Before the function is entered. This way, we put the onus of checking for valid parameters on the caller. So far, nothing changed compared to the pointer-and-size case. But the function can now assume that its QStringView argument references valid data.
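
As a rough illustration of enforcing invariants at construction time, here is a minimal sketch of a view class. Utf16View and startsWithUnderscore() are hypothetical names built on plain char16_t and the standard library. This is not QStringView's actual implementation, only a model of the idea: the cheap checks live in the constructor, so a function taking the view needs none of its own.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a string-view class; NOT Qt's implementation.
class Utf16View {
public:
    constexpr Utf16View() noexcept = default;

    // Narrow contract: the caller must pass a valid [data, data + size) range.
    // Only the cheap parts of that precondition can be asserted here.
    constexpr Utf16View(const char16_t *data, std::ptrdiff_t size)
        : m_data(data), m_size(size)
    {
        assert(size >= 0);          // a negative length is never valid
        assert(data || size == 0);  // nullptr is only allowed for an empty view
    }

    constexpr const char16_t *data() const noexcept { return m_data; }
    constexpr std::ptrdiff_t size() const noexcept { return m_size; }
    constexpr bool isEmpty() const noexcept { return m_size == 0; }

private:
    const char16_t *m_data = nullptr;
    std::ptrdiff_t m_size = 0;
};

// Wide contract: any correctly constructed Utf16View is acceptable,
// so there is nothing left to validate inside the function.
constexpr bool startsWithUnderscore(Utf16View v) noexcept
{
    return !v.isEmpty() && v.data()[0] == u'_';
}
```

A caller would write startsWithUnderscore(Utf16View(ptr, len)); a bad (ptr, len) pair trips the assertion in the constructor, before startsWithUnderscore() is ever entered.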

Practically speaking, this means that functions taking QStringView can be marked as noexcept while functions that take pointer-and-size cannot. At least if you buy into the rule that narrow-contract functions mustn't be noexcept (which both the standard and Qt libraries do).

bool isValidIdentifier(QStringView id) noexcept {
    if (id.isEmpty())
        return false;
    if (!id.front().isLetter())
        return false;
    for (QChar ch : id.mid(1)) {
        if (!ch.isLetterOrNumber())
            return false;
    }
    return true;
}

A (nearly) universal string-data sink

The most thrilling property of QStringView, however, is the wide variety of arguments with which you can construct one. Not only does it abstract away the container used to hold the character data: Whether your string data is stored in a QString, a QStringRef, a std::u16string or a std::u16string_view, QStringView won't care. It also abstracts away the plethora of character types Qt uses. It does not distinguish between QChar, ushort, char16_t or (on platforms, such as Windows, where it is a 2-byte type) wchar_t. It swallows any of those without a cast:

bool isValidIdentifier(QStringView id);
isValidIdentifier(u"QString");                // OK
isValidIdentifier(L"QString");                // OK (on Windows only)
isValidIdentifier(QStringLiteral("QString")); // OK
QString fun = "QString::left()";
isValidIdentifier(fun.leftRef(7));            // OK
isValidIdentifier(u"QString"s);               // OK
isValidIdentifier(L"QString"s);               // OK (on Windows only)

QStringView does not completely replace QString as an argument type, however. There are some (expensive-to-convert) argument types QString allows, but QStringView doesn't. Your QString function will happily accept a QChar or a QLatin1String, too. QStringView doesn't. If you use QStringBuilder (as you should), then your QString function can be called with a QStringBuilder expression. QStringView only accepts this with a manual cast to QString: f(QString(expr)).
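
The same trade-off can be tried out without Qt. The following is a sketch using std::u16string_view as a stand-in for QStringView; countUnits() and demoSources() are hypothetical names chosen for this example. It shows several kinds of storage reaching one non-template function without a copy, and, in the trailing comments, two argument kinds that a view does not accept implicitly.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sink: counts the UTF-16 code units equal to `needle`.
// It takes a view by value, so it does not care which container owns the data.
std::size_t countUnits(std::u16string_view text, char16_t needle) noexcept
{
    std::size_t n = 0;
    for (char16_t c : text) {
        if (c == needle)
            ++n;
    }
    return n;
}

// Callers with different storage all reach the same function without copying.
std::size_t demoSources()
{
    const std::u16string owned = u"a,b,c";            // owning container
    const char16_t raw[] = u"d,e";                    // plain array
    const std::u16string_view part(owned.data(), 3);  // pointer + length: "a,b"

    return countUnits(owned, u',')   // std::u16string converts implicitly
         + countUnits(raw, u',')     // const char16_t* converts implicitly
         + countUnits(part, u',');   // already a view
}

// Not accepted implicitly, comparable to the QChar / QLatin1String case:
//   countUnits(u'x', u',');      // a single code unit is not a view
//   countUnits("latin1", u',');  // narrow characters need a converting copy
```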

Future

By Qt 5.10, we'd like a QStringView which has most if not all of the const QString API. There are some notable exceptions we already know about: we will not add a split() method. One of the reasons to use a string-view is to enable zero-allocation parsing. The split() function, however, returns a dynamically-sized container of substrings. We intend to replace this functionality with a QStringTokenizer class. Taking the same arguments as QString::split(), it will have a container interface that allows you to plug it into a ranged for-loop:

QString s = ...;
for (QStringView part : QStringTokenizer(s, u'\n'))
    use(part);
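
To make the idea of zero-allocation splitting concrete, here is a minimal sketch of such a tokenizer over std::u16string_view. Utf16Splitter and countParts() are hypothetical names; the sketch only handles a single-character separator, and keeping empty parts is an assumption of this example, not a statement about how QStringTokenizer will behave.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a lazy tokenizer; names and behaviour are assumptions.
class Utf16Splitter {
public:
    class iterator {
    public:
        iterator(std::u16string_view text, char16_t sep, std::size_t pos) noexcept
            : m_text(text), m_sep(sep), m_pos(pos) { measure(); }

        // Yields a sub-view of the original text; nothing is copied.
        std::u16string_view operator*() const
        {
            assert(m_pos != npos);  // dereferencing end() violates a precondition
            return m_text.substr(m_pos, m_len);
        }

        iterator &operator++() noexcept
        {
            const std::size_t next = m_pos + m_len;  // separator index, or end of text
            m_pos = next < m_text.size() ? next + 1 : npos;
            measure();
            return *this;
        }

        bool operator!=(const iterator &other) const noexcept
        { return m_pos != other.m_pos; }

    private:
        static constexpr std::size_t npos = std::u16string_view::npos;

        // Computes the length of the token that starts at m_pos.
        void measure() noexcept
        {
            if (m_pos == npos) {
                m_len = 0;
                return;
            }
            const std::size_t hit = m_text.find(m_sep, m_pos);
            m_len = (hit == npos ? m_text.size() : hit) - m_pos;
        }

        std::u16string_view m_text;
        char16_t m_sep;
        std::size_t m_pos;
        std::size_t m_len = 0;
    };

    Utf16Splitter(std::u16string_view text, char16_t sep) noexcept
        : m_text(text), m_sep(sep) {}

    iterator begin() const noexcept { return iterator(m_text, m_sep, 0); }
    iterator end() const noexcept
    { return iterator(m_text, m_sep, std::u16string_view::npos); }

private:
    std::u16string_view m_text;
    char16_t m_sep;
};

// Usage: walks the parts with a ranged for-loop, without building a container.
std::size_t countParts(std::u16string_view text, char16_t sep)
{
    std::size_t n = 0;
    for (std::u16string_view part : Utf16Splitter(text, sep)) {
        (void)part;
        ++n;
    }
    return n;
}
```

Each iterator carries only a view, a separator and two indices, so iterating allocates nothing regardless of how many parts the text contains.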

We will also co-evolve QLatin1String together with QStringView, making QLatin1String as full-blown a view type for chars as QStringView is for QChars.

You can follow QStringView development on this blog and on Gerrit.

Stay tuned!

18 Comments

What's the reason that makes passing a const QStringView& worse than passing it by value? Indirection?

See Chandler Carruth's BoostCon 2013 presentation. Or any other Chandler Carruth presentation ever given :)

TL;DR: Pass by value takes memory out of the picture, simplifying the optimiser's job considerably: values are not forced on the stack with most platform ABIs, and aliasing is not a problem.

Interesting. I thought that Qt's equivalent of std::string_view was the QStringRef class. It would be nice if you'd explain the key differences between QStringRef and QStringView.

Indeed, thanks for the suggestion.

In all brevity: QStringRef cannot reference non-QString-backed data, because it holds a const QString*, a position and a length inside that string. QStringView, otoh, is just a pointer to the character data and a size, and thus agnostic to the owning container. It may, but does not have to be a QString.

So extending QStringRef instead of introducing a new type would keep the Qt API cleaner. Have you considered it?

QStringRef has certain guarantees (it's stable under reallocations of its string()) that were specifically designed into it. If I were to re-use QStringRef for what QStringView is designed to solve, I would have to do the whole work as an almost-atomic operation between Qt 5 and Qt 6. And I'd still break existing out-of-tree users in the process. I wanted something that was possible to implement here and now, and less disruptive.
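
The two designs discussed in this thread can be modelled with standard types. In the sketch below, RefLike and ViewLike are hypothetical structs that only mimic the data layout described above (string pointer plus position and length versus character pointer plus size); they are not the Qt classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical model of a QStringRef-style reference: tied to one owning string.
struct RefLike {
    const std::u16string *string;  // must point at a real string object
    std::size_t position;
    std::size_t length;

    // The characters are looked up through the string on every access,
    // so the reference survives a reallocation of the string's buffer.
    std::u16string_view view() const
    {
        assert(string && position + length <= string->size());
        return std::u16string_view(*string).substr(position, length);
    }
};

// Hypothetical model of a QStringView-style view: pointer and size only.
struct ViewLike {
    const char16_t *data;
    std::size_t size;

    std::u16string_view view() const { return std::u16string_view(data, size); }
};

bool refSurvivesReallocation()
{
    std::u16string s = u"QString::left()";
    const RefLike ref{&s, 0, 7};
    s.append(1000, u' ');  // growing this much forces a new buffer in practice
    return ref.view() == u"QString";
    // A ViewLike taken from s before the append would now dangle.
}

bool viewNeedsNoOwner()
{
    static const char16_t literal[] = u"QStringView";
    const ViewLike view{literal, 7};  // no owning string object anywhere
    return view.view() == u"QString";
}
```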

Another question: can QStringView work with a kind of string where the data is not contiguous? Say that one needs to implement a text editor, and considers storing the edited text as a gap buffer, rope, sequence of lines, or whatever.

I also wondered whether this could work with QStringIterator, but I saw one commit that made use of it, so it seems the answer is yes.

Thanks.

QStringView, like std::string_view, expects characters to be contiguous. It cannot represent a rope.

QStringIterator is already ported to QStringView, yes, but since it already sported a (QChar*, QChar*) constructor, you could've passed (and can still pass) begin() and end() of a QStringView even if it wasn't.

Maybe it's a stupid question, but... As far as I understand, the QStringView fixes performance issues with the QString. Then why just not fix the QString itself?

A string-view is conceptually similar, if not identical, to the STL design of separating algorithms from containers by having containers provide, and algorithms work with, iterators. A function taking a string-view is an algorithm on characters. The string-view is the iterator pair, and which container the algorithm works on is abstracted. Only, because we're working with a rather restricted set of value types and only contiguous memory, we don't need to write our algorithms as template functions. A normal function taking QStringView will do, because const QChar* is always the iterator.
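
A small sketch of that comparison, using std::u16string_view in place of QStringView. countSpacesGeneric() and countSpaces() are hypothetical names: the first is the STL-style template over an iterator pair, the second is the ordinary function that a view makes possible.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// STL style: a template, instantiated once per iterator type.
template <typename Iterator>
std::size_t countSpacesGeneric(Iterator first, Iterator last)
{
    std::size_t n = 0;
    for (; first != last; ++first) {
        if (*first == u' ')
            ++n;
    }
    return n;
}

// View style: one normal function. The view is the iterator pair, and the
// iterator type is always const char16_t*, whatever container owns the data.
std::size_t countSpaces(std::u16string_view text) noexcept
{
    return countSpacesGeneric(text.data(), text.data() + text.size());
}
```

The view version can live in a .cpp file and be called with a std::u16string, a literal, or the contents of a std::vector<char16_t> wrapped as std::u16string_view(v.data(), v.size()).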

As for fixing QString: There are many things that I'd like to see fixed in QString, and I've mentioned them in the article. But a string class needs to hold strings of arbitrary size. So it must (eventually) allocate memory, and own it. That makes QString a container and fundamentally different from a string-view.

I think I fail to appreciate your section regarding preconditions. You write

Such functions have preconditions. We say they have a narrow contract. Only certain combinations of the two parameters are allowed: The length must be non-negative, and the pointer mustn’t be nullptr unless the length is zero, too.

However, the 'isValidIdentifier' function appears to be a total function. A null pointer and negative lengths are perfectly fine as it is, and the function (probably rightfully so) rejects them as valid identifiers.

You proceed to state that

If a function takes a QString instead, it has no preconditions. We say it has a wide contract: any QString is generally acceptable, and valid.

However, you'd surely test for a QString to be non-empty (i.e. the equivalent of your previous len <= 0 test) before proceeding, no?

I believe your point would be better made if you assert(!) that the data pointer is non-null. This would make it a partial function, and data being non-null would clearly be a precondition. This would also nicely lead to showing how a real QString does away with this precondition since you now pass a reference which cannot be null.

The traditional isValidIdentifier() is not a total function:

auto id = u"Hello";
if (isValidIdentifier(cast...(id), 15)) // ERROR: precondition violation
                                        // [ptr, len) is not a valid range

And neither is QStringView's constructor taking the same arguments:

auto sv = QStringView{u"Hello", 15}; // ERROR: precondition violation
                                     // [ptr, len) is not a valid range

Consequently, that constructor is not noexcept.

This is subtle, I know: If isValidIdentifier() is ported to QStringView it becomes a total function:

auto id = u"Hello";
if (isValidIdentifier(QStringView(id, 15))) // ERROR: precondition violation
                                            // [ptr, len) is not a valid range
                                            // _while constructing QStringView_!

Crucially, the UB now happens outside the function, in the QStringView constructor, just as in the second example.

If you think there's no difference, consider this: If I have some sanitizer API that would allow me to assert that a [ptr, len) range is valid, I could report the error. In the traditional case, I'd need to detect and report it inside isValidIdentifier() (and in all other such functions). With QStringView, it's detected and reported from the QStringView ctor.

https://codereview.qt-project.org/193707

Yes, I think I see what you're getting at - it makes perfect sense.

I suspect it may just be my lack of experience with QStringView which keeps me from acknowledging that using QStringView actually makes isValidIdentifier a total function. Or maybe it's because my idea of what constitutes a 'precondition' differs from yours (to me, it's a pre-condition of a piece of code which is not currently expressed in the type system but which has to be asserted at runtime).

My understanding is that QStringView itself does not verify that the given range is valid. It also doesn't create a copy of the data. Hence, the precondition on isValidIdentifier is still that a valid range is passed. It's just that instead of passing a starting address and a length, a QStringView is passed - but there's nothing in the type system or in the constructor of QStringView which can enforce that the given QStringView denotes a valid range. I.e. the QStringView constructor still permits constructing invalid ranges, so there is a (wide) range of invalid QStringView objects possible for which isValidIdentifier is not well-defined, and hence partial.

Unfortunately I cannot seem to figure out how to do syntax highlighting in this blog, but to give an example of what I mean: If you consider this function for getting the first character of a string to be partial:

QChar firstCharacter( const QString &s ) {
  return s[0]; // Oops - what if it's an empty string?
}

...then this might be a way to make it a total function, by using the type system and enforcing the precondition in the constructor:

struct NonEmptyString {
  QString value;
  NonEmptyString( const QString &s ) {
    if ( s.isEmpty() ) {
      throw std::logic_error( "empty string passed" );
    }
  }
};

QChar firstCharacter( const NonEmptyString &s ) {
  return s.value[0]; // Fine - no NonEmptyString object ever exists having an empty value member
}

My impression is that QStringView does not give any such guarantees since the behaviour of QStringView's constructor is undefined for empty ranges.

A pre-condition of a function is a condition that needs to be true on the arguments of the function in order for the function to realise its post-conditions. Calling a function without all pre-conditions met is undefined behaviour. You seem to want it to have defined behaviour. Here's why that's a fallacy:

Yes, ideally, a precondition would be a predicate in the same language as the function. But that is frequently not possible, or even if it is possible, it's not desirable to check it.

E.g. std::lower_bound(first, last, value, cmp) has the following pre-conditions, which are usually not assertable:

[first, last) is a valid range: not checkable in C++, but can be checked with a Valgrind hook for contiguous iterators. Cannot be checked at all for non-contiguous iterators (e.g. QList::iterator).

[first, last) is sorted according to cmp: checkable, but the check is O(N) (and changes the requirements on cmp!) while the functionality is O(log N), so usually not asserted.

cmp is a strict weak ordering: not checkable, unless by exhaustive testing, which is only possible for fixed-size Domain(cmp). If cmp is calling strcmp(), then you've lost.

But even though these are not checkable (or too expensive to check), they're still pre-conditions of the function, and failing to meet them means the function will not reliably meet its post-conditions.
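
The cost asymmetry can be seen in a small sketch. checkedLowerBound() and lowerBoundIndex() are hypothetical helpers, not standard-library functions. Taking a whole std::vector sidesteps the valid-range question, the sortedness check is asserted in debug builds only, and the strict-weak-ordering requirement on cmp stays unchecked.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical wrapper around std::lower_bound.
template <typename T, typename Compare>
typename std::vector<T>::const_iterator
checkedLowerBound(const std::vector<T> &values, const T &value, Compare cmp)
{
    // Checkable, but O(N), and it calls cmp(element, element), which
    // std::lower_bound itself never does. Compiled out when NDEBUG is defined.
    assert(std::is_sorted(values.begin(), values.end(), cmp));

    // Still a precondition, but not checked: cmp is a strict weak ordering.
    return std::lower_bound(values.begin(), values.end(), value, cmp);  // O(log N)
}

// Usage: index of the first element that is not less than `value`.
std::ptrdiff_t lowerBoundIndex(const std::vector<int> &sorted, int value)
{
    return checkedLowerBound(sorted, value, std::less<int>()) - sorted.begin();
}
```

Calling lowerBoundIndex() with an unsorted vector is still a precondition violation in a release build; the assertion merely turns it into a diagnostic in a debug build.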

So, the QStringView ctor checks the cheap preconditions, but fails to check the expensive or uncheckable ones. That doesn't mean you're free to create a QStringView with too large size. You're still violating preconditions, and you're still invoking UB.

Did you see https://codereview.qt-project.org/193707 ?

I concur with every word you write, but I feel I'm drawing a different conclusion. You are of course perfectly right when you say that it's frequently not possible (or practical) to assert preconditions in code. std::lower_bound is a good example.

In the same way, the traditional isValidIdentifier function had preconditions which code cannot easily verify (that the given range is valid). And indeed, QStringView inherits this behaviour in that the QStringView has no practical way to verify that the given range is valid. I believe that so far, we're on the same page.

Now, the conclusion I draw from this (and I think this is where we diverge) is that the new, QStringView-based definition of isValidIdentifier is still as easy (or hard) to call correctly as before. The contract is as wide (or narrow) as before - since the function still asserts that the given view denotes a valid range. A QStringView is isomorphic to a char */int tuple, and the set of QStringView objects which violate the precondition is as large as the set of parameter combinations with which the traditional isValidIdentifier function must not be called.

In the same vein, in my understanding, the contract of std::lower_bound would not change if instead of

std::lower_bound(first, last, value, cmp)

it would be declared as e.g.

std::lower_bound(range, value, cmp)

With range being something like

  template <typename Iterator>
  struct Range {
    Iterator start;
    Iterator end;
  };

The contract of std::lower_bound will still include that you have to pass a valid range. It just happens that the range is no longer expressed as two iterators but as a Range object (which does not enforce that the range is valid!).

You pointed out that with the traditional definition of isValidIdentifier "Only certain combinations of the two parameters are allowed". My understanding is that the same holds true if isValidIdentifier is given a QStringView - only certain QStringView objects (those which denote a valid range) are allowed. It would be a very different story if isValidIdentifier was given a real QString, which took a copy of the data.

That's what I meant to express with my NonEmptyString example: by raising an exception in the constructor (and copying the data), it's actually impossible to construct objects which transport empty strings. So the set of possible NonEmptyString objects is actually smaller than the set of all possible QString objects, and hence the contract of the firstCharacter function becomes, in your terms, wider when using NonEmptyString.

Regarding https://codereview.qt-project.org/193707 -- yes, I did notice it, but I didn't quite understand what's going on. I'm not too familiar with Valgrind unfortunately.

A QStringView is isomorphic to a char */int tuple, [...]

This is where we disagree, indeed. A QStringView is a new type. It has (const QChar*, qssize_t) members, yes, but it is not isomorphic to a tuple made out of its data members any more than a QString is isomorphic to a QString::Data*. That's because a C++ class can, and usually does, introduce class invariants. The constructor is responsible for establishing the class invariant, but frequently, if it depends on user-provided arguments, the extent to which it can guarantee that all class invariants have been successfully established is limited. This is where the C++ standard resorts to the phrase "undefined behaviour, no diagnostic required". And a normal C++ class can do the same.

That said, the change I linked attempts to do away with the "no diagnostic" part, by using a Valgrind hook to check, at QStringView construction time, whether the given range contains valid data.

Now, crucially, any function has an implicit pre-condition that its arguments are valid values of the arguments' types. In your definition of a total function, that means such functions cannot have arguments of types that have class invariants. Ok, if that is the definition of a total function then functions taking QStringView are never total, indeed. Neither is a function taking NonEmptyString total, though, since NonEmptyString clearly has class invariants, too.

I just noticed that since the NonEmptyString constructor never actually initializes the value member, the comment in the second firstCharacter definition is very misleading; every NonEmptyString object will have an empty value member. Oops. :-)
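
For completeness, a version of the example in which the member is actually initialised might look like the following sketch. It uses std::u16string as a stand-in for QString, and makes the member private so the invariant cannot be broken after construction; both choices are assumptions of this sketch, not part of the example above.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Sketch only: std::u16string stands in for QString.
class NonEmptyString {
public:
    explicit NonEmptyString(std::u16string s)
        : m_value(std::move(s))  // the member is initialised this time
    {
        // Establish the class invariant: m_value is never empty.
        if (m_value.empty())
            throw std::logic_error("empty string passed");
    }

    const std::u16string &value() const noexcept { return m_value; }

private:
    std::u16string m_value;  // private, so nobody can empty it later
};

// Relies on the class invariant: value()[0] always exists.
char16_t firstCharacter(const NonEmptyString &s) noexcept
{
    return s.value()[0];
}
```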