There and Back Again

what it entails to fix 10-year-old bugs in QNetworkAccessManager

12 October 2020

The problem

Using a network sniffer like Wireshark, the problem quickly emerged: each SOAP request, transported through a POST method, was sent by QNAM split over multiple TCP segments.

Let's analyze the packet dump together. We have 192.168.1.1, which is the client running KD Soap, and 192.168.1.254 running a HTTP server.

In the first few exchanged segments, nothing special happens:

No.     Time           Source                Destination           Protocol Length Info
      2  0.003375      192.168.1.1           192.168.1.254         TCP      66     55212 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1
      3  0.003504      192.168.1.254         192.168.1.1           TCP      66     80 → 55212 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=128
      4  0.003578      192.168.1.1           192.168.1.254         TCP      54     55212 → 80 [ACK] Seq=1 Ack=1 Win=2102272 Len=0
      5  0.003702      192.168.1.1           192.168.1.254         TCP      333    55212 → 80 [PSH, ACK] Seq=1 Ack=1 Win=2102272 Len=279 [TCP segment of a reassembled PDU]
      6  0.003781      192.168.1.254         192.168.1.1           TCP      60     80 → 55212 [ACK] Seq=1 Ack=280 Win=30336 Len=0
      7  0.003815      192.168.1.1           192.168.1.254         HTTP/XML 349    POST / HTTP/1.1
      8  0.003894      192.168.1.254         192.168.1.1           TCP      60     80 → 55212 [ACK] Seq=1 Ack=575 Win=31360 Len=0
      9  0.005749      192.168.1.254         192.168.1.1           HTTP/XML 655    HTTP/1.1 200 OK

In order:

Segments 2-4 are the typical TCP three-way handshake, connecting to the server.
Segment 5 contains the first part of the first SOAP request.
Segment 6 quickly acknowledges the data sent by segment 5.
Segment 7 completes the first SOAP request (that is, completes the HTTP POST).
Segment 8 acknowledges the data sent in segment 7. (Note: Please ignore the length field -- it does not refer to the TCP data length, but to the request length; that's why the acknowledged data does not match.)
Segment 9 sends the result back to the client.

The whole exchange (establishing the connection, sending the request, getting an answer) took less than 3 milliseconds. Pretty reasonable, considering everything is happening inside the same local network.

Now this is when things start going wrong. The client did not make just one request; it made several, in quick succession. These requests got queued inside QNAM. QNAM implements HTTP persistent connections, so the TCP connection is not immediately shut down after the request has been served. Instead, once QNAM receives the reply, it's able to immediately send another request. This happens in the next exchange:

No.     Time           Source                Destination           Protocol Length Info
     10  0.006527      192.168.1.1           192.168.1.254         TCP      333    55212 → 80 [PSH, ACK] Seq=575 Ack=602 Win=2101760 Len=279 [TCP segment of a reassembled PDU]
     11  0.049994      192.168.1.254         192.168.1.1           TCP      60     80 → 55212 [ACK] Seq=602 Ack=854 Win=32512 Len=0
     12  0.050063      192.168.1.1           192.168.1.254         HTTP/XML 349    POST / HTTP/1.1
     13  0.050269      192.168.1.254         192.168.1.1           TCP      60     80 → 55212 [ACK] Seq=602 Ack=1149 Win=33536 Len=0
     14  0.051334      192.168.1.254         192.168.1.1           HTTP/XML 655    HTTP/1.1 200 OK

Here we have:

Segment 10 acknowledges the data received by the server, and sends part of a new request.
Segment 11 acknowledges the data for the new request.
Segment 12 completes the request (sending more data).
Segment 13 acknowledges the data sent to complete the request.
Segment 14 sends the result back to the client.

It looks pretty much like the last exchange, minus establishing the connection, doesn't it? There's however a small but important difference: if you carefully look at the timestamps, you may notice that this request took almost 50ms to be served! That's 10-20 times more than the first one!

If we keep going, we have another request, and this one also takes 50ms:

No.     Time           Source                Destination           Protocol Length Info
     15  0.052316      192.168.1.1           192.168.1.254         TCP      333    55212 → 80 [PSH, ACK] Seq=1149 Ack=1203 Win=2100992 Len=279 [TCP segment of a reassembled PDU]
     16  0.099992      192.168.1.254         192.168.1.1           TCP      60     80 → 55212 [ACK] Seq=1203 Ack=1428 Win=34560 Len=0
     17  0.100047      192.168.1.1           192.168.1.254         HTTP/XML 349    POST / HTTP/1.1
     18  0.100315      192.168.1.254         192.168.1.1           TCP      60     80 → 55212 [ACK] Seq=1203 Ack=1723 Win=35712 Len=0
     19  0.101336      192.168.1.254         192.168.1.1           HTTP/XML 655    HTTP/1.1 200 OK

These delays would just keep piling up; in order to send 20 requests we take over one second, rather than less than 0.1s (as it should be). What's going on here? Where do these delays come from?

Diagnosis

As hinted in the bug report, the problem is a very subtle interaction between QNAM and the underlying TCP stack. The TCP protocol itself, as well as the API that operating systems offer to use it (e.g. the socket(2) APIs on Unix) feature a lot of knobs to fine tune TCP's behavior. This is intentional, and done in order to make it perform better: reduce latency, increase the used bandwidth, reduce the number of segments exchanged on the network. Some of these knobs are entirely opt-in because whether they make sense or not depends on the application.

A couple of these optimizations are usually enabled by default, and are the cause of the problem here.

The first one is called delayed acknowledgments. In order to reduce the load on the network, one side of a TCP connection may decide to wait a bit before sending out an acknowledgement for the data it has just received. (Of course, it must not wait too much, or the sender may trigger a retransmission thinking that the data it just sent got lost!) It's a pretty simple idea: give the other side the chance of sending more data; if this happens, the receiving side may send only one segment acknowledging all the data received. (Yes, this is perfectly legal in TCP.)

The second one is called Nagle's algorithm. This is an algorithm that again tries to reduce the network load by having one side not transmit data too often (until certain conditions are met). The idea is that if a TCP application has to send small chunks of data, it would be better for the TCP stack to buffer all the data and then send it out in one segment. Of course, sooner or later the data has to be sent, and the algorithm controls when.

How does QNAM manage to break all this? When QNAM has to issue an HTTP request with a body (such as POST or PUT), it does two different writes into the TCP socket: first it writes the HTTP headers, then it writes the body of the request. This "double write" causes an interesting chain of events:

Client: QNAM writes the HTTP headers into the socket
Client: the TCP socket sends the headers right away
Server: receives the data, but does not send the ACK yet, because of delayed acknowledgments
Client: QNAM writes the HTTP body into the socket
Client: the TCP socket does not transmit the HTTP body, because Nagle's algorithm says so (there is 1 pending acknowledgment)
Nothing happens for 50ms
Server: the delayed ack timer times out, so it sends the ACK (just for the HTTP headers)
Client: receives the ACK, Nagle's algorithm says it can transmit more data, so it sends the rest of the request
Server: receives the rest of the request; the request is complete, the SOAP server handles it, creates a reply, sends the ACK for the rest of the request and the reply

Steps 2-4 can happen in a slightly different order, but the result does not change: we have a 50ms delay, introduced by TCP, which is affecting our requests.

Workarounds

What can we do to fix this? This should really be fixable: the overall data we want to transmit (headers+body) fits in one TCP segment -- sending two and having to wait before transmitting the second sounds just wrong.

Note that we do not necessarily control the server, so disabling delayed ACKs is out of the question.

The first thing we could think of is to enable buffering on the TCP socket. QTcpSocket, like all QIODevice subclasses, can be opened in buffered or in unbuffered mode. By default, QNAM opens it in unbuffered mode: it does not want an extra copy of the data to be done (from the application, into the TCP socket, then into the kernel; and similarly the other way around). However, opening it in buffered mode does not help at all. What happens is that QTcpSocket copies the data into its own write buffer, and then immediately tries to flush it. Typically the kernel will allow the flush to happen (because there's not enough outgoing data filling kernel's outgoing buffers); and this results in the same chain of events, with each write (by QNAM) becoming a write into the kernel, and the second one being delayed by Nagle's algorithm.
On the client, we could think of disabling Nagle's algorithm. All operating systems offer a way to do so; the rationale is that if we have to frequently send small chunks of data we don't want to wait for acknowledgements to arrive before sending the next chunk (think, for instance, to a SSH client that has to send individual keystrokes). This option is typically called TCP_NODELAY, and it's even exposed in Qt by the QAbstractSocket::LowDelayOption socket option. For some reason, QNAM has a commented out call to set the option:

// improve performance since we get the request sent by the kernel ASAP
// socket->setSocketOption(QAbstractSocket::LowDelayOption, 1);
// We have this commented out now. It did not have the effect we wanted. If we want to
// do this properly, Qt has to combine multiple HTTP requests into one buffer
// and send this to the kernel in one syscall and then the kernel immediately sends
// it as one TCP packet because of TCP_NODELAY.
// However, this code is currently not in Qt, so we rely on the kernel combining
// the requests into one TCP packet.

Not exactly sure why this comment was made. (This code was written before the Qt source code history had been made public in 2011.) I was not willing to revisit this, so I decided to leave it alone, at least for for the moment being.

The Linux socket API has the ideal option for this use case. It's the TCP_CORK socket option. Its effects are to delay the sending of the data written into the socket until a timeout expires, or the option is toggled. This allows a client (or a server) to write in the socket multiple times, and the kernel will take care of coalescing the writes into fewer TCP segments. (The interesting bit is that the writes can happen using different system calls, such as write(2) for the headers followed by sendfile(2) for the body.). While this works, it's of course OS-specific -- there is nothing comparable for any other operating system -- so it's not a good candidate for fixing the problem in Qt (which is, after all, a cross platform toolkit).
Finally, the above comment in QNAM contains the actual solution. We can refactor QNAM to avoid writing the headers and then the rest of the request, using separate writes. Instead, it should try to combine them, if possible, into one single write. This should ensure that, once a complete request is sent by the client, the server can act on it and immediately write the reply.

I have implemented and submitted this last solution to Qt.

Enough talk! Give me the code!

Ok, ok... the patch that fixes the problem is here. At this time, it will only be present in Qt 6.0. Qt 5 is in "bugfix only" mode, and since this issue has been present throughout the entire Qt 5 history, it was not deemed safe to apply the change there as well.

Still, it doesn't mean you cannot just cherry-pick the patch into your local build of Qt 5, if your applications are affected. It should apply without issues. In fact, the customer was able to try it out and confirm that it solves the original issue they were seeing. Whew!

The moral lesson

At first I didn't want to write a blog post simply to discuss "yet another bugfix that landed into Qt".

But then, I decided to take this opportunity to also say something else. Many of our customers are (positively!) surprised by the huge skillset found within KDAB. By hiring us, customers don't get just someone to write some code for them; they get access to a great talent pool. Within KDAB there's enormous technical knowledge. Each engineer is of course broadly skilled in computer science and programming, but they then have tremendous specialized skills in one or more focus areas:

Win32 programming? Check.
Low level debugging and profiling? Check.
3D graphics? Check.
Usability and design? Check.
Hacking into bluetooth stacks? Check.
Embedded hardware? Check.
Cooking? Check.
Acrobatics? Check.
Aviation? Check.
Modern jazz? Check.
Mechanical keyboards? Check.
Beer crafting? Check, check, and check.

Valve's Handbook for New Employees puts it in a very nice way:

We value “T-shaped” people. That is, people who are both generalists (highly skilled at a broad set of valuable things — the top of the T) and also experts (among the best in their field within a narrow discipline — the vertical leg of the T).

Or, to use the words of my ex-colleague Jean-Michaël:

All the above bugfixing story is actually about knowing how things work, all the way from HTTP, into Qt's code, and down to low level details of TCP/IP programming. Does it mean that I'm just "that good" and happen to know just about everything? We both know that is not true.

This is possible, not only because KDAB hires T-shaped individuals, but also because it nurtures its employees so that their strong "vertical" skills can actually grow over time, while also embracing more "horizontal" skills. We are constantly encouraged in our professional growth, and that's not something always found out there.

That's what makes the difference for KDAB.

In the meanwhile, thanks for reading!

Tags:

linux performance qt tools

About KDAB

The KDAB Group is a globally recognized provider for software consulting, development and training, specializing in embedded devices and complex cross-platform desktop applications. In addition to being leading experts in Qt, C++ and 3D technologies for over two decades, KDAB provides deep expertise across the stack, including Linux, Rust and modern UI frameworks. With 100+ employees from 20 countries and offices in Sweden, Germany, USA, France and UK, we serve clients around the world.

Giuseppe D’Angelo

Senior Software Engineer

Senior Software Engineer at KDAB. Giuseppe is a long-time contributor to Qt, having used Qt and C++ since 2000, and is an Approver in the Qt Project. His contributions in Qt range from containers and regular expressions to GUI, Widgets, and OpenGL. A free software passionate and UNIX specialist, before joining KDAB, he organized conferences on opensource around Italy. He holds a BSc in Computer Science.

There and Back Again

what it entails to fix 10-year-old bugs in QNetworkAccessManager

The problem

Diagnosis

Workarounds

Enough talk! Give me the code!

The moral lesson

Related Content

Sign up for the KDAB Newsletter

There and Back Again

what it entails to fix 10-year-old bugs in QNetworkAccessManager

The problem

Diagnosis

Workarounds

Enough talk! Give me the code!

The moral lesson

Related Content

Hotspot v1.5.0 released

Heaptrack v1.3.0 Release

Full Stack Tracing Part 2

Sign up for the KDAB Newsletter