There and Back Again what it entails to fix 10-year-old bugs in QNetworkAccessManager
A few days ago, a customer using our KD Soap library reported an interesting problem with it. The library worked perfectly to access the SOAP server, but there was some noticeable latency when doing a series of requests in short succession from the client. This latency should not have been there, as both the server and the client were in the same local network.
An investigation began, regarding whether the client code or KD Soap were doing something suboptimal, like reconnecting to the server all the time, or something like that. The actual reason turned out to be something much more profound: a 6-year-old bug in QNetworkAccessManager (at least, that’s when it was reported. I think the bug is at least 10 years old…)
What’s QNetworkAccessManager (or QNAM, for the friends)? It’s the central class in Qt to issue HTTP/FTP requests (incl. HTTPS, HTTP/2, etc.); KD Soap uses it to implement SOAP over HTTP.
So, “not our bug”, problem solved… maybe for us, but certainly not for our customer!
Therefore, I took a shot at the issue anyhow, as personal curiosity. (Mandatory reminder: KDAB contributes a lot of code to Qt!)
Using a network sniffer like Wireshark, the problem quickly emerged: each SOAP request, transported through a POST method, was sent by QNAM split over multiple TCP segments.
Let’s analyze the packet dump together. We have 192.168.1.1, which is the client running KD Soap, and 192.168.1.254 running a HTTP server.
In the first few exchanged segments, nothing special happens:
No. Time Source Destination Protocol Length Info 2 0.003375 192.168.1.1 192.168.1.254 TCP 66 55212 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM=1 3 0.003504 192.168.1.254 192.168.1.1 TCP 66 80 → 55212 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=128 4 0.003578 192.168.1.1 192.168.1.254 TCP 54 55212 → 80 [ACK] Seq=1 Ack=1 Win=2102272 Len=0 5 0.003702 192.168.1.1 192.168.1.254 TCP 333 55212 → 80 [PSH, ACK] Seq=1 Ack=1 Win=2102272 Len=279 [TCP segment of a reassembled PDU] 6 0.003781 192.168.1.254 192.168.1.1 TCP 60 80 → 55212 [ACK] Seq=1 Ack=280 Win=30336 Len=0 7 0.003815 192.168.1.1 192.168.1.254 HTTP/XML 349 POST / HTTP/1.1 8 0.003894 192.168.1.254 192.168.1.1 TCP 60 80 → 55212 [ACK] Seq=1 Ack=575 Win=31360 Len=0 9 0.005749 192.168.1.254 192.168.1.1 HTTP/XML 655 HTTP/1.1 200 OK
- Segments 2-4 are the typical TCP three-way handshake, connecting to the server.
- Segment 5 contains the first part of the first SOAP request.
- Segment 6 quickly acknowledges the data sent by segment 5.
- Segment 7 completes the first SOAP request (that is, completes the HTTP POST).
- Segment 8 acknowledges the data sent in segment 7. (Note: Please ignore the length field — it does not refer to the TCP data length, but to the request length; that’s why the acknowledged data does not match.)
- Segment 9 sends the result back to the client.
The whole exchange (establishing the connection, sending the request, getting an answer) took less than 3 milliseconds. Pretty reasonable, considering everything is happening inside the same local network.
Now this is when things start going wrong. The client did not make just one request; it made several, in quick succession. These requests got queued inside QNAM. QNAM implements HTTP persistent connections, so the TCP connection is not immediately shut down after the request has been served. Instead, once QNAM receives the reply, it’s able to immediately send another request. This happens in the next exchange:
No. Time Source Destination Protocol Length Info 10 0.006527 192.168.1.1 192.168.1.254 TCP 333 55212 → 80 [PSH, ACK] Seq=575 Ack=602 Win=2101760 Len=279 [TCP segment of a reassembled PDU] 11 0.049994 192.168.1.254 192.168.1.1 TCP 60 80 → 55212 [ACK] Seq=602 Ack=854 Win=32512 Len=0 12 0.050063 192.168.1.1 192.168.1.254 HTTP/XML 349 POST / HTTP/1.1 13 0.050269 192.168.1.254 192.168.1.1 TCP 60 80 → 55212 [ACK] Seq=602 Ack=1149 Win=33536 Len=0 14 0.051334 192.168.1.254 192.168.1.1 HTTP/XML 655 HTTP/1.1 200 OK
Here we have:
- Segment 10 acknowledges the data received by the server, and sends part of a new request.
- Segment 11 acknowledges the data for the new request.
- Segment 12 completes the request (sending more data).
- Segment 13 acknowledges the data sent to complete the request.
- Segment 14 sends the result back to the client.
It looks pretty much like the last exchange, minus establishing the connection, doesn’t it? There’s however a small but important difference: if you carefully look at the timestamps, you may notice that this request took almost 50ms to be served! That’s 10-20 times more than the first one!
If we keep going, we have another request, and this one also takes 50ms:
No. Time Source Destination Protocol Length Info 15 0.052316 192.168.1.1 192.168.1.254 TCP 333 55212 → 80 [PSH, ACK] Seq=1149 Ack=1203 Win=2100992 Len=279 [TCP segment of a reassembled PDU] 16 0.099992 192.168.1.254 192.168.1.1 TCP 60 80 → 55212 [ACK] Seq=1203 Ack=1428 Win=34560 Len=0 17 0.100047 192.168.1.1 192.168.1.254 HTTP/XML 349 POST / HTTP/1.1 18 0.100315 192.168.1.254 192.168.1.1 TCP 60 80 → 55212 [ACK] Seq=1203 Ack=1723 Win=35712 Len=0 19 0.101336 192.168.1.254 192.168.1.1 HTTP/XML 655 HTTP/1.1 200 OK
These delays would just keep piling up; in order to send 20 requests we take over one second, rather than less than 0.1s (as it should be). What’s going on here? Where do these delays come from?
As hinted in the bug report, the problem is a very subtle interaction between QNAM and the underlying TCP stack. The TCP protocol itself, as well as the API that operating systems offer to use it (e.g. the socket(2) APIs on Unix) feature a lot of knobs to fine tune TCP’s behavior. This is intentional, and done in order to make it perform better: reduce latency, increase the used bandwidth, reduce the number of segments exchanged on the network. Some of these knobs are entirely opt-in because whether they make sense or not depends on the application.
A couple of these optimizations are usually enabled by default, and are the cause of the problem here.
The first one is called delayed acknowledgments. In order to reduce the load on the network, one side of a TCP connection may decide to wait a bit before sending out an acknowledgement for the data it has just received. (Of course, it must not wait too much, or the sender may trigger a retransmission thinking that the data it just sent got lost!) It’s a pretty simple idea: give the other side the chance of sending more data; if this happens, the receiving side may send only one segment acknowledging all the data received. (Yes, this is perfectly legal in TCP.)
The second one is called Nagle’s algorithm. This is an algorithm that again tries to reduce the network load by having one side not transmit data too often (until certain conditions are met). The idea is that if a TCP application has to send small chunks of data, it would be better for the TCP stack to buffer all the data and then send it out in one segment. Of course, sooner or later the data has to be sent, and the algorithm controls when.
How does QNAM manage to break all this? When QNAM has to issue an HTTP request with a body (such as POST or PUT), it does two different writes into the TCP socket: first it writes the HTTP headers, then it writes the body of the request. This “double write” causes an interesting chain of events:
- Client: QNAM writes the HTTP headers into the socket
- Client: the TCP socket sends the headers right away
- Server: receives the data, but does not send the ACK yet, because of delayed acknowledgments
- Client: QNAM writes the HTTP body into the socket
- Client: the TCP socket does not transmit the HTTP body, because Nagle’s algorithm says so (there is 1 pending acknowledgment)
- Nothing happens for 50ms
- Server: the delayed ack timer times out, so it sends the ACK (just for the HTTP headers)
- Client: receives the ACK, Nagle’s algorithm says it can transmit more data, so it sends the rest of the request
- Server: receives the rest of the request; the request is complete, the SOAP server handles it, creates a reply, sends the ACK for the rest of the request and the reply
Steps 2-4 can happen in a slightly different order, but the result does not change: we have a 50ms delay, introduced by TCP, which is affecting our requests.
What can we do to fix this? This should really be fixable: the overall data we want to transmit (headers+body) fits in one TCP segment — sending two and having to wait before transmitting the second sounds just wrong.
Note that we do not necessarily control the server, so disabling delayed ACKs is out of the question.
- The first thing we could think of is to enable buffering on the TCP socket. QTcpSocket, like all QIODevice subclasses, can be opened in buffered or in unbuffered mode. By default, QNAM opens it in unbuffered mode: it does not want an extra copy of the data to be done (from the application, into the TCP socket, then into the kernel; and similarly the other way around). However, opening it in buffered mode does not help at all. What happens is that QTcpSocket copies the data into its own write buffer, and then immediately tries to flush it. Typically the kernel will allow the flush to happen (because there’s not enough outgoing data filling kernel’s outgoing buffers); and this results in the same chain of events, with each write (by QNAM) becoming a write into the kernel, and the second one being delayed by Nagle’s algorithm.
- On the client, we could think of disabling Nagle’s algorithm. All operating systems offer a way to do so; the rationale is that if we have to frequently send small chunks of data we don’t want to wait for acknowledgements to arrive before sending the next chunk (think, for instance, to a SSH client that has to send individual keystrokes). This option is typically called TCP_NODELAY, and it’s even exposed in Qt by the QAbstractSocket::LowDelayOption socket option. For some reason, QNAM has a commented out call to set the option:
// improve performance since we get the request sent by the kernel ASAP //socket-&amp;gt;setSocketOption(QAbstractSocket::LowDelayOption, 1); // We have this commented out now. It did not have the effect we wanted. If we want to // do this properly, Qt has to combine multiple HTTP requests into one buffer // and send this to the kernel in one syscall and then the kernel immediately sends // it as one TCP packet because of TCP_NODELAY. // However, this code is currently not in Qt, so we rely on the kernel combining // the requests into one TCP packet.
Not exactly sure why this comment was made. (This code was written before the Qt source code history had been made public in 2011.) I was not willing to revisit this, so I decided to leave it alone, at least for for the moment being.
- The Linux socket API has the ideal option for this use case. It’s the TCP_CORK socket option. Its effects are to delay the sending of the data written into the socket until a timeout expires, or the option is toggled. This allows a client (or a server) to write in the socket multiple times, and the kernel will take care of coalescing the writes into fewer TCP segments. (The interesting bit is that the writes can happen using different system calls, such as write(2) for the headers followed by sendfile(2) for the body.). While this works, it’s of course OS-specific — there is nothing comparable for any other operating system — so it’s not a good candidate for fixing the problem in Qt (which is, after all, a cross platform toolkit).
- Finally, the above comment in QNAM contains the actual solution. We can refactor QNAM to avoid writing the headers and then the rest of the request, using separate writes. Instead, it should try to combine them, if possible, into one single write. This should ensure that, once a complete request is sent by the client, the server can act on it and immediately write the reply.
I have implemented and submitted this last solution to Qt.
Enough talk! Give me the code!
Ok, ok… the patch that fixes the problem is here. At this time, it will only be present in Qt 6.0. Qt 5 is in “bugfix only” mode, and since this issue has been present throughout the entire Qt 5 history, it was not deemed safe to apply the change there as well.
Still, it doesn’t mean you cannot just cherry-pick the patch into your local build of Qt 5, if your applications are affected. It should apply without issues. In fact, the customer was able to try it out and confirm that it solves the original issue they were seeing. Whew!
The moral lesson
At first I didn’t want to write a blog post simply to discuss “yet another bugfix that landed into Qt”.
But then, I decided to take this opportunity to also say something else. Many of our customers are (positively!) surprised by the huge skillset found within KDAB. By hiring us, customers don’t get just someone to write some code for them; they get access to a great talent pool. Within KDAB there’s enormous technical knowledge. Each engineer is of course broadly skilled in computer science and programming, but they then have tremendous specialized skills in one or more focus areas:
- Win32 programming? Check.
- Low level debugging and profiling? Check.
- 3D graphics? Check.
- Usability and design? Check.
- Hacking into bluetooth stacks? Check.
- Embedded hardware? Check.
- Cooking? Check.
- Acrobatics? Check.
- Aviation? Check.
- Modern jazz? Check.
- Mechanical keyboards? Check.
- Beer crafting? Check, check, and check.
Valve’s Handbook for New Employees puts it in a very nice way:
We value “T-shaped” people. That is, people who are both generalists (highly skilled at a broad set of valuable things — the top of the T) and also experts (among the best in their field within a narrow discipline — the vertical leg of the T).
Or, to use the words of my ex-colleague Jean-Michaël:
The technical skill of KDABians is *excellent*, be it in C++ with multiple committee attendees, in GPU programming (hey @sean_harmer @mkrus @zakalawe !), in Qt and KDE techs (hey @tsdgeos @VolkerKrause @sergiokdab !), in debugging, profiling, tooling (hey Milian, hey David !),
— Jean-Michaël Celerier (@jcelerie) July 22, 2020
All the above bugfixing story is actually about knowing how things work, all the way from HTTP, into Qt’s code, and down to low level details of TCP/IP programming. Does it mean that I’m just “that good” and happen to know just about everything? We both know that is not true.
This is possible, not only because KDAB hires T-shaped individuals, but also because it nurtures its employees so that their strong “vertical” skills can actually grow over time, while also embracing more “horizontal” skills. We are constantly encouraged in our professional growth, and that’s not something always found out there.
That’s what makes the difference for KDAB.
(If this made you curious about what it means to work with us, feel free to visit our careers page and drop us a line.)
In the meanwhile, thanks for reading!