Synchronization in Vulkan
Learn about what Vulkan needs us to synchronize and how to achieve it
An important part of working with Vulkan and other modern explicit rendering APIs is the synchronization of GPU↔GPU and CPU↔GPU workloads. In this article we will learn about what Vulkan needs us to synchronize and how to achieve it. We will talk about two high-level parts of the synchronization domain that we, as application and library developers, are responsible for:
- GPU↔GPU synchronization to ensure that certain GPU operations do not occur out of order,
- CPU↔GPU synchronization to ensure that we maintain a certain level of latency and resource usage in our applications.
Whereas in OpenGL we could simply render to the GL_BACK buffer of the default framebuffer and then tell the system to swap the back and front buffers, with Vulkan we have to get more involved. Vulkan exposes the concept of a swapchain of images. This is essentially a collection of textures (VkImages) that are owned and managed by the swapchain and the window system integration (WSI). A typical frame in Vulkan looks something like this:
- Acquire the index of the swapchain image to which we should render.
- Record one or more command buffers that ultimately output to the swapchain image from step 1.
- Submit the command buffers from step 2 to a GPU queue for processing.
- Instruct the GPU presentation engine to display the final rendered swapchain image from step 3.
- Go back to step 1 and start over for the next frame.
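The steps above can be sketched as a naive frame loop. This is a hedged sketch, not a complete implementation: `device`, `swapchain`, `graphicsQueue` and the `record_commands` helper are hypothetical names assumed to exist elsewhere, error handling is omitted, and, crucially, all synchronization is deliberately left out — which is exactly the problem the rest of the article addresses.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

/* Assumed to be created elsewhere in the application (hypothetical names). */
extern VkDevice device;
extern VkSwapchainKHR swapchain;
extern VkQueue graphicsQueue;

/* Hypothetical helper that records this frame's command buffer. */
extern VkCommandBuffer record_commands(uint32_t imageIndex);

void naive_frame(void)
{
    /* Step 1: which swapchain image should we render to?
       NOTE: passing VK_NULL_HANDLE for both the semaphore and the fence
       is actually invalid per the spec; the article adds the required
       semaphore shortly. */
    uint32_t imageIndex;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          VK_NULL_HANDLE, VK_NULL_HANDLE, &imageIndex);

    /* Steps 2 and 3: record and submit the rendering work. */
    VkCommandBuffer cmd = record_commands(imageIndex);
    VkSubmitInfo submitInfo = {
        .sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers    = &cmd,
    };
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);

    /* Step 4: hand the image to the presentation engine. */
    VkPresentInfoKHR presentInfo = {
        .sType          = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
        .swapchainCount = 1,
        .pSwapchains    = &swapchain,
        .pImageIndices  = &imageIndex,
    };
    vkQueuePresentKHR(graphicsQueue, &presentInfo);
}
```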
This may look innocuous at first glance but let’s delve deeper.
A day at the races
In step 1 we are asking the WSI to tell us the index of the next available swapchain image that we may render into. Now, just because this function tells us (and the CPU) that, for example, image index 1 is the image we should use as our render target, it does not mean that the GPU is actually ready to write to this image right now.
It is important to note that we are operating on two distinct timelines. There is the CPU timeline that we are familiar with when writing applications. Then there is also the GPU timeline on which the GPU processes the work that we give to it (from the CPU timeline).
In the case of acquiring a swapchain image index, we are actually asking the GPU to look a little way into the future and tell us which image index will be the next to become ready for writing. However, when we call the function to acquire this image index, the GPU presentation engine may well still be reading from the image in question in order to display its contents from an earlier frame.
Many people coming new to Vulkan (myself included) make the mistake of thinking that acquiring the swapchain image index means the image is ready to go right now. It’s not!
In step 2, we are entirely operating on the CPU timeline and we can safely record command buffers without fear of trampling over anything happening on the GPU.
The same is true in step 3. We can happily submit the command buffers which will render to our swapchain image. However, this is where the problem is triggered. If the GPU presentation engine is still busy reading from the swapchain image when along comes a bundle of work telling the GPU to render into that same image, we have a potential problem. GPUs are thirsty beasts: massively parallel machines that like to do as much as possible concurrently. Without some form of synchronization, it is easy to see that if the GPU begins processing the command buffers, the presentation engine could end up reading data at the same time it is being written by another GPU thread. Say hello to our old friend undefined behaviour!
It is now clear that we need some mechanism to instruct the GPU to not process these command buffers until the GPU presentation engine is done reading from the swapchain image we are rendering to.
The solution for synchronising blocks of GPU work in Vulkan is a semaphore (VkSemaphore).
The way it works is that in our application’s initialisation code, we create a semaphore whose purpose is to ensure that command buffer processing begins only once the GPU presentation engine is done reading from the swapchain image it told us to use.
With this semaphore in hand, we can tell the GPU to switch it to a “signalled” state when the presentation engine is done reading from the image. The other half of the problem is solved when we submit the render command buffers to the GPU by handing the same semaphore to the call to vkQueueSubmit().
We now have this kind of setup:
- At initialisation, create a semaphore (vkCreateSemaphore) in the unsignalled state.
- Pass the above semaphore to vkAcquireNextImageKHR as the semaphore argument so that it is signalled when the image is ready for writing.
- Pass the same semaphore to vkQueueSubmit (as one of the pWaitSemaphores entries of the VkSubmitInfo struct) so that processing of this set of command buffers is deferred until the semaphore is signalled.
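The acquire-side wiring above might look something like the following sketch. The handles `device`, `swapchain`, `graphicsQueue` and `cmdBuf` are assumed to exist (hypothetical names), error handling is omitted, and in a real application the semaphore would be created once at initialisation rather than every frame.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

extern VkDevice device;
extern VkSwapchainKHR swapchain;
extern VkQueue graphicsQueue;
extern VkCommandBuffer cmdBuf;

void acquire_and_submit(void)
{
    /* Create the semaphore (done once at initialisation in practice). */
    VkSemaphore imageAvailable;
    VkSemaphoreCreateInfo semInfo = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
    };
    vkCreateSemaphore(device, &semInfo, NULL, &imageAvailable);

    /* Step 1: ask for the next image index; the semaphore will be
       signalled on the GPU timeline once the image is truly ready. */
    uint32_t imageIndex;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          imageAvailable, VK_NULL_HANDLE, &imageIndex);

    /* Step 3: make the submitted work wait for that signal before the
       colour attachment writes to the swapchain image begin. */
    VkPipelineStageFlags waitStage =
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    VkSubmitInfo submitInfo = {
        .sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores    = &imageAvailable,
        .pWaitDstStageMask  = &waitStage,
        .commandBufferCount = 1,
        .pCommandBuffers    = &cmdBuf,
    };
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
}
```

Note that pWaitDstStageMask lets the GPU start the earlier pipeline stages of our work and only stall at the point where the swapchain image is actually written.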
Phew, we’re all done right? Nope, sadly not. Read on to see what else can go wrong and how to solve it.
I’m not ready to show you my painting
We have solved the race condition on the GPU of preventing the start of the rendering from clobbering the swapchain image whilst the presentation engine may still be reading from it. However, there is currently nothing to prevent the request to begin the presentation of the swapchain image whilst the rendering is still going on!
That is, we have solved the potential race between steps 1 and 3, but there is another race between steps 3 and 4. Luckily the problem is at heart exactly the same. We need to stop some incoming GPU work (the present request in step 4) from stepping on the toes of the already ongoing rendering work from step 3. That is, we need another application of GPU↔GPU synchronization which we know we can do with a semaphore.
To solve this race condition we use the following approach:
- At initialisation, create another unsignalled semaphore.
- In step 3 when we submit the command buffers for rendering, we pass in the semaphore to vkQueueSubmit as one of the pSignalSemaphores arguments.
- In step 4 we then pass this same semaphore to the call to vkQueuePresentKHR as one of the pWaitSemaphores arguments.
This works in a completely analogous way to the first problem that we solved. When we submit the render command buffers for processing, this second semaphore is unsignalled. When the command buffers finish execution, the GPU will transition the semaphore to the signalled state. The call to vkQueuePresentKHR has been configured to ensure the presentation engine waits for this condition to be true before beginning whatever work it needs to do to get that image on to our screen.
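Both halves of this second semaphore's life might be sketched as follows, under the same assumptions as before (hypothetical handle names, creation done once at initialisation, error handling omitted).

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

extern VkDevice device;
extern VkSwapchainKHR swapchain;
extern VkQueue graphicsQueue;
extern VkCommandBuffer cmdBuf;

/* Created unsignalled at initialisation, like the acquire semaphore. */
extern VkSemaphore renderFinished;

void submit_and_present(uint32_t imageIndex)
{
    /* Step 3: ask the GPU to signal renderFinished when the
       submitted command buffers have finished executing. */
    VkSubmitInfo submitInfo = {
        .sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount   = 1,
        .pCommandBuffers      = &cmdBuf,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores    = &renderFinished,
    };
    vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);

    /* Step 4: presentation waits on that same semaphore, so the
       presentation engine cannot read the image until rendering is done. */
    VkPresentInfoKHR presentInfo = {
        .sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores    = &renderFinished,
        .swapchainCount     = 1,
        .pSwapchains        = &swapchain,
        .pImageIndices      = &imageIndex,
    };
    vkQueuePresentKHR(graphicsQueue, &presentInfo);
}
```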
With the above two race conditions brought under control, we can now safely loop around the sequence of steps 1-4 as many times as we like.
Well, almost. There is a slight subtlety in that the swapchain has N images (typically 3 or so) but so far we have only created a single semaphore for the presentation→render ordering, and a second single semaphore for the render→presentation ordering. Usually, however, we do not want to render and present a single image and then wait around for the presentation to be done before starting over, as that is a big waste of cycles on both the CPU and GPU sides.
As a side note, many Vulkan examples in tutorials do this by introducing a call to vkDeviceWaitIdle or vkQueueWaitIdle somewhere in their main loop. This is fine for learning Vulkan and its concepts but to get full performance we want to go further into allowing the CPU and the GPU to work concurrently.
One thing that we can do is to create enough semaphores that we have one for each frame that we wish to have “in flight” at any time, for each of the 2 required synchronization points. We can then use the i’th pair of semaphores for the i’th in-flight frame, and when we get past the N’th in-flight frame we loop back to the 0’th pair of semaphores in round-robin fashion.
This then allows us to get potentially N frames ahead of the GPU on the CPU timeline. This, unfortunately, opens up our next can of worms.
So far we have shown that using semaphores when enqueuing work for the GPU allows us to correctly order the work done on the GPU timeline. We have briefly mentioned that this does nothing to keep the CPU in sync with the GPU. As it stands right now the CPU is free to schedule as much work in advance as we like (assuming sufficient available resources). This has a couple of issues though:
- The more frames of work in advance the CPU schedules work for the GPU, the more resources we need to hold command buffers, semaphores etc. – not to mention the GPU resources to which the command buffers refer, such as buffers and textures. These GPU resources all have to be kept alive as long as any command buffers are referencing them.
- The second issue is that the further the CPU gets ahead of the GPU the further our simulation state gets ahead of what we see. That means, the more frames ahead we allow the CPU to get, the higher is our latency. Some latency can be good in that if we have a frame or two queued up already, a frame that then takes a bit longer to prepare can be absorbed unnoticed. However, too much latency and our application feels sluggish and unnatural to use as it takes too long for our input to be responded to and for us to see the results of that on screen.
It is therefore essential to have a good handle on our system’s latency, which in this case means the number of frames we allow to be “in flight” at any one time: that is, the number of frames’ worth of command buffers that have been submitted to the GPU queues or are still being recorded at any given moment. A common choice here is to allow 2-3 frames to be in flight at once. Bear in mind that this also depends upon other factors such as your display’s refresh rate. If you are running on a high refresh rate display at, say, 240Hz, then each frame is only around for a quarter of the time of a “standard” 60Hz display. If this is the case, you may wish to increase the number of frames in flight to compensate.
Let’s parameterise the maximum number of frames that the CPU can get ahead as MAX_FRAMES_IN_FLIGHT. From our discussions in the previous sections we know that if we keep the CPU from getting more than MAX_FRAMES_IN_FLIGHT frames ahead, then we will only need MAX_FRAMES_IN_FLIGHT semaphores for each use of a semaphore within a frame.
So now the question is how do we stop the CPU from racing ahead of the GPU? Specifically we need a way to make the CPU timeline wait until the GPU timeline indicates that it is done with processing a frame. In Vulkan, the answer to this is a fence (VkFence). Conceptually this is how we can structure a frame with fences to get the desired result (ignoring the use of semaphores for GPU↔GPU synchronization):
- In the application initialisation, create MAX_FRAMES_IN_FLIGHT fence objects in the signalled state.
- Force the CPU timeline to wait until the fence for this frame becomes signalled (vkWaitForFences). Since the fences are created in the signalled state, this returns immediately the first time each fence is used.
- Reset the fence to the unsignalled state so that we can wait for it again in the future (vkResetFences).
- Acquire the swapchain image index (as before).
- Record and submit the command buffers to perform the rendering for this frame. When it is time to submit the command buffers to the GPU queue, we can pass in the fence for this frame as the final argument to vkQueueSubmit. Just as with a semaphore, when the GPU queue finishes processing this command buffer submission, it will transition the fence to the signalled state.
- Issue a GPU command to present the completed swapchain image (as before).
- Go to step 2 and use the next fence (and set of semaphores).
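The fence-throttled part of the loop above might be sketched like this. As before, the handle names are hypothetical, the fences are assumed to have been created at initialisation with VK_FENCE_CREATE_SIGNALED_BIT, and error handling is omitted.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

#define MAX_FRAMES_IN_FLIGHT 2

extern VkDevice device;
extern VkQueue graphicsQueue;

/* One fence per in-flight frame, each created at initialisation with
   VK_FENCE_CREATE_SIGNALED_BIT so the first wait returns immediately. */
extern VkFence inFlight[MAX_FRAMES_IN_FLIGHT];

void submit_frame(uint32_t frame, const VkSubmitInfo *submitInfo)
{
    /* Step 2: block the CPU until the GPU has finished the work that
       was submitted with this fence MAX_FRAMES_IN_FLIGHT frames ago. */
    vkWaitForFences(device, 1, &inFlight[frame], VK_TRUE, UINT64_MAX);

    /* Step 3: return the fence to the unsignalled state so that it can
       be waited upon again. */
    vkResetFences(device, 1, &inFlight[frame]);

    /* Step 5: the GPU transitions the fence to the signalled state when
       this submission finishes executing. */
    vkQueueSubmit(graphicsQueue, 1, submitInfo, inFlight[frame]);
}
```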
With this approach, the CPU timeline can only get at most MAX_FRAMES_IN_FLIGHT frames ahead of the GPU before the call to vkWaitForFences in step 2 forces it to wait for the corresponding fence to become signalled by the GPU, which happens when the GPU completes the command buffer submission that went along with this fence.
Making use of both fences and semaphores allows us to nicely keep both the CPU and the GPU timelines making progress without races (between rendering and presentation) and without the CPU running away from us. These two synchronization primitives, fences and semaphores, solve similar but different problems:
- A VkFence is a synchronization primitive for keeping the CPU and GPU timelines in step (CPU↔GPU synchronization).
- A VkSemaphore is a synchronization primitive for ensuring the ordering of GPU tasks (GPU↔GPU synchronization).
It is also worth noting that a VkFence can also be queried as to its state from the CPU timeline rather than having to block until it becomes signalled (vkGetFenceStatus). This allows your application to peek and see if a fence is signalled or not. If it is not yet signalled, your application may be able to make more use of the available time to go do something more productive than just blocking as with vkWaitForFences. It all depends upon the design of your application.
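A minimal polling sketch, assuming a `device` and per-frame `frameFence` created elsewhere (hypothetical names):

```c
#include <vulkan/vulkan.h>

extern VkDevice device;
extern VkFence frameFence;

/* Returns 1 if the GPU has finished the work guarded by the fence,
   0 if it is still pending. */
int frame_is_done(void)
{
    /* vkGetFenceStatus returns VK_SUCCESS if signalled and
       VK_NOT_READY if not, without blocking the CPU. */
    return vkGetFenceStatus(device, frameFence) == VK_SUCCESS;
}
```

An application might call frame_is_done() in its main loop and, while it returns 0, spend the time on audio mixing, asset streaming or other CPU-side work instead of blocking.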
We have seen above how we can utilise fences and semaphores to make our Vulkan applications well-behaved. It is also worth mentioning that, as an application author, you should also consider your choice of swapchain presentation mode, because it can heavily impact how your application behaves and how many CPU/GPU cycles it uses. With OpenGL we would typically set up to have either:
- VSync enabled rendering for tear-free display, OR
- VSync disabled rendering that goes as fast as it can but will probably show some image tearing.
With Vulkan we can still get these configurations but there are also others that offer variations. As an example, VK_PRESENT_MODE_MAILBOX_KHR allows us to have tear-free display of the currently presented image (it is VSync enabled), while still letting our application render as fast as possible. Very briefly, the way this works is that when the presentation engine is displaying swapchain image 0, our calls to vkAcquireNextImageKHR will only return the other swapchain image indices. When we subsequently tell the GPU to present those images, it will happily take each image and replace any previously queued presentation submission. When the next vertical blank occurs, the presentation engine will actually show the most recently submitted swapchain image.
In this manner we can render to e.g. images 1 and 2 as many times as we like so that when the presentation engine moves along, it has the most up to date representation of our application’s state possible.
Depending upon which swapchain presentation mode you request, your application could be locked to the VSync frequency or not, which in turn can lead to large differences in how much of your available CPU and GPU resources are consumed. Are they out for a leisurely stroll (VSync enabled) or sprinting (VSync disabled or mailbox mode)?
Multiple Windows and Swapchains
All of the above examples have assumed we are working with a single window surface and a single swapchain. This is the common case for games but in desktop and embedded applications we may well have multiple windows or multiple screens or even multiple adapters. Vulkan, unlike OpenGL, is pretty flexible when it comes to threading. With some care, we can happily record the command buffers for different windows (swapchains) on different CPU threads. For swapchains sharing a common Vulkan device, we can even request them all to be presented in one function call rather than having to call the equivalent of swapbuffers on each of them sequentially. Once again, Vulkan and the WSI gives you the tools, it’s up to you how you utilise them.
A more recent addition to Vulkan, known as timeline semaphores (core in Vulkan 1.2, previously the VK_KHR_timeline_semaphore extension), allows applications to use this synchronization primitive as a combination of a traditional semaphore and a fence. A timeline semaphore can be used just like a traditional (binary) semaphore to order packets of GPU work correctly, but it may also be waited upon by the CPU timeline (vkWaitSemaphores). The CPU may also signal a timeline semaphore via a call to vkSignalSemaphore. If supported by your Vulkan version and driver, you can use timeline semaphores to simplify your synchronization mechanisms.
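A sketch of the CPU-side timeline semaphore operations, assuming a Vulkan 1.2 `device` (hypothetical name) and with error handling omitted; the specific counter values are arbitrary illustrations:

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

extern VkDevice device;

void timeline_sketch(void)
{
    /* Create a timeline semaphore with an initial counter value of 0. */
    VkSemaphoreTypeCreateInfo typeInfo = {
        .sType         = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue  = 0,
    };
    VkSemaphoreCreateInfo createInfo = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &typeInfo,
    };
    VkSemaphore timeline;
    vkCreateSemaphore(device, &createInfo, NULL, &timeline);

    /* The CPU can block until the GPU (or another thread) advances the
       counter to at least 1, much like waiting on a fence. */
    uint64_t waitValue = 1;
    VkSemaphoreWaitInfo waitInfo = {
        .sType          = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
        .semaphoreCount = 1,
        .pSemaphores    = &timeline,
        .pValues        = &waitValue,
    };
    vkWaitSemaphores(device, &waitInfo, UINT64_MAX);

    /* The CPU can also advance the counter itself to unblock GPU work. */
    VkSemaphoreSignalInfo signalInfo = {
        .sType     = VK_STRUCTURE_TYPE_SEMAPHORE_SIGNAL_INFO,
        .semaphore = timeline,
        .value     = 2,
    };
    vkSignalSemaphore(device, &signalInfo);
}
```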
Pipeline and Memory Barriers
This article has only concerned itself with the high-level or coarse synchronization requirements. Depending upon what you are doing inside of your command buffers you will also likely need to take care of synchronising access to various other resources. These include textures and buffers to ensure that different phases of your rendering are not trampling over each other. This is a large topic in itself and is covered in extreme detail by this article and the accompanying examples.
It’s up to you!
A lot of what the OpenGL driver used to manage for us, is now firmly on the plate of application and library developers who wish to make use of Vulkan or other explicit modern graphics APIs. Vulkan provides us with a plethora of tools but it is up to us to decide how to make best use of them and how to map them onto the requirements of our applications. I hope that this article has helped explain some of the considerations of synchronization that you need to keep in mind when you decide to take the next step from the tutorial examples and remove that magic call to vkDeviceWaitIdle.