State of Native Big File Handling in Git
After many struggles using Git LFS on repositories that need to store big files, I decided to spend some time checking the status of the built-in partial clone functionality, which could possibly achieve the same thing (as of git 2.30).
TL;DR: The building blocks are there, but server-side support is spotty and critical tooling is missing. It’s not very usable yet, but it’s getting there.
How partial clone works
Normally, when you clone a git repository, all file versions throughout the whole repository history are downloaded. If you have multiple revisions of multi-GB binary files, as we have in some projects, this becomes a problem.
Partial clone lets you download only a subset of the objects in a repository and defer downloading the rest until they are needed; most of the time, that means during checkout.
For example, to clone a repository with blobs in only the latest version of the default branch, you can do as follows:
```shell
git clone --filter=blob:none email@example.com:repo.git
```
The `--filter` part is crucial; it tells git which objects to include or omit. Throughout the git docs it's called a filter-spec. The specifics of how you can filter are documented in `git help rev-list`. You can filter based on blob size, location in the tree (slow! – guess why), or both.
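To make the filter-specs concrete, here is a self-contained sketch you can run locally. All paths are made up, and setting `uploadpack.allowFilter` on the source repo stands in for proper server-side support; the `file://` URL forces the real wire protocol, since filters are ignored in plain local clones.

```shell
set -e
tmp=$(mktemp -d)

# A throwaway repository standing in for the server, with one small
# and one large (2 MiB) file.
git init -q "$tmp/src"
echo small > "$tmp/src/small.txt"
head -c 2097152 /dev/zero > "$tmp/src/big.bin"
git -C "$tmp/src" add .
git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.com \
    commit -qm 'small and big file'

# The server side has to allow filters explicitly.
git -C "$tmp/src" config uploadpack.allowFilter true

# Partial clone: omit blobs over 1 MiB. --no-checkout keeps git from
# fetching the big blob right away, so we can see what's missing.
git clone -q --no-checkout --filter=blob:limit=1m \
    "file://$tmp/src" "$tmp/partial"

# Promised-but-missing objects are printed with a leading '?'.
git -C "$tmp/partial" rev-list --objects --missing=print HEAD | grep '^?'
```

Only the big blob shows up as missing; a subsequent `git checkout` in the partial clone would fetch it on demand.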
The remote you cloned from will be marked as a “promisor” remote, so called because it promises to fulfill requests for the missing objects later:
```
[remote "origin"]
	url = firstname.lastname@example.org:repo.git
	fetch = +refs/heads/*:refs/remotes/origin/*
	promisor = true
	partialclonefilter = blob:limit=1048576  # limited to 1M
```
As you change branches, the required files will be downloaded on-demand during checkout.
Below is a video of a partial checkout in action. Notice how the actual files are downloaded during the checkout operation, and not during clone:
I checked out the Linux kernel from the GitHub mirror, once as a regular clone and once as a partial clone, and recorded some statistics:
As you can see, there are tradeoffs. Checkout takes longer because the actual file content has to be downloaded rather than just copied from the local object store. In exchange, the initial clone is faster and the repository is smaller, because you're not storing copies of various driver sources deprecated since the late 90s. The gains would be even more pronounced in repositories that store multiple versions of big binary files. Think evolving game assets or CI system output.
So what are the problems?
Missing/incomplete/buggy server support
The server side needs to implement version 2 of the git wire protocol. Many servers don't yet, or do so only in a limited manner.
- GitLab – filter only by blob size
- GitHub – same as GitLab
- Gerrit – supported since version 3.1 (when git protocol v2 was added). You need to allow filters in the global JGit config:

```
[uploadpack]
	allowFilter = true
```
Unfortunately, we experienced some bugs when trying this out.
- BitBucket server – no support (https://jira.atlassian.com/browse/BSERV-11639)
- BitBucket cloud – no support (https://jira.atlassian.com/browse/BCLOUD-19847)
- Gitea – experimental support, requires neurosurgery on the internal repository folders (https://github.com/go-gitea/gitea/pull/12170)
No cleanup tool
As you check out new revisions and download their big files, lots of data from previous versions piles up locally, because it's not cleaned up automatically.
Git LFS has the `git lfs prune` command; nothing like it exists yet for partial clones.
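There is no supported way to prune the stale blobs yet, but you can at least measure the pile-up by listing all blobs in the local object store above some threshold. A sketch (the demo repo, the 1 MiB cutoff, and the paths are made up; in practice you'd run the `cat-file` pipeline inside your partial clone):

```shell
set -e
# Demo repo so the pipeline has something to chew on; in practice,
# run the cat-file pipeline below inside your partial clone instead.
tmp=$(mktemp -d)
git init -q "$tmp"
head -c 2097152 /dev/zero > "$tmp/big.bin"
git -C "$tmp" add big.bin
git -C "$tmp" -c user.name=demo -c user.email=demo@example.com commit -qm big

# List local blobs larger than 1 MiB, biggest first. Promised but
# missing objects never show up here, since they aren't stored locally.
git -C "$tmp" cat-file --batch-all-objects \
    --batch-check='%(objecttype) %(objectname) %(objectsize)' |
  awk '$1 == "blob" && $3 > 1048576 { print $3, $2 }' |
  sort -rn
```

This only tells you how much could be reclaimed; actually deleting promisor objects safely is exactly what the missing tooling would have to do.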
See this git mailing list thread.
No separate storage of big files on server side (yet)
Since you want server-side operations to happen quickly, it's best to store the git repository on very fast storage, which also happens to be expensive. It would be nice to store big files that don't really need fast operations separately (you won't run diffs on textures or sound files server-side).
Christian Couder of GitLab is working on something around this. It's already possible to have multiple promisor remotes, queried in succession. For example, there could be a separate promisor remote backed by a CDN or cloud storage (e.g. S3). However, servers will need to learn how to push the big objects there when users push their trees.
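On the client side, an additional promisor remote is just a couple of config entries. A sketch (the `cdn` remote name and URL are invented; any object missing locally can then be requested from it):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp"   # stand-in for an existing partial clone

# Register a second promisor remote; git consults promisor remotes
# in turn when it needs a missing object.
git -C "$tmp" remote add cdn https://cdn.example.com/repo.git
git -C "$tmp" config remote.cdn.promisor true
git -C "$tmp" config remote.cdn.partialclonefilter blob:limit=1m

git -C "$tmp" config --get remote.cdn.promisor   # prints "true"
```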
See this git mailing list thread.
Generally cumbersome UX
Since everything is fresh, you need to add some magical incantations to git commands to make it all work. Ideally, some “recommended” filter should be stored server-side, so that users don't have to come up with a filter-spec on their own when cloning.
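For the record, here is the incantation stack as it stands today, combined with sparse checkout so that only the needed directories are materialized. A self-contained sketch against a throwaway local repository (all names and paths are made up):

```shell
set -e
tmp=$(mktemp -d)

# Throwaway "server" repository with an assets directory.
git init -q "$tmp/src"
mkdir -p "$tmp/src/assets"
echo data > "$tmp/src/assets/a.txt"
git -C "$tmp/src" add .
git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.com \
    commit -qm 'add assets'
git -C "$tmp/src" config uploadpack.allowFilter true

# 1. Partial clone without blobs and without checking anything out.
git clone -q --filter=blob:none --no-checkout "file://$tmp/src" "$tmp/clone"
# 2. Restrict the working tree to the directories you actually need.
git -C "$tmp/clone" sparse-checkout init --cone
git -C "$tmp/clone" sparse-checkout set assets
# 3. Only now are the required blobs downloaded.
branch=$(git -C "$tmp/clone" symbolic-ref --short HEAD)
git -C "$tmp/clone" checkout "$branch"
```

That's four commands and a filter-spec just to get a working tree; a server-advertised default filter would remove most of this ceremony.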
Below are some useful links, if you’d like to learn more about partial cloning in git:
- GitLab blog
- Git man page about partial cloning
- GitLab documentation about partial cloning
Currently, a lot of effort around partial cloning is driven by Christian Couder of GitLab. You can follow some of the development under the following links:
- Cleanup tool: https://gitlab.com/gitlab-org/git/-/issues/10
- partial clone tag on GitLab’s git issue tracker
If you would like to learn Git, KDAB offers an introductory training class.
If you like this article and want to read similar material, consider subscribing via our RSS feed.
Subscribe to KDAB TV for similar informative short video content.
KDAB provides market leading software consulting and development services and training in Qt, C++ and 3D/OpenGL. Contact us.