State of Native Big File Handling in Git
After many struggles using Git LFS on repositories that need to store big files, I decided to spend some time checking the status of the built-in partial clone functionality, which could possibly achieve the same thing (as of git 2.30).
TL;DR: The building blocks are there, but server-side support is spotty and critical tooling is missing. It’s not very usable yet, but it’s getting there.
How partial clone works
Normally, when you clone a git repository, all file versions throughout the whole repository history are downloaded. If you have multiple revisions of multi-GB binary files, as we have in some projects, this becomes a problem.
Partial clone lets you download only a subset of the objects in a repository and defer downloading the rest until they are needed; most of the time, that means during checkout.
For example, to clone a repository with blobs in only the latest version of the default branch, you can do as follows:
```shell
git clone --filter=blob:none email@example.com:repo.git
```
The `--filter` part is crucial; it tells git which objects to include or omit. Throughout the git docs it's called a filter-spec. The specifics of how you can filter are documented in `git help rev-list`. You can filter based on blob size, location in the tree (slow! – guess why), or both.
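To make the filter-specs concrete, here is a self-contained sketch you can run locally. All paths are made up, and setting `uploadpack.allowFilter` on the source repo stands in for proper server-side support; the `file://` URL forces the real wire protocol, since filters are ignored in plain local clones.

```shell
set -e
tmp=$(mktemp -d)

# A throwaway repository standing in for the server, with one small
# and one large (2 MiB) file.
git init -q "$tmp/src"
echo small > "$tmp/src/small.txt"
head -c 2097152 /dev/zero > "$tmp/src/big.bin"
git -C "$tmp/src" add .
git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.com \
    commit -qm 'small and big file'

# The server side has to allow filters explicitly.
git -C "$tmp/src" config uploadpack.allowFilter true

# Partial clone: omit blobs over 1 MiB. --no-checkout keeps git from
# fetching the big blob right away, so we can see what's missing.
git clone -q --no-checkout --filter=blob:limit=1m \
    "file://$tmp/src" "$tmp/partial"

# Promised-but-missing objects are printed with a leading '?'.
git -C "$tmp/partial" rev-list --objects --missing=print HEAD | grep '^?'
```

Only the big blob shows up as missing; a subsequent `git checkout` in the partial clone would fetch it on demand.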
The remote you cloned from will be marked as a “promisor” remote, so called because it promises to fulfill requests for the missing objects later:
```
[remote "origin"]
	url = firstname.lastname@example.org:repo.git
	fetch = +refs/heads/*:refs/remotes/origin/*
	promisor = true
	partialclonefilter = blob:limit=1048576  # limited to 1M
```
As you change branches, the required files will be downloaded on-demand during checkout.
Below is a video of a partial checkout in action. Notice how the actual files are downloaded during the checkout operation, and not during clone:
I checked out the Linux kernel from the GitHub mirror, once as a regular clone and once as a partial clone, and recorded some statistics:
As you can see, there are tradeoffs. Checkout takes longer because the actual file content has to be downloaded rather than just copied from the local object store. In exchange, the initial clone is faster and the repository is smaller, because you're not storing copies of various driver sources deprecated since the late 90s. The gains would be even more pronounced in repositories that store multiple versions of big binary files. Think evolving game assets or CI system output.
So what are the problems?
Missing/incomplete/buggy server support
The server side needs to implement version 2 of the git wire protocol. Many servers don't yet, or do so only in a limited manner.
- GitLab – filter only by blob size
- GitHub – same as GitLab
- Gerrit – supported since version 3.1 (when git protocol v2 was added). You need to allow filters in the global JGit config:

```
[uploadpack]
	allowFilter = true
```
Unfortunately, we experienced some bugs when trying this out.
- BitBucket server – no support (https://jira.atlassian.com/browse/BSERV-11639)
- BitBucket cloud – no support (https://jira.atlassian.com/browse/BCLOUD-19847)
- Gitea – experimental support, requires neurosurgery on the internal repository folders (https://github.com/go-gitea/gitea/pull/12170)
No cleanup tool
As you check out new revisions and download their big files, lots of data from previous versions piles up locally, because it's not cleaned up automatically.
Git LFS has the `git lfs prune` command; nothing like it exists yet for partial clones.
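There is no supported way to prune the stale blobs yet, but you can at least measure the pile-up by listing all blobs in the local object store above some threshold. A sketch (the demo repo, the 1 MiB cutoff, and the paths are made up; in practice you'd run the `cat-file` pipeline inside your partial clone):

```shell
set -e
# Demo repo so the pipeline has something to chew on; in practice,
# run the cat-file pipeline below inside your partial clone instead.
tmp=$(mktemp -d)
git init -q "$tmp"
head -c 2097152 /dev/zero > "$tmp/big.bin"
git -C "$tmp" add big.bin
git -C "$tmp" -c user.name=demo -c user.email=demo@example.com commit -qm big

# List local blobs larger than 1 MiB, biggest first. Promised but
# missing objects never show up here, since they aren't stored locally.
git -C "$tmp" cat-file --batch-all-objects \
    --batch-check='%(objecttype) %(objectname) %(objectsize)' |
  awk '$1 == "blob" && $3 > 1048576 { print $3, $2 }' |
  sort -rn
```

This only tells you how much could be reclaimed; actually deleting promisor objects safely is exactly what the missing tooling would have to do.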
See this git mailing list thread.
No separate storage of big files on server side (yet)
Since you want server-side operations to happen quickly, it's best to store the git repository on very fast storage, which also happens to be expensive. It would be nice to store big files that don't really need fast operations separately (you won't run diffs on textures or sound files server-side).
Christian Couder of GitLab is working on something around this. It's already possible to have multiple promisor remotes, queried in succession. For example, there could be a separate promisor remote backed by a CDN or cloud storage (e.g. S3). However, servers will need to learn how to push the big objects there when users push their trees.
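On the client side, an additional promisor remote is just a couple of config entries. A sketch (the `cdn` remote name and URL are invented; any object missing locally can then be requested from it):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp"   # stand-in for an existing partial clone

# Register a second promisor remote; git consults promisor remotes
# in turn when it needs a missing object.
git -C "$tmp" remote add cdn https://cdn.example.com/repo.git
git -C "$tmp" config remote.cdn.promisor true
git -C "$tmp" config remote.cdn.partialclonefilter blob:limit=1m

git -C "$tmp" config --get remote.cdn.promisor   # prints "true"
```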
See this git mailing list thread.
Generally cumbersome UX
Since everything is fresh, you need to add some magical incantations to git commands to make it all work. Ideally, some “recommended” filter should be stored server-side, so that users don't have to come up with a filter-spec on their own when cloning.
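For the record, here is the incantation stack as it stands today, combined with sparse checkout so that only the needed directories are materialized. A self-contained sketch against a throwaway local repository (all names and paths are made up):

```shell
set -e
tmp=$(mktemp -d)

# Throwaway "server" repository with an assets directory.
git init -q "$tmp/src"
mkdir -p "$tmp/src/assets"
echo data > "$tmp/src/assets/a.txt"
git -C "$tmp/src" add .
git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.com \
    commit -qm 'add assets'
git -C "$tmp/src" config uploadpack.allowFilter true

# 1. Partial clone without blobs and without checking anything out.
git clone -q --filter=blob:none --no-checkout "file://$tmp/src" "$tmp/clone"
# 2. Restrict the working tree to the directories you actually need.
git -C "$tmp/clone" sparse-checkout init --cone
git -C "$tmp/clone" sparse-checkout set assets
# 3. Only now are the required blobs downloaded.
branch=$(git -C "$tmp/clone" symbolic-ref --short HEAD)
git -C "$tmp/clone" checkout "$branch"
```

That's four commands and a filter-spec just to get a working tree; a server-advertised default filter would remove most of this ceremony.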
Below are some useful links, if you’d like to learn more about partial cloning in git:
- GitLab blog
- Git man page about partial cloning
- GitLab documentation about partial cloning
Currently, a lot of effort around partial cloning is driven by Christian Couder of GitLab. You can follow some of the development under the following links:
- Cleanup tool: https://gitlab.com/gitlab-org/git/-/issues/10
- partial clone tag on GitLab’s git issue tracker
If you would like to learn Git, KDAB offers an introductory training class.
If you like this article and want to read similar material, consider subscribing via our RSS feed.
Subscribe to KDAB TV for similar informative short video content.
KDAB provides market leading software consulting and development services and training in Qt, C++ and 3D/OpenGL. Contact us.