Using Git LFS (Large File Storage) for data repositories¶
This page describes how to use Git LFS for DM development.
DM uses Git LFS to manage test datasets within our normal Git workflow. Git LFS is developed by GitHub, though DM uses its own backend storage infrastructure (see SQR-001: The Git LFS Architecture for background).
All DM repositories should use Git LFS to store sizeable binary data, such as FITS files, for CI. Examples of LFS-backed repositories are lsst/afwdata, lsst/testdata_ci_hsc, lsst/testdata_decam and lsst/testdata_cfht.
Installing Git LFS¶
In most Science Pipelines installations, including those in the Rubin Science Platform, git lfs is already installed as part of the rubin-env conda metapackage.
Otherwise, you can download and install the git-lfs client by visiting the Git LFS homepage.
Many package managers, like Homebrew on the Mac, also provide git-lfs (brew install git-lfs, for example).
We recommend using the latest Git LFS client. The minimum usable client version for Rubin Observatory is git-lfs 2.3.4.
Git LFS requires Git version 1.8.2 or later to be installed.
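One way to check an installed client against the minimum version above is sketched below. The helper name version_ge is hypothetical. The sketch assumes a sort that supports -V (GNU coreutils and recent BSD/macOS) and that the client's version output begins with git-lfs/ followed by the version number.

```shell
# version_ge A B: succeed when dotted version A is >= version B.
# Hypothetical helper; relies on `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v git-lfs >/dev/null 2>&1; then
    # Assumes output shaped like "git-lfs/<version> (...)".
    lfs_version=$(git-lfs version | sed -E 's|^git-lfs/([0-9.]+).*|\1|')
    if version_ge "$lfs_version" "2.3.4"; then
        echo "git-lfs $lfs_version meets the 2.3.4 minimum"
    else
        echo "git-lfs $lfs_version is older than 2.3.4; please upgrade"
    fi
else
    echo "git-lfs is not installed"
fi
```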
Before you can use Git LFS with Rubin Observatory data you’ll need to configure it by following the next section.
Configuring Git LFS¶
Basic configuration¶
Run the following command once on every machine on which you plan to read from or write to Rubin Observatory LFS repos. It adds a filter "lfs" section to ~/.gitconfig:
git lfs install
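To confirm that the command took effect, you can look for the new section in your Git configuration. The sketch below is a rough check only: the helper name has_lfs_filter is hypothetical, and it assumes your global configuration lives in ~/.gitconfig rather than in another location such as an XDG config directory.

```shell
# has_lfs_filter [FILE]: succeed when FILE (default ~/.gitconfig)
# contains a [filter "lfs"] section header. Hypothetical helper.
has_lfs_filter() {
    grep -q '^[[:space:]]*\[filter "lfs"]' "${1:-$HOME/.gitconfig}" 2>/dev/null
}

if has_lfs_filter; then
    echo 'filter "lfs" section found'
else
    echo 'filter "lfs" section not found; run: git lfs install'
fi
```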
Configuration for Rubin Observatory¶
Read-Only¶
You’re done. The git lfs install command that you just ran will allow you to access everything in Large File Storage.
Try cloning a small data repository to test your configuration:
git clone https://github.com/lsst/testdata_subaru
If the resulting new directory is about 220 MB in size, as measured by du -sh testdata_subaru, you are correctly configured for Git LFS use.
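If the directory is much smaller than expected, the working tree probably contains small Git LFS pointer files rather than the real data. A pointer file is a short text file whose first line is version https://git-lfs.github.com/spec/v1. The sketch below lists any such files in the clone; the helper name is_lfs_pointer is hypothetical, and the sketch assumes a head that supports -c.

```shell
# is_lfs_pointer FILE: succeed when FILE starts with the Git LFS
# pointer header instead of real data. Hypothetical helper.
is_lfs_pointer() {
    head -c 42 "$1" 2>/dev/null |
        grep -qxF 'version https://git-lfs.github.com/spec/v1'
}

# List any files in the clone that are still pointers.
if [ -d testdata_subaru ]; then
    find testdata_subaru -type f ! -path '*/.git/*' |
        while IFS= read -r f; do
            if is_lfs_pointer "$f"; then echo "still a pointer: $f"; fi
        done
fi
```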
If you are a developer who will need to update those files, read on.
Read-Write¶
This section describes how to configure Git LFS to write to the Rubin Observatory Large File Storage repositories.
You will first need to acquire a token from Roundtable. Go to https://roundtable.lsst.cloud/auth/tokens and request a token with scope write:git-lfs. Best practice is to request a token with a finite lifetime; if you ask for one that never expires, on your own conscience be it.
Copy that token, because this is the only time Gafaelfawr will show it to you, and you will need it to push content.
Next, add these lines to your ~/.gitconfig file:
# Cache auth for write access to DM Git LFS
[credential "https://git-lfs-rw.lsst.cloud"]
helper = store
Then edit your ~/.git-credentials file (create one, if necessary). Add a line:
https://<username>:<token>@git-lfs-rw.lsst.cloud
where <username> is the username you used to authenticate to Roundtable, and <token> is the token with write:git-lfs scope that you just acquired.
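Because ~/.git-credentials holds the token in plain text, it is worth making sure the file is readable only by you. The following is a minimal sketch of appending the credential line with restrictive permissions. The helper name add_lfs_credential is hypothetical, and the sketch assumes the username and token contain no characters that would need URL-encoding.

```shell
# add_lfs_credential USERNAME TOKEN [FILE]
# Append the credential line for git-lfs-rw.lsst.cloud to FILE
# (default ~/.git-credentials) and restrict the file to its owner.
# Hypothetical helper; assumes USERNAME and TOKEN contain no characters
# that need URL-encoding (such as "@", ":" or "/").
add_lfs_credential() {
    cred_file=${3:-$HOME/.git-credentials}
    (
        umask 077   # a newly created file is private from the start
        printf 'https://%s:%s@git-lfs-rw.lsst.cloud\n' "$1" "$2" >> "$cred_file"
    )
    chmod 600 "$cred_file"
}

# Example for bash (reads the token without echoing it to the terminal):
#   read -rs -p 'Token: ' token && add_lfs_credential <username> "$token"
```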
Authenticating for push access¶
If you want to push LFS-backed files to a Rubin Observatory Git LFS-backed repository you’ll need to configure and cache your credentials, as described at Read-Write.
For each repository you intend to push to, there is a one-time setup process you must do when you clone it.
Clone the repository, cd into it, and update the Git LFS URL to use the read-write URL for that repository, which is https://git-lfs-rw.lsst.cloud/ followed by the last two components of the repository URL (that is, the organization and repository name).
For instance, if you were working with https://github.com/lsst/testdata_subaru, you would type:
git clone https://github.com/lsst/testdata_subaru
cd testdata_subaru
git config lfs.url https://git-lfs-rw.lsst.cloud/lsst/testdata_subaru
git config lfs.locksverify false
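If you set up several repositories, the read-write URL can be derived from the clone URL instead of typed by hand. The sketch below implements the "last two components" rule described above; the function name lfs_rw_url is hypothetical, and handling of .git suffixes and SSH-style remotes is an added convenience rather than something this guide requires.

```shell
# lfs_rw_url REMOTE_URL: print the read-write LFS URL built from the
# organization and repository name of REMOTE_URL. Hypothetical helper.
# The body runs in a subshell so its variables do not leak.
lfs_rw_url() (
    url=${1%/}            # drop a trailing slash, if any
    url=${url%.git}       # drop a ".git" suffix, if any
    repo=${url##*/}       # last component: repository name
    rest=${url%/*}
    org=${rest##*[/:]}    # second-to-last component: organization
    printf 'https://git-lfs-rw.lsst.cloud/%s/%s\n' "$org" "$repo"
)

lfs_rw_url https://github.com/lsst/testdata_subaru
# prints https://git-lfs-rw.lsst.cloud/lsst/testdata_subaru

# Inside a clone, this could be combined with git config:
#   git config lfs.url "$(lfs_rw_url "$(git remote get-url origin)")"
```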
Migrating Git LFS¶
If you were already using Git LFS with https://git-lfs.lsst.codes, you will find that pulling a repository you had previously been using fails. This is because the .lfsconfig in that repository still references the old LFS URL, which no longer serves content.
In that case you need to run:
git config lfs.url https://git-lfs.lsst.cloud/<owner>/<repo>
and then try the pull again.
If you know you are going to need to push LFS objects, you can use git-lfs-rw here instead, as described above, or you can configure that later, if and when you need it.
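If you keep many clones side by side, a quick scan can show which ones still carry the old URL in their .lfsconfig. This is a sketch only: the helper name uses_old_lfs_url is hypothetical, and it assumes your clones are direct subdirectories of the current directory.

```shell
# uses_old_lfs_url [REPO_DIR]: succeed when REPO_DIR/.lfsconfig still
# mentions the retired git-lfs.lsst.codes host. Hypothetical helper.
uses_old_lfs_url() {
    grep -q 'git-lfs\.lsst\.codes' "${1:-.}/.lfsconfig" 2>/dev/null
}

# Report which clones under the current directory need the fix.
for repo in */; do
    if uses_old_lfs_url "$repo"; then
        echo "needs lfs.url update: ${repo%/}"
    fi
done
```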
Checking out historical commits¶
If you want to check out a historical commit, you first need to know that arbitrary commits are no longer available. When we migrated from git-lfs.lsst.codes to git-lfs.lsst.cloud, we only migrated LFS objects that were at the tip of the main branch or a release branch (one whose name begins with v followed by a digit), or that were referenced in a Git tag.
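The branch-name part of these criteria can be expressed as a simple pattern match. The sketch below is illustrative only: the function name is_release_branch and the sample branch names are hypothetical, and it does not check tags or whether a commit is actually at a branch tip.

```shell
# is_release_branch NAME: succeed when NAME begins with "v" followed
# by a digit, the release-branch rule stated above. Hypothetical helper.
is_release_branch() {
    case $1 in
        v[0-9]*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Sample names are made up for illustration.
for name in main v26.0.0 v1 tickets/DM-12345 vnext; do
    if [ "$name" = main ] || is_release_branch "$name"; then
        echo "$name: branch tip was eligible for migration"
    else
        echo "$name: branch tip not covered (unless tagged)"
    fi
done
```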
If your proposed checkout meets these criteria, you will next find that the LFS object fetch fails, because only recent commits reference git-lfs.lsst.cloud rather than git-lfs.lsst.codes, and the checkout will reset .lfsconfig to its old value. In that case, do the following.
Attempt the checkout as normal. It will fail when it starts to smudge any files that differ from the previous checkout.
Next, edit .lfsconfig to reference https://git-lfs-rw.lsst.cloud/<org>/<repo> rather than https://git-lfs.lsst.codes; you can do this either by simply editing the file, or with git config lfs.url https://git-lfs-rw.lsst.cloud/<org>/<repo>.
Finally, execute git lfs fetch to download the LFS objects.
Using Git LFS-enabled repositories¶
Git LFS operates transparently to the user. Just use the repo as you normally would any other Git repo. All of the regular Git commands just work, whether you are working with LFS-managed files or not.
There are three caveats for working with LFS: HTTPS is always used, Git LFS must be told to track new binary file types, and you usually need enough memory to hold the largest file.
First, DM’s LFS implementation mandates the HTTPS transport protocol. Developers used to working with ssh-agent for passwordless GitHub interaction should use a Git credential helper, and follow the directions above for configuring their credentials.
Note that this does not preclude using git+git or git+ssh for working with a Git remote itself; it is only the LFS traffic that always uses HTTPS.
Second, in an LFS-backed repository, you need to specify which files are stored by LFS rather than regular Git storage. You can run git lfs track to see what file types are being tracked by LFS in your repository. We describe how to track additional file types below.
Third, when cloning or fetching files in an LFS-backed repository, the git internals will expand each file into memory before writing it. This can be a problem on notebook servers configured with smaller memories. On these small servers, you can use the following workaround:
GIT_LFS_SKIP_SMUDGE=1 git clone <url>
cd <dir>
git lfs fetch
This works by skipping the automatic extraction by git and then manually extracting the files using git lfs, which does not have the same memory constraints.
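As a sketch, the workaround can be wrapped in a small function. The function name clone_low_memory is hypothetical. The final git lfs checkout step is an addition to the steps above: as we understand it, git lfs fetch downloads objects into the repository's local LFS storage, and git lfs checkout then replaces any pointer files left in the working tree with the downloaded content. If the files are already populated after the fetch, that step should be a harmless no-op.

```shell
# clone_low_memory URL DIR: clone without smudging, then download and
# populate LFS files with git-lfs itself. Hypothetical helper.
# The body runs in a subshell so the caller's directory is unchanged.
clone_low_memory() (
    GIT_LFS_SKIP_SMUDGE=1 git clone "$1" "$2" &&
        cd "$2" &&
        git lfs fetch &&
        git lfs checkout
)

# Example:
#   clone_low_memory https://github.com/lsst/testdata_subaru testdata_subaru
```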
Tracking new file types¶
Only file types that are specifically tracked are stored in Git LFS rather than the standard Git storage.
To see what file types are already being tracked in a repository:
git lfs track
To track a new file type (FITS files, for example):
git lfs track "*.fits"
Git LFS stores information about tracked types in the .gitattributes file.
This file is part of the repo and tracked by Git itself.
You can git add, commit and do any other Git operations against these Git LFS-managed files.
To see what files are being managed by Git LFS, run:
git lfs ls-files
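For reference, git lfs track "*.fits" records the pattern in .gitattributes as a line of the form *.fits filter=lfs diff=lfs merge=lfs -text. The sketch below reads the tracked patterns straight from that file, which can be handy in scripts where git-lfs is not available; the function name lfs_tracked_patterns is hypothetical.

```shell
# lfs_tracked_patterns [FILE]: print the patterns in FILE (default
# .gitattributes) that carry the filter=lfs attribute. Hypothetical helper.
lfs_tracked_patterns() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i == "filter=lfs") { print $1; break }
        }
    }' "${1:-.gitattributes}"
}

# Example: run from the top of a repository.
if [ -f .gitattributes ]; then
    lfs_tracked_patterns
fi
```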
Creating a new Git LFS-enabled repository¶
Configuring a new Git repository to store files with DM’s Git LFS is easy. First, initialize the current directory as a repository:
git init .
Make a file called .lfsconfig within the repository, and write these lines into it:
[lfs]
url = https://git-lfs.lsst.cloud
locksverify = false
Next, track some file types.
For example, to have FITS and *.gz files tracked by Git LFS:
git lfs track "*.fits"
git lfs track "*.gz"
Add and commit the .lfsconfig and .gitattributes files to your repository.
Add the remote repository that you’re going to push to.
git remote add origin <remote repository URL>
Configure your copy to have LFS write access, because the .lfsconfig you’re pushing has the read-only URL in it.
git config lfs.url https://git-lfs-rw.lsst.cloud/<org>/<repo_name>
git config lfs.locksverify false
You can then push the repo up to GitHub with
git push origin main
In the repository’s README.md, we recommend that you include this section:
Git LFS
-------
To clone and use this repository, you'll need Git Large File Storage (LFS).
Our [Developer Guide](https://developer.lsst.io/tools/git_lfs.html)
explains how to set up Git LFS for Rubin Observatory development.