Using Git Large File Storage (LFS) for Data Repositories
DM uses Git LFS to manage test datasets within our normal Git workflow. Git LFS is developed by GitHub, though DM uses its own backend storage infrastructure (see SQR-001: The Git LFS Architecture for background).
All DM repositories should use Git LFS to store binary data, such as FITS files, for CI. Examples of LFS-backed repositories are lsst/afw, lsst/hsc_ci, lsst/testdata_decam and lsst/testdata_cfht.
This page describes how to use Git LFS for DM development.
Installing Git LFS
Git LFS requires Git version 1.8.2 or later to be installed.
Download and install the git-lfs client by visiting the Git LFS homepage.
If you downloaded the binary release, install git-lfs by running the provided install.sh.
The LSST Git LFS system requires git-lfs client version 1.1.0 or later.
We recommend using the current stable git-lfs client version.
Newer client versions support the more efficient batch API and include many bug fixes and performance improvements.
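The version requirements above can be checked from a shell. The following is a minimal sketch, not part of DM's tooling: version_ge, installed_git_version and installed_lfs_version are hypothetical helper names. It assumes that git --version prints "git version X.Y.Z", that git lfs version prints a string beginning with "git-lfs/X.Y.Z", and that only the first three numeric components matter.

```shell
# Hypothetical helper: succeed when dotted version $1 is >= dotted version $2.
# Only the first three components are compared; missing ones count as 0.
version_ge() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 > w[i] + 0) exit 0
            if (h[i] + 0 < w[i] + 0) exit 1
        }
        exit 0
    }'
}

# Print the installed Git version (assumes "git version X.Y.Z ..." output).
installed_git_version() {
    git --version | awk '{print $3}'
}

# Print the installed git-lfs client version (assumes "git-lfs/X.Y.Z ..." output).
installed_lfs_version() {
    git lfs version | sed -E 's#^git-lfs/([0-9.]+).*#\1#'
}

# Usage:
#   version_ge "$(installed_git_version)" 1.8.2 || echo "Git is older than 1.8.2" >&2
#   version_ge "$(installed_lfs_version)" 1.1.0 || echo "git-lfs is older than 1.1.0" >&2
```

Comparing the components numerically avoids the trap of a plain string comparison, which would rank 1.10.0 below 1.9.9.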
Most package managers also provide the git-lfs client.
Since LFS is a rapidly evolving technology, package managers will help you keep up with new git-lfs releases.
For example, Mac users with Homebrew can simply run brew install git-lfs (and later update to a new version with brew upgrade git-lfs).
Once git-lfs is installed, run:
git lfs install
to configure Git to use Git LFS in your ~/.gitconfig file.
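To confirm that the command took effect, you can look for the filter settings that git lfs install writes. The sketch below assumes a git-lfs client that records its clean and smudge commands under a filter named lfs; lfs_filter_configured is a hypothetical helper name, and the optional path argument is an assumption added so that the check can be pointed at a file other than ~/.gitconfig.

```shell
# Hypothetical helper: succeed when the given Git config file
# (default: ~/.gitconfig) defines both LFS filter commands.
lfs_filter_configured() {
    config_file=${1:-$HOME/.gitconfig}
    git config --file "$config_file" --get filter.lfs.clean >/dev/null 2>&1 &&
        git config --file "$config_file" --get filter.lfs.smudge >/dev/null 2>&1
}

# Usage:
#   lfs_filter_configured || echo "run 'git lfs install' first" >&2
```

If your Git configuration lives somewhere other than ~/.gitconfig, pass that path as the first argument.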
Next, decide whether you will need to push Git LFS data, or only clone and pull from Git LFS managed repositories. This affects how you set up authentication to DM’s Git LFS servers. The two configuration options are:
Option 1: Anonymous access for read-only LFS users
Follow these configuration instructions if you never intend to create a new Git LFS managed repository for DM, or push changes to LFS managed datasets. Skip to configuration Option 2 if this isn’t the case for you.
First, add these lines to your ~/.gitconfig file:
# Cache anonymous access to DM Git LFS S3 servers
[credential "https://lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com"]
helper = store
[credential "https://s3.lsst.codes"]
helper = store
Then add these lines to your ~/.git-credentials file (create it if necessary):
https://:@lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com
https://:@s3.lsst.codes
That’s it. You’re ready to clone any of DM’s Git LFS managed repositories.
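The two file edits for anonymous access can also be scripted. This is a sketch rather than an official DM tool: setup_anonymous_lfs is a hypothetical function name, and the optional directory argument (defaulting to your home directory) is an assumption added so that the function can be tried against a scratch directory first. It writes the credential sections with git config and appends each credentials line only if it is not already present, so running it twice is harmless.

```shell
# Hypothetical helper: configure anonymous access to DM's Git LFS S3 servers.
setup_anonymous_lfs() {
    home_dir=${1:-$HOME}
    touch "$home_dir/.git-credentials"
    for host in lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com s3.lsst.codes; do
        # Writes the section: [credential "https://<host>"] helper = store
        git config --file "$home_dir/.gitconfig" \
            "credential.https://$host.helper" store
        # An empty username and password stand for anonymous access.
        line="https://:@$host"
        grep -qxF "$line" "$home_dir/.git-credentials" ||
            printf '%s\n' "$line" >> "$home_dir/.git-credentials"
    done
    # Keep the credentials file readable only by you.
    chmod 600 "$home_dir/.git-credentials"
}

# Usage:
#   setup_anonymous_lfs            # edits files under $HOME
#   setup_anonymous_lfs /tmp/try   # dry run against a scratch directory
```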
Option 2: Authenticated access for read-write LFS users
Follow these configuration instructions if you need to create or push changes to a DM Git LFS managed repository. Only GitHub users in the LSST GitHub organization can authenticate with DM’s storage service. If you only want read-only access to DM’s Git LFS managed repositories, return to Option 1.
First, add these lines to your ~/.gitconfig file:
# Cache anonymous access to DM Git LFS S3 servers
[credential "https://lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com"]
helper = store
[credential "https://s3.lsst.codes"]
helper = store
Then add these lines to your ~/.git-credentials file (create it if necessary):
https://:@lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com
https://:@s3.lsst.codes
Next, set up a credential helper to manage your GitHub credentials (Git LFS won’t use your SSH keys). We describe how to set up a credential helper for your system in the Git set up guide.
Once a helper is set up, you can cache your credentials by cloning any of DM’s LFS-backed repositories. For example, run:
git clone https://github.com/lsst/testdata_decam.git
git clone will ask you to authenticate with DM’s git-lfs server:
Username for 'https://git-lfs.lsst.codes': <GitHub username>
Password for 'https://<git>@git-lfs.lsst.codes': <GitHub password>
At the prompts, enter your GitHub username and password.
If you have GitHub’s two-factor authentication enabled, use a personal access token instead of a password. You can set up a personal token at https://github.com/settings/tokens.
Once your credentials are cached, you won’t need to repeat this process on your system (unless you opted for the cache-based credential helper, which keeps credentials in memory only for a limited time).
That’s it. Read the rest of this page to learn how to work with Git LFS repositories.
Note
Legacy Git LFS client differences
Follow these configuration instructions if you are using a Git LFS client older than version 1.3.
The legacy Git LFS client has two configuration differences. First, your ~/.gitconfig file must have an lfs section that sets the batch option to false. Second, your ~/.git-credentials and ~/.gitconfig files must include credentials for the git-lfs server.
If you are using a legacy Git LFS client, add these lines to your ~/.gitconfig file:
[lfs]
batch = false
# Cache anonymous access to DM Git LFS S3 servers
[credential "https://lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com"]
helper = store
[credential "https://s3.lsst.codes"]
helper = store
# Cache anonymous access to DM Git LFS server
[credential "https://git-lfs.lsst.codes"]
helper = store
Then add these lines to your ~/.git-credentials file (create it if necessary):
https://:@lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com
https://:@s3.lsst.codes
https://:@git-lfs.lsst.codes
Using Git LFS-enabled repositories
Git LFS operates transparently to the user. Just use the repo as you normally would any other Git repo. All of the regular Git commands just work, whether you are working with LFS-managed files or not.
There are two caveats for working with LFS: HTTPS is always used, and Git LFS must be told to track new binary file types.
First, DM’s LFS implementation mandates the HTTPS transport protocol. Developers used to working with ssh-agent for passwordless GitHub interaction should use a Git credential helper, and follow the directions above for configuring their credentials.
Note that this does not preclude using git+git or git+ssh for working with a Git remote itself; it is only the LFS traffic that always uses HTTPS.
Second, in an LFS-backed repository, you need to specify what files are stored by LFS rather than regular Git storage. You can run
git lfs track
to see what file types are being tracked by LFS in your repository. We describe how to track additional file types below.
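To illustrate the first caveat, the LFS endpoint of a repository can be inspected independently of the Git remote. The sketch below reads the lfs.url setting from the repository's .lfsconfig file and reports whether it is an HTTPS URL. lfs_url_is_https is a hypothetical helper name, and the sketch assumes the repository keeps its endpoint in .lfsconfig, as DM's LFS-backed repositories are set up to do.

```shell
# Hypothetical helper: succeed when the repository in $1 (default: the
# current directory) declares an HTTPS LFS endpoint in its .lfsconfig.
lfs_url_is_https() {
    repo_dir=${1:-.}
    url=$(git config --file "$repo_dir/.lfsconfig" --get lfs.url 2>/dev/null) || return 1
    case $url in
        https://*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage, from inside a clone made over SSH or HTTPS:
#   lfs_url_is_https && echo "LFS traffic goes over HTTPS"
```

On a machine with the git-lfs client installed, git lfs env also prints the endpoint that the client will use.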
Tracking new file types
Only file types that are specifically tracked are stored in Git LFS rather than the standard Git storage.
To see what file types are already being tracked in a repository:
git lfs track
To track a new file type (FITS files, for example):
git lfs track "*.fits"
Git LFS stores information about tracked types in the .gitattributes file.
This file is part of the repo and tracked by Git itself.
You can git add, commit and do any other Git operations against these Git LFS-managed files.
To see what files are being managed by Git LFS, run:
git lfs ls-files
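For reference, each git lfs track call adds one line to .gitattributes that routes the pattern through the lfs filter. The sketch below is not part of Git LFS: it lists the tracked patterns by reading that file directly, which can be handy on a machine where the git-lfs client is not installed. lfs_tracked_patterns is a hypothetical helper name, and the sketch assumes that patterns contain no spaces.

```shell
# A line written by: git lfs track "*.fits"
#   *.fits filter=lfs diff=lfs merge=lfs -text

# Hypothetical helper: print every pattern in a .gitattributes file
# (default: ./.gitattributes) that is assigned the lfs filter.
lfs_tracked_patterns() {
    attributes_file=${1:-.gitattributes}
    # Skip comment lines, then look for a filter=lfs attribute on the line.
    awk '$1 !~ /^#/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "filter=lfs") { print $1; break }
        }
    }' "$attributes_file"
}

# Usage:
#   lfs_tracked_patterns path/to/repo/.gitattributes
```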
Creating a new Git LFS-enabled repository
Configuring a new Git repository to store files with DM’s Git LFS is easy. First, initialize the current directory as a repository:
git init .
Make a file called .lfsconfig within the repository, and write these lines into it:
[lfs]
url = https://git-lfs.lsst.codes
Note that older versions of Git LFS used a .gitconfig file in the repository rather than .lfsconfig.
As of Git LFS version 1.1, .gitconfig has been deprecated, but support will not be dropped until LFS version 2.
Next, track some file types.
For example, to have FITS and *.gz files tracked by Git LFS, run:
git lfs track "*.fits"
git lfs track "*.gz"
Add and commit the .lfsconfig and .gitattributes files to your repository.
You can then push the repo up to GitHub with:
git remote add origin <remote repository URL>
git push origin master
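Taken together, the steps above could be scripted roughly as follows. This is a sketch under assumptions, not an official DM script: init_lfs_repo is a hypothetical function name, the directory argument is illustrative, and the function stops before committing so that you can review the staged files. It writes .lfsconfig with git config --file, which produces the same [lfs] section shown above, and it requires the git-lfs client to be installed.

```shell
# Hypothetical helper: create a repository in $1 that stores FITS and
# *.gz files in DM's Git LFS, and stage the two configuration files.
init_lfs_repo() {
    repo_dir=$1
    mkdir -p "$repo_dir" || return 1
    (
        cd "$repo_dir" || exit 1
        git init . &&
            git config --file .lfsconfig lfs.url https://git-lfs.lsst.codes &&
            git lfs track "*.fits" "*.gz" &&
            git add .lfsconfig .gitattributes
    )
}

# Usage (the directory name is illustrative; supply your own remote URL):
#   init_lfs_repo my_testdata
#   cd my_testdata
#   git commit -m "Configure Git LFS"
#   git remote add origin <remote repository URL>
#   git push origin master
```

Running the body in a subshell keeps the cd from changing the caller's working directory.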
We also recommend that you link to this documentation page from the repository’s README to help those who aren’t familiar with DM’s Git LFS.
For example, you can include this section:
Git LFS
-------
To clone and use this repository, you'll need Git Large File Storage (LFS).
Our [Developer Guide](https://developer.lsst.io/tools/git_lfs.html)
explains how to set up Git LFS for LSST development.