Using Git Large File Storage (LFS) for Data Repositories

DM uses Git LFS to manage test datasets within our normal Git workflow. Git LFS is developed by GitHub, though DM uses its own backend storage infrastructure (see SQR-001: The Git LFS Architecture for background).

All DM repositories should use Git LFS to store binary data, such as FITS files, for CI. Examples of LFS-backed repositories are lsst/afw, lsst/hsc_ci, lsst/testdata_decam and lsst/testdata_cfht.

This page describes how to use Git LFS for DM development.

Installing Git LFS

Git LFS requires Git version 1.8.2 or later.
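
To check an existing Git installation against that minimum, a small comparison helper can be used. This is a minimal sketch: version_ge is a hypothetical helper name (not part of Git), and it compares only the first three numeric fields of the version string.

```shell
# version_ge HAVE WANT: succeed if dotted version HAVE is at least WANT.
# Hypothetical helper (not part of Git); compares the first three numeric fields.
version_ge() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 > w[i] + 0) exit 0
            if (h[i] + 0 < w[i] + 0) exit 1
        }
        exit 0
    }'
}

# Report whether the Git found on PATH meets the 1.8.2 minimum.
if command -v git >/dev/null 2>&1; then
    installed=$(git --version | awk '{print $3}')
    if version_ge "$installed" 1.8.2; then
        echo "Git $installed is new enough for Git LFS"
    else
        echo "Git $installed is older than 1.8.2; upgrade before installing git-lfs"
    fi
else
    echo "git was not found on PATH"
fi
```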

Download and install the git-lfs client by visiting the Git LFS homepage. If you downloaded the binary release, install git-lfs by running the provided install.sh.

Most package managers also provide the git-lfs client. Since LFS is a rapidly evolving technology, a package manager will help you keep up with new git-lfs releases. For example, Mac users with Homebrew can simply run brew install git-lfs and, later, brew upgrade git-lfs.

Once git-lfs is installed, run:

git config --global lfs.batch false
git lfs install

to configure Git to use Git LFS in your ~/.gitconfig file.

Next, decide whether you will need to push Git LFS data, or only clone and pull from Git LFS managed repositories. This affects how you set up authentication to DM’s Git LFS servers. The two configuration options are:

  1. Anonymous access for read-only LFS users.
  2. Authenticated access for read-write LFS users.

Option 1: Anonymous access for read-only LFS users

Follow these configuration instructions if you never intend to create a new Git LFS managed repository for DM, or push changes to LFS managed datasets. Skip to configuration Option 2 if this isn’t the case for you.

First, add these lines into your ~/.gitconfig file:

# Cache anonymous access to DM Git LFS S3 servers
[credential "https://lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com"]
    helper = store
[credential "https://s3.lsst.codes"]
    helper = store

# Cache anonymous access to DM Git LFS server
[credential "https://git-lfs.lsst.codes"]
    helper = store

Then add these lines into your ~/.git-credentials file (create one, if necessary):

https://:@lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com
https://:@s3.lsst.codes
https://:@git-lfs.lsst.codes
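
If you prefer to script this step, the entries above can be appended without creating duplicates. The following is a minimal sketch, not an official DM tool: add_line_once and write_anonymous_credentials are hypothetical helper names, and the file path is passed in as an argument so nothing is modified until you call the function yourself.

```shell
# add_line_once FILE LINE: append LINE to FILE unless that exact line is already there.
add_line_once() {
    touch "$1"
    grep -qxF -e "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# write_anonymous_credentials FILE: add the three empty-credential entries listed above.
# Hypothetical helper; FILE would normally be ~/.git-credentials.
write_anonymous_credentials() {
    for host in \
        lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com \
        s3.lsst.codes \
        git-lfs.lsst.codes
    do
        add_line_once "$1" "https://:@$host"
    done
}
```

Calling write_anonymous_credentials "$HOME/.git-credentials" applies it to your own account. Running it a second time leaves the file unchanged.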

That’s it. You’re ready to clone any of DM’s Git LFS managed repositories.

Option 2: Authenticated access for read-write LFS users

Follow these configuration instructions if you need to create or push changes to a DM Git LFS managed repository. Only GitHub users in the LSST GitHub organization can authenticate with DM’s storage service. If you only want read-only access to DM’s Git LFS managed repositories, return to Option 1.

First, add these lines into your ~/.gitconfig file:

# Cache anonymous access to DM Git LFS S3 servers
[credential "https://lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com"]
    helper = store
[credential "https://s3.lsst.codes"]
    helper = store

Then add these lines into your ~/.git-credentials file (create one, if necessary):

https://:@lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com
https://:@s3.lsst.codes
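
As an alternative to editing ~/.gitconfig by hand, the two [credential] sections shown above can be written with git config, which accepts a URL as the subsection of a credential key. This is a sketch under that assumption; write_s3_helpers is a hypothetical helper name, and the configuration file is passed as an argument so you can try it on a scratch file first.

```shell
# write_s3_helpers CONFIG: register the "store" helper for the two S3 hosts in CONFIG.
# Hypothetical helper; CONFIG would normally be ~/.gitconfig.
write_s3_helpers() {
    for url in \
        https://lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com \
        https://s3.lsst.codes
    do
        git config --file "$1" "credential.$url.helper" store
    done
}
```

For example, write_s3_helpers "$HOME/.gitconfig" should produce the same sections as the manual edit.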

Next, set up a credential helper to manage your GitHub credentials (Git LFS won’t use your SSH keys). We describe how to set up a credential helper for your system in the Git set up guide.
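
As a rough illustration of what that setup involves, the sketch below suggests a helper based on the operating system. The Git set up guide remains the authoritative reference; suggest_credential_helper is a hypothetical name, and the two choices (the macOS keychain helper, and Git's in-memory cache helper with a one-hour timeout) are illustrative defaults rather than DM recommendations.

```shell
# suggest_credential_helper OS: print a credential.helper value for a `uname -s` result.
# Hypothetical helper; the choices are illustrative defaults only.
suggest_credential_helper() {
    case "$1" in
        Darwin) echo "osxkeychain" ;;          # macOS keychain helper
        *)      echo "cache --timeout=3600" ;; # in-memory cache, forgotten after one hour
    esac
}

# Print the command to run rather than changing the global configuration directly.
helper=$(suggest_credential_helper "$(uname -s)")
echo "To use it, run: git config --global credential.helper '$helper'"
```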

Once a helper is set up, you can cache your credentials by cloning any of DM’s LFS-backed repositories. For example, run:

git clone https://github.com/lsst/testdata_decam.git

git clone will ask you to authenticate with DM’s git-lfs server:

Username for 'https://git-lfs.lsst.codes': <GitHub username>
Password for 'https://<GitHub username>@git-lfs.lsst.codes': <GitHub password>

At the prompts, enter your GitHub username and password.

If you have GitHub’s two-factor authentication enabled, use a personal access token instead of a password. You can set up a personal token at https://github.com/settings/tokens.

Once your credentials are cached, you won’t need to repeat this process on your system (unless you opted for the cache-based credential helper).

That’s it. Read the rest of this page to learn how to work with Git LFS repositories.

Using Git LFS-enabled repositories

Git LFS operates transparently to the user. Use the repository as you would any other Git repository. All of the regular Git commands work, whether or not you are working with LFS-managed files.

There are two caveats for working with LFS: HTTPS is always used, and Git LFS must be told to track new binary file types.

First, DM’s LFS implementation mandates the HTTPS transport protocol. Developers used to working with ssh-agent for passwordless GitHub interaction should use a Git credential helper, and follow the directions above for configuring their credentials.

Note that this does not preclude using the git:// or SSH protocols for the Git remote itself; only the LFS traffic always uses HTTPS.

Second, in an LFS-backed repository, you need to specify what files are stored by LFS rather than regular Git storage. You can run

git lfs track

to see what file types are being tracked by LFS in your repository. We describe how to track additional file types below.

Tracking new file types

Only file types that are specifically tracked are stored in Git LFS rather than the standard Git storage.

To see what file types are already being tracked in a repository:

git lfs track

To track a new file type (FITS files, for example):

git lfs track "*.fits"

Git LFS stores information about tracked types in the .gitattributes file. This file is part of the repo and tracked by Git itself.
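
Each git lfs track call adds a line such as `*.fits filter=lfs diff=lfs merge=lfs -text` to .gitattributes. To read the tracked patterns straight from that file, for instance in a script, something like the following sketch could be used. lfs_patterns is a hypothetical helper name, and it assumes one pattern per line with filter=lfs among the attributes.

```shell
# lfs_patterns FILE: print the patterns in a .gitattributes file that use the LFS filter.
# Hypothetical helper for illustration; `git lfs track` reports the same information.
lfs_patterns() {
    awk '!/^#/ && /filter=lfs/ { print $1 }' "$1"
}
```

Run it from the repository root as lfs_patterns .gitattributes.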

You can git add, commit and do any other Git operations against these Git LFS-managed files.

To see what files are being managed by Git LFS, run:

git lfs ls-files
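
In the Git repository itself, an LFS-managed file is represented by a small text pointer whose first line is `version https://git-lfs.github.com/spec/v1`. If a repository was cloned before git-lfs was installed and configured, the working copy may contain these pointers instead of the real data. The sketch below checks a file for that header; is_lfs_pointer is a hypothetical helper name and not part of git-lfs.

```shell
# is_lfs_pointer FILE: succeed if FILE starts with the Git LFS pointer header.
# Hypothetical helper, not part of git-lfs. Only the first 64 bytes are read,
# so it is safe to run on large binary files.
is_lfs_pointer() {
    head -c 64 "$1" | grep -q '^version https://git-lfs\.github\.com/spec/v1'
}
```

For example, is_lfs_pointer image.fits && echo "still a pointer" reports a file whose content has not been downloaded (image.fits is a placeholder file name).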

Creating a new Git LFS-enabled repository

Configuring a new Git repository to store files with DM’s Git LFS is easy. First, initialize the current directory as a repository:

git init .

Make files called .lfsconfig and .gitconfig within the repository, and write these lines into both:

[lfs]
     url = https://git-lfs.lsst.codes
     batch = false

Note that older versions of Git LFS used .gitconfig rather than .lfsconfig. As of Git LFS version 1.1, .gitconfig is deprecated, but support will not be dropped until LFS version 2. New repositories should still include both configuration files because DM’s Jenkins server uses a pre-1.1 git-lfs client.
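
Because both files must carry identical content, it can help to generate them from a single source. A minimal sketch follows; write_lfs_config is a hypothetical helper name, and the directory argument is assumed to be the root of the new repository.

```shell
# write_lfs_config DIR: create identical .lfsconfig and .gitconfig files in DIR
# containing the [lfs] settings shown above. Hypothetical helper for illustration.
write_lfs_config() {
    for name in .lfsconfig .gitconfig; do
        printf '[lfs]\n     url = https://git-lfs.lsst.codes\n     batch = false\n' > "$1/$name"
    done
}
```

From the repository root, write_lfs_config . creates both files.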

Next, track some file types. For example, to have *.fits and *.gz files tracked by Git LFS:

git lfs track "*.fits"
git lfs track "*.gz"

Add and commit the .lfsconfig, .gitconfig, and .gitattributes files to your repository.
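
That step might look like the sketch below. commit_lfs_config is a hypothetical helper name and the commit message is only an example; the function assumes the three files already exist in the repository at the given directory.

```shell
# commit_lfs_config DIR: stage and commit the three LFS-related files in the repository at DIR.
# Hypothetical helper; the commit message is only an example.
commit_lfs_config() {
    git -C "$1" add .lfsconfig .gitconfig .gitattributes &&
    git -C "$1" commit -q -m "Configure Git LFS"
}
```

From the repository root, commit_lfs_config . performs both steps.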

You can then push the repository to GitHub with:

git remote add origin <remote repository URL>
git push origin master

We also recommend that the repository’s README link to this documentation page, to help those who aren’t familiar with DM’s Git LFS. For example, include this section:

Git LFS
-------

To clone and use this repository, you'll need Git Large File Storage (LFS).

Our [Developer Guide](http://developer.lsst.io/en/latest/tools/git_lfs.html)
explains how to set up Git LFS for LSST development.