Using Git LFS (Large File Storage) for data repositories

This page describes how to use Git LFS for DM development.

DM uses Git LFS to manage test datasets within our normal Git workflow. Git LFS is developed by GitHub, though DM uses its own backend storage infrastructure (see SQR-001: The Git LFS Architecture for background).

All DM repositories should use Git LFS to store binary data, such as FITS files, for CI. Examples of LFS-backed repositories are lsst/afwdata, lsst/testdata_ci_hsc, lsst/testdata_decam and lsst/testdata_cfht.

Installing Git LFS

In most Science Pipelines installations, including those in the Rubin Science Platform, git lfs is already installed as part of the rubin-env conda metapackage.

Otherwise, you can download and install the git-lfs client by visiting the Git LFS homepage. Many package managers, like Homebrew on the Mac, also provide git-lfs (brew install git-lfs for example).

We recommend using the latest Git LFS client. The minimum usable client version for LSST is git-lfs 2.3.4.

Git LFS requires Git version 1.8.2 or later to be installed.
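
The version requirements above can be checked with a short script. This is a minimal sketch, not part of the DM tooling: the helper names version_ge, extract_version and check_min_version are hypothetical, and it assumes a sort that supports the -V (version sort) option.

```shell
#!/usr/bin/env bash
# Sketch: compare installed Git and Git LFS versions against the minimums above.
# All function names here are illustrative, not part of Git or Git LFS.

# version_ge A B: succeeds when version A >= version B (assumes `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# extract_version: print the first dotted version number found on stdin.
extract_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# check_min_version NAME FOUND REQUIRED: report whether FOUND meets REQUIRED.
check_min_version() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: OK (need >= $3)"
    else
        echo "$1 $2: too old (need >= $3)"
        return 1
    fi
}

# Example calls (commented out so that sourcing this file has no side effects):
# check_min_version git "$(git --version | extract_version)" 1.8.2
# check_min_version git-lfs "$(git lfs version | extract_version)" 2.3.4
```

Uncommenting the last two lines runs the check against whatever git and git-lfs are first on your PATH.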

Before you can use Git LFS with LSST data, configure it as described in the next section.

Configuring Git LFS

Basic configuration

Run this command, and apply the LSST configuration below, once on every machine where you plan to work with LSST LFS repositories. The command adds a filter "lfs" section to ~/.gitconfig:

git lfs install

Configuration for LSST

LSST uses its own Git LFS servers. This section describes how to configure Git LFS to pull from LSST’s servers.

First, add these lines into your ~/.gitconfig file:

# Cache anonymous access to DM Git LFS S3 servers
[credential "https://lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com"]
    helper = store
[credential "https://s3.lsst.codes"]
    helper = store

Then add these lines to your ~/.git-credentials file (create it, if necessary):

https://:@lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com
https://:@s3.lsst.codes
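
If you prefer not to edit these files by hand, the same configuration can be applied from the command line. The sketch below is one possible approach, under the assumption that your global Git configuration lives in ~/.gitconfig. The function name configure_lsst_lfs is hypothetical; the host names are the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch: apply the anonymous-access configuration above without hand-editing.
# configure_lsst_lfs is an illustrative name, not an LSST-provided command.

configure_lsst_lfs() {
    local cred_file="$HOME/.git-credentials"
    local host
    for host in lsst-sqre-prod-git-lfs.s3-us-west-2.amazonaws.com s3.lsst.codes; do
        # Equivalent to the [credential "https://<host>"] / helper = store stanza.
        git config --global "credential.https://${host}.helper" store
        # Append the empty username/password entry only if it is not already there.
        touch "$cred_file"
        grep -qxF "https://:@${host}" "$cred_file" || echo "https://:@${host}" >> "$cred_file"
    done
    # The credential store is plain text, so keep it readable by the owner only.
    chmod 600 "$cred_file"
}
```

Calling configure_lsst_lfs more than once is harmless: git config overwrites the existing helper value, and the grep guard prevents duplicate credential lines.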

Try cloning a small data repository to test your configuration:

git clone https://github.com/lsst/testdata_subaru

If the resulting directory is about 220 MB in size, as measured by du -sh testdata_subaru, Git LFS is configured correctly.
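
If the clone is much smaller than expected, the data files are probably still Git LFS pointer stubs: small text files whose first line is "version https://git-lfs.github.com/spec/v1". The following sketch looks for such stubs. The function names is_lfs_pointer and list_lfs_pointers are hypothetical helpers, not Git LFS commands.

```shell
#!/usr/bin/env bash
# Sketch: detect files that are still Git LFS pointer stubs rather than real data.
# Both function names are illustrative.

# is_lfs_pointer FILE: succeeds when FILE begins with the LFS pointer header.
is_lfs_pointer() {
    # The header is exactly 42 bytes long, so only that prefix is read.
    head -c 42 "$1" 2>/dev/null | grep -qxF 'version https://git-lfs.github.com/spec/v1'
}

# list_lfs_pointers DIR: print every pointer stub under DIR, skipping .git.
list_lfs_pointers() {
    find "$1" -name .git -prune -o -type f -print | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then
            echo "$f"
        fi
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# list_lfs_pointers testdata_subaru
```

An empty listing suggests the real file contents were downloaded; any paths printed are files that were not fetched from the LFS server.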

Authenticating for push access

If you want to push to an LSST Git LFS-backed repository, you’ll need to configure and cache your credentials, as described at Git authentication tokens.

Using Git LFS-enabled repositories

Git LFS operates transparently to the user. Just use the repo as you normally would any other Git repo. All of the regular Git commands just work, whether you are working with LFS-managed files or not.

There are three caveats for working with LFS: HTTPS is always used, Git LFS must be told to track new binary file types, and you usually need enough memory to hold the largest file.

First, DM’s LFS implementation mandates the HTTPS transport protocol. Developers used to working with ssh-agent for passwordless GitHub interaction should use a Git credential helper, and follow the directions above for configuring their credentials.

Note that this does not preclude using git+git or git+ssh for working with a Git remote itself; it is only the LFS traffic that always uses HTTPS.

Second, in an LFS-backed repository, you need to specify what files are stored by LFS rather than regular Git storage. You can run

git lfs track

to see what file types are being tracked by LFS in your repository. We describe how to track additional file types below.

Third, when cloning or fetching files in an LFS-backed repository, Git expands each file into memory before writing it. This can be a problem on notebook servers configured with limited memory. On these small servers, you can use the following workaround:

GIT_LFS_SKIP_SMUDGE=1 git clone <url>
cd <dir>
git lfs fetch
git lfs checkout

This works by skipping the automatic extraction by git and then manually extracting the files using git lfs, which does not have the same memory constraints.

Tracking new file types

Only file types that are specifically tracked are stored in Git LFS rather than the standard Git storage.

To see what file types are already being tracked in a repository:

git lfs track

To track a new file type (FITS files, for example):

git lfs track "*.fits"

Git LFS stores information about tracked types in the .gitattributes file. This file is part of the repo and tracked by Git itself.
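
Each tracked pattern appears in .gitattributes as a line such as "*.fits filter=lfs diff=lfs merge=lfs -text". As a rough sketch, you can ask Git whether a given path would be handled by LFS using the standard git check-attr command. The wrapper name is_lfs_tracked is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: ask Git whether a path would be routed through the LFS filter.
# is_lfs_tracked is an illustrative wrapper around `git check-attr`.

# is_lfs_tracked PATH: succeeds when .gitattributes assigns filter=lfs to PATH.
# Run it from inside the repository; the file does not need to exist yet.
is_lfs_tracked() {
    [ "$(git check-attr filter -- "$1")" = "$1: filter: lfs" ]
}

# Example calls (commented out so that sourcing this file has no side effects):
# is_lfs_tracked image.fits && echo "image.fits will be stored in Git LFS"
# is_lfs_tracked notes.txt || echo "notes.txt will be stored in regular Git"
```

This is a convenient check before the first git add of a new binary file, because a file committed before its type is tracked goes into regular Git storage.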

You can git add, commit and do any other Git operations against these Git LFS-managed files.

To see what files are being managed by Git LFS, run:

git lfs ls-files

Creating a new Git LFS-enabled repository

Configuring a new Git repository to store files with DM’s Git LFS is easy. First, initialize the current directory as a repository:

git init .

Make a file called .lfsconfig within the repository, and write these lines into it:

[lfs]
     url = https://git-lfs.lsst.codes

Next, track some file types. For example, to have FITS and *.gz files tracked by Git LFS:

git lfs track "*.fits"
git lfs track "*.gz"

Add and commit the .lfsconfig and .gitattributes files to your repository.
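
Before pushing, it can be worth confirming that .lfsconfig parses as expected. The sketch below reads the file with git config --file. The function names lfsconfig_url and check_lfsconfig are hypothetical, and the expected URL is the one given above.

```shell
#!/usr/bin/env bash
# Sketch: confirm that .lfsconfig points at the DM LFS server.
# Both function names are illustrative.

# lfsconfig_url [FILE]: print the lfs.url value in FILE (default: .lfsconfig).
lfsconfig_url() {
    git config --file "${1:-.lfsconfig}" --get lfs.url
}

# check_lfsconfig [FILE]: compare lfs.url against the URL from this page.
check_lfsconfig() {
    local expected="https://git-lfs.lsst.codes"
    local actual
    actual="$(lfsconfig_url "$@")" || { echo "no lfs.url entry found"; return 1; }
    if [ "$actual" = "$expected" ]; then
        echo "lfs.url OK: $actual"
    else
        echo "lfs.url is $actual, expected $expected"
        return 1
    fi
}

# Example call, run from the repository root (commented out to avoid side effects):
# check_lfsconfig
```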

You can then push the repo up to GitHub with

git remote add origin <remote repository URL>
git push origin main

We also recommend that you include the following section in the repository’s README. It links to this documentation page to help those who aren’t familiar with DM’s Git LFS:

Git LFS
-------

To clone and use this repository, you'll need Git Large File Storage (LFS).

Our [Developer Guide](https://developer.lsst.io/tools/git_lfs.html)
explains how to set up Git LFS for LSST development.