How can I track text files with proper line endings normalization using git-lfs?


I have a repository to which I would like to add large text data files. Due to their number and size (which can be up to approximately 100MB in some cases), I would like to track those files with git-lfs.

I've added such a file with git lfs track data.txt, and changed the default -text (which specifies a binary file) to text=auto in the .gitattributes file (as documented in git-scm's gitattributes documentation). This gives me a .gitattributes which looks like:

data.txt filter=lfs diff=lfs merge=lfs text=auto

And just to be sure, I have refreshed the repository. Even so, the file still seems to be tracked as a binary object, and consequently the end-of-line conversion is not applied on check-out (i.e. the file is checked out with the same line endings it was checked in with).

I've also tried text=crlf (and the variant text eol=crlf), with the same result. I have seen a number of documents and tutorials about using git-lfs, but they all seem to be geared towards tracking binary files (such as *.bin, images, audio files, ...).

Is there a way to make the file tracked as a large text file (and have the end-of-lines normalized as would be for regular text files) with git-lfs?

I am currently using git-lfs 1.5.2, and git for Windows 2.10.2 (64-bit version) on a Windows 7 platform, with the core.autocrlf=true configuration.


Solution

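  • After some more reading of git-scm's gitattributes documentation and some tinkering, I was able to achieve this by defining a custom filter based on git-lfs's own filter (which I found in ~/.gitconfig) and making use of Jonathan Leffler's unix-to-dos conversion with sed (and its inverse). The clean step strips the trailing CR from each line before handing the content to git-lfs, so the stored object has LF line endings; the smudge step appends a CR to each line on check-out, so the working copy has CRLF line endings:

    [filter "textlfs"]
      clean = sed $'s/\\r$//' %f | git-lfs clean
      smudge = git-lfs smudge -- %f | sed $'s/$/\\r/'
      required = true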
    

    which can then be used to track large text files on a Windows machine with a .gitattributes entry such as:

    data.txt filter=textlfs diff=textlfs merge=textlfs
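
The two sed conversions involved here (appending a CR to every line, and stripping a trailing CR from every line) are inverses of each other, which is easy to check outside of git. The following is a minimal sketch, not part of the original setup: the helper names to_crlf and to_lf are made up for illustration, and the carriage return is produced with printf rather than the $'...' quoting so that the snippet also runs in shells without that bash feature.

```shell
#!/bin/sh
# A literal carriage return; command substitution only strips trailing newlines.
cr=$(printf '\r')

# Hypothetical helpers mirroring the two sed conversions used by the filter.
to_crlf() { sed "s/\$/${cr}/"; }   # LF -> CRLF: append a CR to every line
to_lf()   { sed "s/${cr}\$//"; }   # CRLF -> LF: strip one trailing CR per line

sample=$(mktemp)
printf 'first line\nsecond line\n' > "$sample"

lf_bytes=$(wc -c < "$sample" | tr -d ' ')
crlf_bytes=$(to_crlf < "$sample" | wc -c | tr -d ' ')

# Converting to CRLF and back should reproduce the original bytes exactly.
if to_crlf < "$sample" | to_lf | cmp -s - "$sample"; then
  roundtrip=identical
else
  roundtrip=different
fi

echo "LF: $lf_bytes bytes, CRLF: $crlf_bytes bytes, round trip: $roundtrip"
rm -f "$sample"
```

If the round trip did not give back the original bytes, the filter pair would quietly alter the file on every check-in and check-out cycle.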
    

    This, however, forces every user of the repository to define the custom filter. For convenience you may ship the definition in a .gitconfig file at the root of your repository, but git does not read that file on its own: each user still has to include it manually with git config --local include.path ../.gitconfig. The filter above is meant for users on Windows; it is not appropriate on platforms with different line endings (such as Linux and macOS), where no conversion should take place. A more complex filter can handle both cases by checking the output of uname -s and converting only under a Windows shell. Because the command now contains semicolons, which start a comment in a git configuration file when left unquoted, the whole value has to be wrapped in double quotes, with the inner double quotes and backslashes escaped:

    [filter "textlfs"]
      clean = "case \"$(uname -s)\" in MINGW*|MSYS*|CYGWIN*) sed $'s/\\r$//' %f ;; *) cat %f ;; esac | git-lfs clean"
      smudge = "git-lfs smudge -- %f | case \"$(uname -s)\" in MINGW*|MSYS*|CYGWIN*) sed $'s/$/\\r/' ;; *) cat ;; esac"
      required = true
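
A platform check packed into a one-line filter command is easier to follow when spread out. The sketch below is an illustration under stated assumptions, not something taken from git or git-lfs: wanted_eol is a hypothetical helper, and the MINGW*, MSYS* and CYGWIN* patterns are assumed to cover the uname -s output of the common Windows shells (Git for Windows, MSYS2, Cygwin). Testing for the Windows shells and treating everything else as LF means that macOS, which reports Darwin, is left without conversion too.

```shell
#!/bin/sh
# wanted_eol NAME: print the line ending the working copy should use on a
# system whose `uname -s` output is NAME (hypothetical helper).
wanted_eol() {
  case "$1" in
    MINGW*|MSYS*|CYGWIN*) echo crlf ;;   # assumed names of Windows shells
    *)                    echo lf ;;     # Linux, Darwin, and anything else
  esac
}

# Show the decision for a few representative names, then for this machine.
for name in Linux Darwin MINGW64_NT-10.0 CYGWIN_NT-10.0; do
  echo "$name -> $(wanted_eol "$name")"
done
echo "this machine -> $(wanted_eol "$(uname -s)")"
```

Passing the name as an argument, instead of calling uname inside the helper, keeps the decision easy to try out for platforms you do not have at hand.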
    

    Finally, keep in mind that unless your large text files change significantly between updates, or are so big that they exceed file size limits (such as GitHub's), it may still be advantageous to handle them as standard text files (i.e. without git-lfs), since git can pack text files efficiently.
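
To judge whether any file actually comes near such a limit, it can help to list the large files before reaching for git-lfs. This is a minimal sketch, assuming a POSIX find; list_large_files is a hypothetical helper, and the 100 MB threshold in the usage line is only an example value echoing the file size mentioned in the question, so check the real limit of your hosting service.

```shell
#!/bin/sh
# list_large_files DIR BYTES: print regular files under DIR that are larger
# than BYTES bytes, skipping the repository's .git directory
# (hypothetical helper).
list_large_files() {
  find "$1" -path "$1/.git" -prune -o -type f -size +"$2"c -print
}

# Example: files above 100 MB (100 * 1024 * 1024 bytes) below the current directory.
list_large_files . $((100 * 1024 * 1024))
```

Files that show up in this list are the candidates for git-lfs; everything else can probably stay a regular text file with the usual line-ending handling.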