Tags: git, tree, githooks, git-filter

How to save a file as a git tree instead of a blob?


Many file formats have a tree-like structure (e.g. XML, tar, even MP3 if you treat the individual tags and frames as leaves). I was wondering whether there is any way to make git store such files as tree objects instead of blobs, so that the structure can be exploited, e.g. for diffing and merging.

So far I've considered using hooks or smudge/clean-filters, but both have shortcomings which I'd like to avoid:

  • Using a clean filter, which only rewrites the blob (i.e. the file contents), I could create and git add a tree in parallel and replace the blob with just enough information for the smudge filter to recreate the original file on checkout. However, git status would then claim that the "directory" the tree internally represents is missing, and the placeholder file would probably prevent a tree of the same name from being added.
  • Using a post-commit hook would interfere badly with git diff and similar commands.

So is there any sensible way to achieve this? Or should I stick to blobs and maybe modify the merge/diff driver instead?


Solution

  • Git itself attempts to be content-form agnostic. That is, to a first approximation it cares only about raw data—not even text vs binary, just "here is some data as a collection of files; please store it as such." (Linus' original vision did not, I think, include CR/LF conversion, and as long as that's never turned on, it will not damage binary data.)

    This agnosticism quickly breaks down. Comparing one commit with another starts by comparing files, but beyond the simplistic "pathname p/a/t/h in commit A must mean the same file as pathname p/a/t/h in commit B"—which works great when both paths exist and do name the same content—we quickly find that we need to compare similar-but-not-identical files and wish to do so on some sort of structural basis: line or word oriented diff, for instance. And, to handle renaming issues, if p/a/t/h becomes p/t/h or vice versa, we might like to match these files against each other even if they're only, say, 90% similar.

    (Other VCSes record some other kind of file identity, not just the path name, with each commit, either by recording directory operations or by assigning unique internal IDs to files. Git doesn't, so it has to rely on this similarity-detection system. Git's similarity detector is peculiar: it is not strictly line-oriented, so it can work on binary files, but it does detect line boundaries so that \r\n vs \n differences do not count against similarity.)
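To make the idea of a "not quite line oriented" similarity score concrete, here is a rough Python sketch. It is an approximation for illustration only, not Git's actual implementation: the 64-byte chunk limit, the function names, and the exact scoring formula are all assumptions.

```python
from collections import Counter

def chunks(data: bytes, max_len: int = 64) -> list:
    """Split at newlines or every max_len bytes; drop a CR that precedes LF."""
    out, cur = [], bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        # Treat "\r\n" like "\n" so line-ending changes do not hurt the score.
        if b == 0x0D and i + 1 < n and data[i + 1] == 0x0A:
            i += 1
            continue
        cur.append(b)
        i += 1
        if b == 0x0A or len(cur) >= max_len:
            out.append(bytes(cur))
            cur = bytearray()
    if cur:
        out.append(bytes(cur))
    return out

def similarity(a: bytes, b: bytes) -> float:
    """Fraction of bytes that sit in chunks common to both inputs (0.0 to 1.0)."""
    ca, cb = Counter(chunks(a)), Counter(chunks(b))
    common = sum(len(c) * min(k, cb[c]) for c, k in ca.items())
    size_a = sum(len(c) * k for c, k in ca.items())
    size_b = sum(len(c) * k for c, k in cb.items())
    total = max(size_a, size_b)
    return 1.0 if total == 0 else common / total

old = b"x\n" * 9 + b"y\n"
new = b"x\n" * 10
print(similarity(old, new))                      # one line out of ten differs
print(similarity(b"a\r\nb\r\n", b"a\nb\n"))      # only the line endings differ
```

Because the chunking falls back to fixed-size pieces when no newline is found, the same scoring works on binary input, which is the property described above.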

    Anyway, you certainly could take Git and modify it to add new object types that are "like trees" but with a different flavor. That would let you pick apart these structured files. How well it would work seems basically a research topic. Just jamming them in as trees would clearly not work so well, though: you would never know if some tree instance were a "derived tree" or a "real tree". To avoid changing some of Git's core code, you could perhaps insert your real vs derived/synthetic tree conversion at the point where Git reads and writes its index, and encode "real" vs "synthetic" into the "file names".
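As a sketch of what "encoding real vs synthetic into the file names" might look like, the following Python computes the tree object ID that a structured file would get if it were exploded into a Git tree. The `.synthetic` suffix, the `@attrs`/`@text` entry names, and the `NNNN-tag` child naming are hypothetical conventions invented for this example; the conversion is lossy (it ignores tail text and namespaces) and nothing is written to a repository.

```python
import hashlib
import xml.etree.ElementTree as ET

SYNTHETIC_SUFFIX = ".synthetic"  # hypothetical marker for a derived tree

def object_id(kind: bytes, body: bytes) -> str:
    """SHA-1 object ID: hash of '<kind> <size>\\0' followed by the body."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_id(entries: dict) -> str:
    """entries maps a name to bytes (a blob) or to a dict (a subtree)."""
    rows = []
    for name, value in entries.items():
        if isinstance(value, dict):
            # Subtrees sort as if their name ended in "/".
            rows.append((name + "/", b"40000", name, tree_id(value)))
        else:
            rows.append((name, b"100644", name, object_id(b"blob", value)))
    body = b""
    for _, mode, name, oid in sorted(rows, key=lambda r: r[0].encode()):
        body += mode + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return object_id(b"tree", body)

def xml_to_entries(elem) -> dict:
    """Map one XML element to tree entries (assumed, lossy layout)."""
    entries = {}
    if elem.attrib:
        lines = (f"{k}={v}\n" for k, v in sorted(elem.attrib.items()))
        entries["@attrs"] = "".join(lines).encode()
    if elem.text and elem.text.strip():
        entries["@text"] = elem.text.strip().encode()
    for i, child in enumerate(elem):
        # The numeric prefix preserves document order among siblings.
        entries[f"{i:04d}-{child.tag}"] = xml_to_entries(child)
    return entries

doc = ET.fromstring("<a x='1'><b>hi</b><c>there</c></a>")
root = {"doc.xml" + SYNTHETIC_SUFFIX: xml_to_entries(doc)}
print(tree_id(root))
```

With a layout like this, two versions of a document that differ in a single leaf share the object IDs of every unchanged subtree, which is what would make structural diffing cheap. The suffix is the only thing that distinguishes a derived tree from a real directory, so any code path that reads or writes the index would have to know about it.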

    If the raw data stored in each synthetic sub-tree is itself binary, you'll run into the usual headache that diffs are unusable. The pack file storage format (based on xdelta) is not line-oriented, but the pack heuristics, which make use of the path names, might not perform terribly well, so you might want to modify those too. The amount of compression you get from xdelta depends on the Shannon entropy of the input data: binary vs text is not actually an issue here, except insofar as typical text input has quite low entropy, while binary inputs tend to be less predictable.
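The entropy point can be illustrated with a small Python sketch. It uses zlib as a stand-in compressor rather than xdelta, and the sample inputs are made up for the example, so treat the numbers as indicative only.

```python
import math
import random
import zlib
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Empirical entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

rng = random.Random(0)  # fixed seed so the run is repeatable
text_like = b"<item id='1'>value</item>\n" * 400          # repetitive, text-like
noise = bytes(rng.getrandbits(8) for _ in range(len(text_like)))  # unpredictable

for label, data in (("text-like", text_like), ("noise", noise)):
    packed = len(zlib.compress(data))
    print(label, round(shannon_entropy(data), 2), len(data), packed)
```

The repetitive input shrinks to a small fraction of its size while the random bytes barely compress at all, regardless of whether either one counts as "text" or "binary".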