I'm analyzing revision histories, using git-archive to get the files at a particular revision (see https://stackoverflow.com/a/40811494/1168342). The approach works, but I'm trying to optimize for projects with many revisions: much processing is wasted archiving the files (via tar) and then unpacking them back into another directory (via tar again).
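Concretely, each extraction currently looks something like this ($revision and $dest stand in for the actual commit and output directory):

# pack the tree at $revision, then immediately unpack it into $dest
mkdir -p "$dest"
git archive --format=tar "$revision" | tar -xf - -C "$dest"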
I'm looking for a way to do this without involving tar, something like a git cp $revision $dest/. Here's what I've explored so far:
I could use the git reset --hard $revision approach with a file copy, but that defeats parallelization of the analysis unless I create multiple copies of the repo (one for each thread/process).
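For concreteness, the serial version would look roughly like this (rsync's --exclude is just one way to avoid copying the .git directory):

# the working tree can only hold one revision at a time,
# so concurrent workers would each need their own clone
git reset --hard "$revision"
rsync -a --exclude=.git ./ "$dest/"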
There is a Java project called Doris that uses JGit to accomplish this with low-level operations, but it breaks when there are weird files (e.g., links to other repos). As git has evolved, a lot of special cases have accumulated, so I don't want to work at that low a level if possible.
I know there's a git API for Python, but its archive feature also uses tar. For the same reasons as above, I didn't want to code this at too low a level.
Use:
mkdir <path> &&
GIT_INDEX_FILE=<path>/.git git --work-tree=<path> checkout <revision> -- . &&
rm <path>/.git
The git checkout step will overwrite the index, so to make this parallelize well, we can just point the index file into the target. There's one file name that's pretty sure to be safe: .git!
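For example, here is a sketch of extracting several revisions concurrently, each with a private index file (the /tmp/extract destination root is just an example):

for rev in $(git rev-list --max-count=4 HEAD); do
  (
    dest=/tmp/extract/$rev
    mkdir -p "$dest" &&
    GIT_INDEX_FILE="$dest/.git" git --work-tree="$dest" checkout "$rev" -- . &&
    rm "$dest/.git"
  ) &  # each extraction runs in the background with its own index
done
wait   # block until every extraction has finished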
(This is like a lighter-weight version of git worktree add that also avoids recording the new extracted tree as an active work-tree.)
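For comparison, the worktree route would look something like this (assuming git 2.17 or later for git worktree remove); the add step records $dest under .git/worktrees, which is exactly the bookkeeping the trick above skips:

git worktree add --detach "$dest" "$revision"
# ... analyze the files in $dest ...
git worktree remove "$dest"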
Edit to add a side note (I expect the OP is aware of this, but for future reference): note that git archive applies certain .gitattributes filters that this technique will not apply. In particular, git checkout will not obey export-ignore and export-subst directives.
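For instance, given a .gitattributes like the one below (paths made up), git archive would omit tests/ entirely and expand $Format:...$ placeholders inside version.txt, while the checkout-based extraction delivers both files verbatim:

# excluded from git archive output, but checked out normally
tests/ export-ignore
# $Format:...$ placeholders expanded by git archive only
version.txt export-subst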