I'm using git archive to export source code from a certain project folder src
in a repository in order to calculate its combined sha256 hash like so:
git archive HEAD --worktree-attributes -o project-archive.zip src/
sha256sum project-archive.zip | awk '{ print $1 }' > project-archive.zip.hash
My git attributes file lives at the root of the project and looks a bit like so:
integration_tests/ export-ignore
src/unit_tests export-ignore
src/.* export-ignore
.git* export-ignore
.config.yml export-ignore
*.md export-ignore
This works well for calculating the hash of my source, but I'm finding that modifications to project files that aren't included in the archive, such as .config.yml and integration_tests/foo.py, still modify the hash of the archive.
There aren't any erroneous files in the archive itself.
The sha256 hash of each of the .py files is unchanged.
I only see these changes to the archive hash after I've committed the unrelated (unarchived) changes, so I believe this to be a git behaviour or a misunderstanding of the git attributes config on my part.
Presumably there is some git metadata on the source files that I don't know about that affects the archive hash?
From the git archive documentation:
git archive behaves differently when given a tree ID versus when given a commit ID or tag ID. In the first case the current time is used as the modification time of each file in the archive. In the latter case the commit time as recorded in the referenced commit object is used instead. Additionally the commit ID is stored in a global extended pax header if the tar format is used; it can be extracted using git get-tar-commit-id. In ZIP files it is stored as a file comment.
(all bold-face is mine). Since you are supplying a commit ID via HEAD, that commit ID is stored in the zip archive as a file comment. If you zip up two different commits that, after zipping, are the same except for this file comment hash ID, the overall checksum of the zip file will differ. (Strip out the hash ID and the overall checksum of the zip files should match, except for the time-stamp issue below.)
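To see that comment for yourself, a minimal Python sketch along these lines should work. It uses only the standard zipfile module; the function name read_zip_comment is my own invention, not something Git provides.

```python
import zipfile


def read_zip_comment(source):
    """Return the archive-level comment of a ZIP file as text.

    `source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        # ZipFile.comment is a bytes object; a commit ID is plain ASCII hex.
        return zf.comment.decode("ascii", errors="replace")


# Hypothetical usage, assuming the archive from the question exists:
# print(read_zip_comment("project-archive.zip"))
```

For an archive made from a commit, the value printed should match the output of git rev-parse for that commit, going by the documentation quoted above.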
One solution is obvious: supply the commit's tree ID rather than HEAD, e.g., use HEAD^{tree}. Unfortunately that will immediately run you into the first non-bolded sentence: the current time is used as the modification time of each file in the archive. So you'd have to set the computer clock back. You could keep using HEAD literally, but then you'll get the new HEAD commit's time, rather than the previous HEAD commit's time, which leads right back to the same problem.
If there is some way to use an existing archive to re-set the time stamps on the new archive, or if you can compare the files shorn of time-stamps (and switch back to the old archive if all are identical) or compute a hash of the archive minus the time stamps, that would do the job.
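As a rough illustration of the last option, here is a sketch that hashes only the member names and file contents of the zip, so neither the time stamps nor the file comment can influence the result. It is Python rather than shell because the standard zipfile module makes this easy; archive_content_hash is a name I made up, and the length-prefixed encoding is just one reasonable choice.

```python
import hashlib
import zipfile


def archive_content_hash(source):
    """SHA-256 over member names and contents of a ZIP archive.

    Time stamps, file modes, and the archive comment are deliberately ignored.
    `source` may be a path or a binary file-like object.
    """
    digest = hashlib.sha256()
    with zipfile.ZipFile(source) as zf:
        # Sort the names so the result does not depend on entry order.
        for name in sorted(zf.namelist()):
            encoded_name = name.encode("utf-8")
            data = zf.read(name)
            # Length-prefix each field so name/content boundaries are unambiguous.
            digest.update(len(encoded_name).to_bytes(8, "big"))
            digest.update(encoded_name)
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


# Hypothetical usage, assuming the archive from the question exists:
# print(archive_content_hash("project-archive.zip"))
```

Note that this sketch also ignores file modes, so a change to an executable bit alone would not alter the hash; if that matters, the external_attr field of each entry could be fed into the digest as well.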
There is no argument to git archive that will achieve what you want. Modifying the Git source itself could allow you to specify a particular time-stamp; see this region of the source code.
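A route that avoids touching Git is to post-process the archive: rewrite it with one fixed time stamp on every entry and no file comment, then checksum the rewritten bytes. The sketch below is only an illustration under a few assumptions: normalized_zip_bytes and FIXED_TIME are names I picked, entries are stored uncompressed so the output does not depend on the zlib version, and the result is only expected to be byte-identical when produced by the same Python on the same platform.

```python
import io
import zipfile

# Assumption: any constant works; 1980-01-01 is the earliest date ZIP can store.
FIXED_TIME = (1980, 1, 1, 0, 0, 0)


def normalized_zip_bytes(source):
    """Rewrite a ZIP with a fixed time stamp on every entry and no comment.

    `source` may be a path or a binary file-like object.
    Returns the bytes of the rewritten archive.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(source) as zin, zipfile.ZipFile(out, "w") as zout:
        # Sort entries so the output does not depend on the original order.
        for info in sorted(zin.infolist(), key=lambda i: i.filename):
            clean = zipfile.ZipInfo(info.filename, date_time=FIXED_TIME)
            clean.external_attr = info.external_attr  # keep file mode bits
            clean.compress_type = zipfile.ZIP_STORED  # no compression
            zout.writestr(clean, zin.read(info))
        # zout.comment keeps its default (empty), which drops the commit ID.
    return out.getvalue()


# Hypothetical usage, assuming the archive from the question exists:
# import hashlib
# print(hashlib.sha256(normalized_zip_bytes("project-archive.zip")).hexdigest())
```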