Tags: git, github, tags, branch, git-commit

How can a specific git repo contain only one branch, and yet each of its 94 tags returns files which do not appear in the branch?


I see in this publicly available GitHub repo, https://github.com/ServiceNow/devtraining-needit-quebec, only one branch, main, which contains three small files and a folder with a small file:

  • update [folder]
    • sys_ui_view_5bd975bd0fa03200cd674f8ce1050e7f.xml
  • README.md
  • checksum.txt
  • sys_app_6ead8e780f603200cd674f8ce1050ed1.xml

There is only one branch, main, but there are 94 tags. Each tag has associated with it 2 "assets" which appear unique to the tag. Every tag has a source code .zip and source code .tgz. These archive files appear to me to each be unique with a unique set of files in them. By unique I mean all files in all archives and the archives themselves are mutually exclusive of each other.

My understanding of git is that each commit has a unique identifier which represents a pointer to the changes between the sum of all previous chunks and the current. Each "file" will always be part of a commit, and each commit must always be in at least one branch. In other words no commit can exist outside of a branch otherwise it is not a commit. Tags are only a unique bookmark to the state of a repo. It could have also been the state of a branch at a point in time.

How then, can a tag represent a unique set of files that lives completely outside of any branch? The only explanations I can imagine so far are 1. my understanding of git is wrong or incomplete. 2. GitHub has extended git in some way and my understanding of GitHub is also wrong or incomplete.


Solution

  • Let's start with this correction, which is itself somewhat minor but does get used soon:

    My understanding of git is that each commit has a unique identifier

    Yes: this is a hash ID, currently SHA-1 hashes.

    which represents a pointer to the changes between the sum of all previous chunks and the current.

    No: each commit holds metadata (information such as who made it and when) and one full snapshot. No commit holds changes (well, except if some snapshotted file is itself composed of data that represents changes). Technically even the snapshot itself is indirect, through a tree line in the metadata, but all commits are required to have exactly one tree.
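
To see the "metadata plus one full snapshot" point for yourself, here is a minimal sketch in a throwaway repository. The file names, identity, and commit messages are placeholders I made up, not anything from the repo in the question.

```shell
# Sketch only: a scratch repository; every name below is a placeholder.
repo=$(mktemp -d)
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'
git init -q "$repo"

# First commit: two files.
echo 'alpha' > "$repo/a.txt"
echo 'beta'  > "$repo/b.txt"
git -C "$repo" add a.txt b.txt
git -C "$repo" commit -q -m 'first snapshot'

# Second commit: only a.txt changes.
echo 'alpha, edited' > "$repo/a.txt"
git -C "$repo" add a.txt
git -C "$repo" commit -q -m 'second snapshot'

# The raw commit object: a tree line, a parent line, then the rest of the metadata.
git -C "$repo" cat-file -p HEAD

# The tree behind the second commit still lists both files (a full snapshot).
git -C "$repo" ls-tree -r HEAD

# The unchanged file is the very same blob in both snapshots.
b_old=$(git -C "$repo" rev-parse 'HEAD~1:b.txt')
b_new=$(git -C "$repo" rev-parse 'HEAD:b.txt')
echo "b.txt blob: $b_old (first) vs $b_new (second)"
```

The second commit touched one file, yet its tree names both. What looks like "changes" is computed by comparing two snapshots when you ask for a diff.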

    On (or back) to the tags:

    Each tag has associated with it 2 "assets" which appear unique to the tag. Every tag has a source code .zip and source code .tgz.

    These are likely generated by GitHub on the fly. Each tag name holds a unique identifier: a hash ID. That hash ID locates either a commit object (making the tag a "lightweight tag", which GitHub won't call a release) or a tag object (making it an annotated tag). An annotated tag object holds metadata, and part of this metadata is normally the hash ID of some commit.
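
The two kinds of tag can be told apart with git cat-file -t. This is a rough sketch in a scratch repository; the tag names light-tag and annotated-tag and the identity are invented for the illustration.

```shell
# Sketch only: a scratch repository; tag names and identity are placeholders.
repo=$(mktemp -d)
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'
git init -q "$repo"
echo 'v1' > "$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" commit -q -m 'release content'

# A lightweight tag: the ref stores the commit's hash ID directly.
git -C "$repo" tag light-tag

# An annotated tag: the ref stores the hash ID of a tag object,
# and that tag object in turn names the commit.
git -C "$repo" tag -a -m 'annotated' annotated-tag

light_type=$(git -C "$repo" cat-file -t light-tag)
annotated_type=$(git -C "$repo" cat-file -t annotated-tag)
echo "light-tag points at a:     $light_type"
echo "annotated-tag points at a: $annotated_type"

# Peeling the annotated tag yields the same commit the lightweight tag names.
head=$(git -C "$repo" rev-parse HEAD)
peeled=$(git -C "$repo" rev-parse 'annotated-tag^{commit}')
echo "HEAD:   $head"
echo "peeled: $peeled"
```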

    How then, can a tag represent a unique set of files that lives completely outside of any branch?

    To answer that, back to the branch stuff:

    Each "file" will always be part of a commit,

    Not necessarily, but usually. Git stores files in parts: tree objects hold path-name components, and blob objects store file content (which is shared by every user of the same content). These objects also have hash IDs. A given content is not tied to one file: a blob holding the literal ../symlink, for instance, might represent both a symbolic link (via a file name stored in a tree object with mode 120000) and an ordinary file's content (via another file name stored in some tree object, but this time with mode 100644). As a data file, it has no newline at the end, but it is still a valid data file.

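    (In fact, we can calculate its hash: the length of ../symlink is ten bytes, so we want the SHA-1 sum of blob 10\0../symlink. The sha1 command below is the BSD/macOS tool; on Linux, sha1sum prints the same digest followed by a file-name field: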

    $ printf 'blob 10\0../symlink' | sha1
    54f939943aafe4022f2d20855230e33cafe1a8f9
    

    or:

    $ printf '../symlink' | git hash-object -t blob --stdin
    54f939943aafe4022f2d20855230e33cafe1a8f9
    

    so every data file with exactly this content has exactly this hash ID.)

    and each commit must always be in at least one branch.

    No; commits are in zero or more branches. (Aha :-) )

    For a commit to be in a branch, it must be reachable from the branch tip. A branch name is a reference of the form refs/heads/name and it holds the hash ID of a commit object. That commit object's metadata holds, among other things, the hash IDs of any immediate-predecessor or parent commits, and the transitive closure across all these commits tells us which commits are "in the branch".

    For a commit to remain live, however, it merely needs to be reachable from any reference. Tag names are references of the form refs/tags/name. As long as they hold the hash ID of a commit, or of a tag object that names a commit, that commit and all its predecessors are retained.
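
Here is a small sketch of a commit that is in zero branches and is kept alive only by a tag. It is a scratch repository under my own assumptions: the tag name v1, the file names, and the identity are all placeholders, and I am not claiming this is how the repo in the question was built.

```shell
# Sketch only: a scratch repository; v1 and the file names are placeholders.
repo=$(mktemp -d)
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'
git init -q "$repo"
# Name the initial branch main regardless of the local default.
git -C "$repo" symbolic-ref HEAD refs/heads/main

echo 'readme' > "$repo/README.md"
git -C "$repo" add README.md
git -C "$repo" commit -q -m 'main content'

# Detach HEAD so the next commit does not move any branch name.
git -C "$repo" checkout -q --detach
echo 'other' > "$repo/only-in-tag.txt"
git -C "$repo" add only-in-tag.txt
git -C "$repo" commit -q -m 'tag-only content'

# Only refs/tags/v1 remembers this commit.
git -C "$repo" tag v1
git -C "$repo" checkout -q main

branches=$(git -C "$repo" branch --contains v1)
tags=$(git -C "$repo" tag --contains v1)
echo "branches containing v1: [$branches]"
echo "tags containing v1:     [$tags]"

# Files in the tagged snapshot versus files on main.
git -C "$repo" ls-tree --name-only v1
git -C "$repo" ls-tree --name-only main
```

The branch list comes back empty while the tag list names v1, and the tagged snapshot contains a file that main never had. Do that 94 times with different content and you get a repository that looks like the one in the question.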

    The .tgz and .zip files would be those that GitHub makes using git archive and the target commit of the tag. They may well cache these things as well, but that part is up to GitHub.
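
You can produce comparable archives locally from any tag. This is only a sketch: the exact options GitHub uses are not known to me, and the tag name v1, the project-v1/ prefix, and the output paths are placeholders.

```shell
# Sketch only: a scratch repository; v1, the prefix, and the paths are placeholders.
repo=$(mktemp -d)
out=$(mktemp -d)
export GIT_AUTHOR_NAME='Example' GIT_AUTHOR_EMAIL='example@example.invalid'
export GIT_COMMITTER_NAME='Example' GIT_COMMITTER_EMAIL='example@example.invalid'
git init -q "$repo"
echo 'v1' > "$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" commit -q -m 'release content'
git -C "$repo" tag v1

# Archive the snapshot the tag resolves to, once as .tgz and once as .zip.
git -C "$repo" archive --format=tar.gz --prefix=project-v1/ -o "$out/v1.tgz" v1
git -C "$repo" archive --format=zip    --prefix=project-v1/ -o "$out/v1.zip" v1

# List what ended up in the tarball.
tar -tzf "$out/v1.tgz"
```

Because git archive reads the commit's tree rather than the working directory, the result depends only on what the tag points at, not on which branch happens to be checked out.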