
Assignment of mercurial global changeset id


Apparently Mercurial assigns a global changeset id to each change. How do they ensure that this is unique?


Solution

  • As Zach says, the changeset ID is computed using the SHA-1 hash function. This is an example of a cryptographic hash function. Cryptographic hash functions take an input string of arbitrary length and produce a fixed-length digest from this string. In the case of SHA-1, the output length is fixed at 160 bits, of which Mercurial by default only shows you the first 48 bits (12 hexadecimal digits).
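    To make the sizes concrete, here is a minimal sketch using Python's standard hashlib module. The input bytes are an arbitrary placeholder, not a real changeset; the point is only the digest length and the 12-digit short form.

```python
import hashlib

# Arbitrary placeholder input; any byte string of any length works.
data = b"example changeset contents"

digest = hashlib.sha1(data).digest()        # raw digest, 20 bytes
full_hex = hashlib.sha1(data).hexdigest()   # same digest as 40 hex digits
short_hex = full_hex[:12]                   # first 48 bits, the short form

print(len(digest) * 8)   # 160
print(len(full_hex))     # 40
print(short_hex)
```

    The short form is just a prefix of the full ID, so it is convenient for display but carries far fewer bits than the full hash.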

    Cryptographic hash functions have the property that it is extremely difficult to find two different inputs that produce the same output, that is, it is hard to find strings x != y such that H(x) == H(y). This is called collision resistance.

    Since Mercurial uses the SHA-1 function to compute the changeset ID, you get the same changeset ID for identical inputs (identical changes, identical committer names and dates). However, if you use different inputs (x != y), then in practice you will get different outputs (changeset IDs) because of the collision resistance.

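    Put differently, if you do not get different changeset IDs for different input, then you have found a collision for SHA-1! The first publicly known SHA-1 collision was announced only in 2017, and it had to be constructed deliberately at enormous computational cost, so running into one by accident while committing is not a practical concern.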


    In more detail, the SHA-1 hash function is used in a recursive way in Mercurial. Each changeset hash is computed by concatenating:

    • manifest ID
    • commit username
    • commit date
    • affected files
    • commit message
    • first parent changeset ID
    • second parent changeset ID

    and then running SHA-1 on all this (see changelog.py and revlog.py). Because the hash function is used recursively, the changeset hash will fix the entire history all the way back to the root in the changeset graph.
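    The recursive structure can be sketched in a few lines of Python. This is a simplified illustration, not Mercurial's actual code: the function names, the NULL_ID placeholder and the byte layout of the fields are assumptions made for the example, and the real format is defined in changelog.py and revlog.py.

```python
import hashlib

NULL_ID = b"\0" * 20  # placeholder for "no parent" (assumption of this sketch)


def changeset_id(manifest_id, user, date, files, message,
                 p1=NULL_ID, p2=NULL_ID):
    """Return a 20-byte ID covering the changeset fields and both parents.

    The field layout here is illustrative, not Mercurial's exact format.
    """
    text = b"\n".join([manifest_id, user, date, b"\n".join(files), b"", message])
    h = hashlib.sha1()
    h.update(p1)    # first parent changeset ID
    h.update(p2)    # second parent changeset ID
    h.update(text)  # manifest ID, user, date, files, message
    return h.digest()


def build_chain(messages):
    """Commit the messages one after another and return the list of IDs."""
    ids = []
    parent = NULL_ID
    for number, message in enumerate(messages):
        manifest_id = hashlib.sha1(b"manifest " + str(number).encode()).digest()
        parent = changeset_id(manifest_id, b"alice", b"0 0",
                              [b"hello.txt"], message, p1=parent)
        ids.append(parent)
    return ids


original = build_chain([b"initial commit", b"add feature", b"fix bug"])
rewritten = build_chain([b"INITIAL commit", b"add feature", b"fix bug"])

# Changing only the root changeset changes every descendant ID as well.
for before, after in zip(original, rewritten):
    print(before.hex()[:12], after.hex()[:12], before != after)
```

    Because each ID is fed into the hash of its children, editing any ancestor forces new IDs on everything that descends from it.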

    This also means that you won't get the same changeset ID if you add the line Hello World! to two different projects at the same time with the same commit message: because their histories are different (different parent changesets), the two new changesets will get different IDs.
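    A small sketch of that last point, again with a hypothetical, simplified ID function (toy_changeset_id is not a Mercurial API, and the parent IDs are made-up stand-ins for two unrelated histories):

```python
import hashlib


def toy_changeset_id(parent_id, user, date, message):
    # Hypothetical simplified ID: SHA-1 over the parent ID plus a few fields.
    payload = b"\n".join([parent_id, user, date, message])
    return hashlib.sha1(payload).hexdigest()


# Stand-ins for the tip changesets of two unrelated projects.
parent_a = hashlib.sha1(b"history of project A").hexdigest().encode()
parent_b = hashlib.sha1(b"history of project B").hexdigest().encode()

# The very same commit (same user, date and message) made in both projects.
commit = dict(user=b"alice", date=b"0 0", message=b"Add Hello World!")

id_a = toy_changeset_id(parent_a, **commit)
id_b = toy_changeset_id(parent_b, **commit)
id_a_again = toy_changeset_id(parent_a, **commit)

print(id_a[:12], id_b[:12])
print(id_a == id_b)        # False: different parents, different IDs
print(id_a == id_a_again)  # True: identical inputs, identical ID
```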