github, SHA-1 hash and git duplicates


I have a github repo which seems to have duplicate commits. Each commit has the same message string and the same date/author, but different SHA-1 hash sums. For example, in my log I found the following quadruple:

 'commit 55e55517bf32b7ba7382b97f41a1514af8a5f5dc',
 'Author: dermen <[email protected]>',
 'Date:   Tue Feb 19 20:03:35 2013 -0800',
 'finished with the cromermann edition',
 'commit 814fb08e0d42588a500947cba42a980ac24c01b8',
 'Author: dermen <[email protected]>',
 'Date:   Tue Feb 19 20:03:35 2013 -0800',
 'finished with the cromermann edition',
 'commit a5f581f513d12e95627669f61cfe27064ffe8319',
 'Author: dermen <[email protected]>',
 'Date:   Tue Feb 19 20:03:35 2013 -0800',
 'finished with the cromermann edition',
 'commit a264614b674e1ad2c4c8cc953cb27cf77c0d2615',
 'Author: dermen <[email protected]>',
 'Date:   Tue Feb 19 20:03:35 2013 -0800',
 'finished with the cromermann edition',

Everything is identical except for the SHA-1 hash. When I run for example

git diff 55e55517bf32b7ba7382b97f41a1514af8a5f5dc    814fb08e0d42588a500947cba42a980ac24c01b8

I get zero output - doesn't this mean the commits are identical? If this is true, then why would they have separate SHA-1 hash sums? Maybe I mis-understand, but shouldn't an SHA-1 hash directly represent content in a file? Hence, if the commits are equal, they should have the same hash.

In any case, I am wondering if it is wise / unwise to attempt to filter such apparent duplicates... Any tip/redirection will be appreciated.


Solution

  • A git commit sha is generated from the following information

    • commit message
    • author signature (identity + timestamp)
    • committer signature (identity + timestamp)
    • tree sha (hierarchy of directories and files within the commit)
    • list of the shas of the parent commits

    Since the shas are different, at least one of these pieces of information must differ between the commits.
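    As a rough illustration of why any one of these fields changes the sha, here is a minimal Python sketch of how Git hashes an object: SHA-1 over a small header (`<type> <size>` followed by a NUL byte) plus the object body. The helper names (`git_object_sha`, `commit_body`), the identities and the timestamps are made up for this example; only the tree sha (Git's well-known empty tree) is real.

```python
import hashlib


def git_object_sha(kind: str, body: bytes) -> str:
    # Git hashes "<kind> <size>\0" followed by the raw object body.
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, author, committer, message):
    # Same field order as the raw commit object: tree, parents,
    # author, committer, blank line, message.
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message).encode()


# Hypothetical values, for illustration only.
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # the empty tree
author = "A U Thor <author@example.com> 1361333015 -0800"
message = "finished with the cromermann edition\n"

# Identical tree, author and message; only the committer timestamp differs.
sha_a = git_object_sha(
    "commit", commit_body(tree, [], author, author, message))
sha_b = git_object_sha(
    "commit",
    commit_body(tree, [], author,
                "A U Thor <author@example.com> 1361333999 -0800", message))

print(sha_a)
print(sha_b)
print(sha_a != sha_b)
```

    The two commits point at the same tree, so a diff between them would be empty, yet their shas differ because the committer line is part of the hashed data.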

    To see what these values are for each commit (and how they differ from one commit to another), you can run the following command to get the raw output of each commit

    $ git show --format=raw <commit_sha>
    

    Example of the output of this command

    Based on a random commit of the libgit2 project

    $ git show --format=raw eb58e2d
    commit eb58e2d0be4e07c2ef873a5f0562eaa90826c2de
    tree 41959050b1e3adb428e140102a0c321949be516b
    parent 3b5001b4c911db9c47d62399c1adc03bd8a3ca72
    parent 3e9e6cdaff8acb11399736abbf793bf2d000d037
    author Vicent Marti <[email protected]> 1371063948 +0200
    committer Vicent Marti <[email protected]> 1371063948 +0200
    
        Merge remote-tracking branch 'arrbee/minor-paranoia' into development
    
    diff --cc src/refdb.c
    index 359842e,4271b58..6da409a
    --- a/src/refdb.c
    +++ b/src/refdb.c
    @@@ -86,9 -86,10 +86,10 @@@ int git_refdb_compress(git_refdb *db
            return 0;
      }
    
     -static void refdb_free(git_refdb *db)
     +void git_refdb__free(git_refdb *db)
      {
            refdb_free_backend(db);
    +       git__memset(db, 0, sizeof(*db));
            git__free(db);
      }
    

    Back to your questions

    I get zero output - doesn't this mean the commits are identical

    This means that the content of what is being pointed at by the commits is the same. But the metadata may certainly differ.
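    To find out which metadata fields differ, one option is to compare the raw headers of two commits. The sketch below assumes you have already captured the text printed by `git cat-file -p <commit_sha>` for each commit; the function names and the sample commits (shas, identities, timestamps) are hypothetical.

```python
def parse_commit_headers(raw: str) -> dict:
    # Header lines run until the first blank line; the message follows.
    headers = {}
    for line in raw.splitlines():
        if not line:
            break
        key, _, value = line.partition(" ")
        # "parent" may appear several times (merge commits), so keep lists.
        headers.setdefault(key, []).append(value)
    return headers


def differing_fields(raw_a: str, raw_b: str) -> list:
    a = parse_commit_headers(raw_a)
    b = parse_commit_headers(raw_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Two made-up commits: same tree and author, different parent and committer.
commit_a = """tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent 1111111111111111111111111111111111111111
author A U Thor <author@example.com> 1361333015 -0800
committer A U Thor <author@example.com> 1361333015 -0800

finished with the cromermann edition
"""

commit_b = """tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent 2222222222222222222222222222222222222222
author A U Thor <author@example.com> 1361333015 -0800
committer A U Thor <author@example.com> 1361339999 -0800

finished with the cromermann edition
"""

print(differing_fields(commit_a, commit_b))  # ['committer', 'parent']
```

    If `tree` is not in the result, the two commits reference the same snapshot, which is consistent with `git diff` printing nothing.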

    Maybe I mis-understand, but shouldn't an SHA-1 hash directly represent content in a file?

    In Git, SHA-1 hashes are used to identify Git objects: blobs (i.e. file contents), trees (i.e. lists of blobs and subtrees) and commits. You can find more information about this in chapter 9.2 Git Internals - Git Objects of the Pro Git book.

    For example, in my log I found the following quadruple

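    This may happen when you amend, rebase or fix up your commits, for instance. In these cases the author date (the Date shown by git log) is normally kept, while the committer timestamp changes; after a rebase the parent commits usually change as well.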

    In any case, I am wondering if it is wise / unwise to attempt to filter such apparent duplicates

    You don't have to clean up by yourself. Those objects are stored in the Git object database. Git implements a garbage collection mechanism which will regularly and automatically remove orphaned (unreachable) objects from it (see the git-gc documentation for more details).