How to find/identify large commits in git history?


I have a 300 MB git repo. The total size of my currently checked-out files is 2 MB, and the total size of the rest of the git repo is 298 MB. This is basically a code-only repo that should not be more than a few MB.

I suspect someone accidentally committed some large files (video, images, etc.) and then removed them... but not from git, so the history still contains useless large files. How can I find the large files in the git history? There are 400+ commits, so going one-by-one is not practical.

NOTE: my question is not about how to remove the file, but how to find it in the first place.


Solution

  • I've found this script very useful in the past for finding large (and non-obvious) objects in a git repository:


    #!/bin/bash
    #set -x
    
    # Shows you the largest objects in your repo's pack file.
    # Written for osx.
    #
    # @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
    # @author Antony Stubbs
    
    # list all objects including their size, sort by size, take top 10
    objects=$(git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head)
    
    echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."
    
    output="size,pack,SHA,location"
    allObjects=$(git rev-list --all --objects)
    # verify-pack columns: SHA, type, size, size-in-pack, offset.
    # read splits on runs of whitespace, so the padded type column
    # (e.g. "blob  " vs "commit") does not shift the size fields.
    while read -r sha _type bytes packedBytes _rest
    do
        # convert the uncompressed and compressed sizes from bytes to kB
        size=$((bytes / 1024))
        compressedSize=$((packedBytes / 1024))
        # find the object's location in the repository tree
        other=$(echo "${allObjects}" | grep "$sha")
        output="${output}\n${size},${compressedSize},${other}"
    done <<< "$objects"
    
    echo -e "$output" | column -t -s ', '
    

    That will give you the object name (SHA-1) of each large blob. You can then use a second script to find the commits that point to each of those blobs.
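
    For the blob-to-commit step, here is a minimal sketch of one way to do it. It is not the script the answer originally linked to. It assumes Git 2.16 or newer, where `git log` accepts `--find-object`, and the function name `commits_for_blob` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): list the commits
# that added or removed the blob with the given object ID.
# Assumes Git 2.16+, which introduced `git log --find-object`.
commits_for_blob() {
    local blob="$1"
    # --all walks every ref; --find-object keeps only commits whose diff
    # changes the number of occurrences of that object.
    git log --all --find-object="$blob" --format='%h %ad %s' --date=short
}
```

    Run it from inside the repository as `commits_for_blob <SHA>`, using a SHA from the SHA column printed by the script above. Each output line is an abbreviated commit hash, the author date, and the commit subject.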