python csv github automation git-commit

Retrieve all or recent commit history from all branches of a GitHub organization to CSV/JSON


I want to fetch the full commit history from a GitHub organization consisting of 225+ repos, private as well as public. I looked at many other solutions on Google and Stack Overflow but couldn't settle on a single one. I am looking for an automated solution that retrieves the full commit history once and can then be scheduled to fetch new commits from a particular date onward. I wasn't able to do this with the GitHub API because it restricts the number of API hits per day to the GitHub server.

Primarily, I am trying to write all commit information to a CSV file. Kindly share any Python code/script that serves this purpose.


Solution

  • I had a similar requirement earlier and solved it using the Python logic below. This assumes you have already cloned all the repos and that your SSH key or personal access token is set up in your global git configuration.

    1. Add all the repos to a list and loop over it to fetch the remote origin updates (ref heads) for all branches of each repo.
    2. Use git log --all to get all the commits initially, then automate weekly/monthly runs as required using git log --all --after={date}.
    3. Use git log's pretty format together with the csv module to append the output of every repo to a single CSV file.

    Please find the code snippet below.

    import csv
    import subprocess

    # Output file that collects the log of every repo; replace <Path:> with your own directory
    output_csv = "<Path:>/Gitlog-output.csv"

    # Paths to your locally cloned repos
    repos = ["Code repo 1", "Code repo 2"]

    with open(output_csv, "a", newline="", encoding="utf-8") as tf:
        writer = csv.writer(tf)
        for index, repo in enumerate(repos):
            # Fetch the remote updates for all branches; git log --all also reads
            # remote-tracking branches, so no merge (git pull) is needed
            subprocess.run(["git", "fetch", "--all"], cwd=repo, check=True)
            # git log --all for the initial log; keep --after=<date> for later runs
            # from a specified date (you can automate/schedule it).
            # %x1f is a separator byte that will not clash with commas in commit subjects.
            log = subprocess.run(
                ["git", "log", "--all", "--after=2021-06-10",
                 "--pretty=format:%h%x1f%an%x1f%ad%x1f%s"],
                cwd=repo, check=True, capture_output=True, encoding="utf-8",
            ).stdout
            # One CSV row per commit: repo, short hash, author, date, subject.
            # A repo with no commits in the range simply writes nothing.
            for line in filter(None, log.split("\n")):
                writer.writerow([repo] + line.split("\x1f"))

            print("Finished logging {}".format(repo))
            # To track the number of remaining repos from your list
            print("Remaining Repos: {}".format(len(repos) - index - 1))
            print("#####################################")
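Two parts of the approach above are left to the reader: building the repo list for 225+ repos, and choosing the --after date on each scheduled run. The following is a minimal sketch of both, not part of the original answer. It assumes all clones sit directly under one base directory and that the date of the last run is kept in a small text file. The helper names (find_repos, read_after_date, save_run_date) and the state file are hypothetical.

```python
import datetime
import pathlib


def find_repos(base_dir):
    """Return sorted paths of immediate subdirectories that contain a .git entry."""
    base = pathlib.Path(base_dir)
    return sorted(str(p) for p in base.iterdir() if (p / ".git").exists())


def read_after_date(state_file, default="1970-01-01"):
    """Return the date saved by the previous run, or `default` on the first run."""
    path = pathlib.Path(state_file)
    if path.exists():
        return path.read_text(encoding="utf-8").strip()
    return default


def save_run_date(state_file, today=None):
    """Save a date as YYYY-MM-DD so the next run can pass it to --after."""
    today = today or datetime.date.today()
    pathlib.Path(state_file).write_text(today.isoformat(), encoding="utf-8")
```

A scheduled job (cron, Task Scheduler, or similar) could call find_repos once to build the list, pass "--after=" + read_after_date(...) to git log, and call save_run_date(...) after a successful run. Commits close to the boundary date may be picked up twice across runs, so deduplicating the CSV on the commit hash is a reasonable safeguard.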