Tags: sql, git, hash, hashmap

How can I determine which files have changed across multiple git repositories?


Problem

I have created a tool which runs a code quality analysis tool against all repositories in a shared GitLab group namespace, and outputs the results to a static site.

The tool is structured as follows:

  1. Connect to the shared GitLab namespace and retrieve a list of projects.
  2. Loop through each project. git clone each project into a temp directory, run the code quality analysis tool, store results in an sqlite database, and delete the temp directory.
  3. Deploy results to static site.

However, Step 2 takes a large amount of time, and I would like to make it more efficient.

Solution?

My initial thought is to run the tool once to create a database. Subsequent runs would identify only files which have changed, run the code quality analysis tool against these files, and then update the existing database. This would cut down a significant amount of time as only a relatively small proportion of files will change between each run.

I believe I can use hashing to create a hash for each file in each repository, and any changes to the files would change the hash and mark the file for reassessment. I have no experience using hashing directly.

Question(s)

  • Is hashing the best method to use to identify file changes across multiple project repositories?
  • Are there any inbuilt methods in git or GitLab which can perform this function without my needing to manually build a table of hashes?

Solution

  • Your approach is reasonable and should solve your problem well.

    Adding my two cents to give you a way forward.

    To answer your first question: yes, hashing is well suited to your use case, because it adds no dependency on external tools or databases.

    You can hash the bytes of each file across the projects and keep the resulting snapshot of hashes in a file, a cache, or whatever reliable storage your project requires.

    Here are the two main approaches you can take:

    1. DIY: Hash each file's bytes and save the result in a file or other storage. Run this program as a cron job at some interval: recompute the hashes and re-analyse only the files whose hashes have changed.

    2. Automated, using existing implementations: the ideal way would be to use built-in tooling such as GitHub/GitLab webhooks, which let you trigger a program (your hashing or re-analysis logic, in our case) whenever a repository changes.

    Approach 2 is slightly better, because it is a push-based mechanism rather than the pull-based polling of approach 1.
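    Approach 2 could look like the sketch below: a minimal push-event receiver using only the standard library. The payload shape (a list of commits, each with added/modified/removed file lists) follows GitLab's push-event format; the port number and the reanalyze hook are placeholder assumptions:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def touched_files(event: dict) -> set:
    """Collect every file added or modified across the commits in a
    GitLab push-event payload."""
    touched = set()
    for commit in event.get("commits", []):
        touched.update(commit.get("added", []))
        touched.update(commit.get("modified", []))
    return touched

def reanalyze(project: str, files: set) -> None:
    # Placeholder: run the quality tool on just these files and
    # update the existing sqlite database.
    print(f"{project}: re-analysing {sorted(files)}")

class GitLabPushHandler(BaseHTTPRequestHandler):
    """Minimal receiver for GitLab push-event webhooks."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        project = event.get("project", {}).get("path_with_namespace", "?")
        reanalyze(project, touched_files(event))
        self.send_response(200)
        self.end_headers()

# To serve: HTTPServer(("", 8080), GitLabPushHandler).serve_forever()
```

    The webhook tells you exactly which files changed, so no hashing is needed on this path at all.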

    These are just recipes for solving your problem; the exact implementation will depend on your requirements.

    Hope this helps.