Problem
I have created a tool which runs a code quality analysis tool against all repositories in a shared GitLab group namespace and outputs the results to a static site.
The tool is structured as follows:
git clone each project into a temp directory, run the code quality analysis tool, store results in an SQLite database, and delete the temp directory.
However, Step 2 takes a large amount of time, and I would like to make it more efficient.
Solution?
My initial thought is to run the tool once to create a database. Subsequent runs would identify only files which have changed, run the code quality analysis tool against these files, and then update the existing database. This would cut down a significant amount of time as only a relatively small proportion of files will change between each run.
I believe I can use hashing to create a hash for each file in each repository, and any changes to the files would change the hash and mark the file for reassessment. I have no experience using hashing directly.
Question(s)
Your approach seems reasonable and solves your problem very well.
Adding my 2 cents, which should give you a way forward.
To answer your first question about hashing: yes, a hashing approach is well suited to your use case, because it does not depend on any external tool or database to detect changes.
Hash the bytes of every file in each project and save the resulting snapshot (a report of the hashes) in a file, a cache, or any other reliable storage, as your project requires.
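As a rough illustration of the hashing step, here is a minimal sketch using only the Python standard library. The function names, the choice of SHA-256, and the JSON snapshot file are my own assumptions for the example, not something your tool has to follow.

```python
import hashlib
import json
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(repo_dir: Path) -> dict[str, str]:
    """Map each file path (relative to repo_dir) to its content hash."""
    hashes = {}
    for path in sorted(repo_dir.rglob("*")):
        relative = path.relative_to(repo_dir)
        # Skip directories and Git's own metadata folder.
        if path.is_file() and ".git" not in relative.parts:
            hashes[relative.as_posix()] = hash_file(path)
    return hashes


def save_snapshot(hashes: dict[str, str], out_file: Path) -> None:
    """Write the snapshot as JSON (hypothetical storage format)."""
    out_file.write_text(json.dumps(hashes, indent=2, sort_keys=True))


def load_snapshot(in_file: Path) -> dict[str, str]:
    """Return the stored snapshot, or an empty one on the first run."""
    if not in_file.exists():
        return {}
    return json.loads(in_file.read_text())
```

Since you already keep results in a database, the same path-to-hash pairs could probably live in a table there instead of a separate JSON file.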
Here are the two main approaches you can take:
DIY: Hash the bytes of each file and save the report in a file or other storage. Run this program as a cron job at some interval: recompute the hashes and process only those files whose hashes have changed.
Automated, using existing tooling: Ideally, use a built-in feature such as GitHub/GitLab webhooks, which lets you trigger a program (the hashing logic in our case) that does the work for you.
Approach 2 is slightly better, as it is a push-based mechanism rather than the pull-based solution of approach 1.
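To make the two approaches a bit more concrete, here is a hedged sketch of the "what changed?" step for each. The function names are hypothetical. For approach 2, I am assuming a GitLab push event payload whose `commits` entries carry `added`, `modified` and `removed` path lists; check the GitLab webhook documentation for the exact shape and for any limit on how many commits a payload includes.

```python
def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Approach 1 (pull based): compare two {path: hash} snapshots."""
    return {
        # Present in both runs, but the content hash differs.
        "changed": sorted(p for p in new if p in old and new[p] != old[p]),
        # Only in the new run: needs a first analysis.
        "added": sorted(p for p in new if p not in old),
        # Only in the old run: stale results can be deleted.
        "removed": sorted(p for p in old if p not in new),
    }


def paths_from_push_event(payload: dict) -> list[str]:
    """Approach 2 (push based): collect every path touched by a push.

    Assumes the payload is the already-parsed JSON body of a push webhook.
    """
    touched = set()
    for commit in payload.get("commits", []):
        for key in ("added", "modified", "removed"):
            touched.update(commit.get(key, []))
    return sorted(touched)
```

In both cases the caller would re-run the analysis only on the returned paths that still exist in the checkout, and drop stored results for the ones that do not. Even with webhooks, keeping the hash comparison around as a fallback seems sensible in case an event is missed.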
These are just recipes for solving your problem; the exact implementation will depend very much on your requirements.
Hope this helps.