passwords scanning sensitive-data

How to search all public GitHub repositories to check whether anyone has copied my company's code or important data into their own public repository


I am new to searching code and data on GitHub. My goal is to scan all public repositories on GitHub (and other Git hosts), from A to Z, to make sure that no one has copied my company's source code or sensitive data.

These are the challenges I am thinking about:

  1. How to get a list of all public repositories on GitHub, from A to Z.
  2. How to scan for my data across what may be millions of repositories.
  3. Whether there is a way to scan a directory of words with a script or code.

Please give me a guide for this.

Thanks a lot in advance for your quick help!

Abhishek


Solution

  • Welcome to StackOverflow!

    Your best bet is to use GitHub's search API to find the code you are interested in. For example, by using GitHub's search (through the website, not the API) for my domain name, I was able to find code that I had committed.
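
    As a rough illustration of the API route, here is a minimal Python sketch that queries GitHub's REST code search endpoint (`GET /search/code`) for an exact phrase. It assumes you have a personal access token, since code search requires authentication. The function names and the example phrase are made up for illustration, and the sketch only fetches the first page of results.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.github.com/search/code"


def build_search_request(phrase, token, page=1):
    """Build (but do not send) a code-search request for an exact phrase."""
    # Wrapping the phrase in double quotes asks for an exact match.
    query = urllib.parse.urlencode(
        {"q": f'"{phrase}"', "per_page": 100, "page": page}
    )
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def extract_hits(payload):
    """Reduce a decoded search response to (repository, path, url) tuples."""
    return [
        (item["repository"]["full_name"], item["path"], item["html_url"])
        for item in payload.get("items", [])
    ]


def search_code(phrase, token):
    """Send the request and return the hits from the first page of results."""
    with urllib.request.urlopen(build_search_request(phrase, token)) as resp:
        return extract_hits(json.load(resp))
```

    Calling `search_code("internal.example.com", token)` with your own token would return a list of matching files. Code search is rate limited and caps the number of results it returns per query, so a real bot would need paging, back-off, and several narrow queries rather than one broad one.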

    However, keep in mind that this won't solve your problem of making sure no one has copied your source code. There are countless Git hosting services: GitHub, GitLab, and Bitbucket, to name just a few. You also have to contend with private repositories, which no search can reach. It is impossible to search everything. A better approach is to put safeguards in place that prevent a leak from happening: enforce strict access controls, and make sure your employees and any vendors you work with understand and agree to company policy on data.

    Finally, having a good responsible disclosure program will encourage white-hat hackers to inform you of any breaches.

    Now, with all that in mind, I still think creating a small bot to search the popular places such as GitHub is not a bad idea. Another thing you could do is create a canary: an object whose sole job is to be uniquely identifiable, so that if there is a breach, your search can find it easily.

    A canary can be a unique row in a database, a specific file containing unique text, and so on. You search for that text regularly, and if it turns up, you know there was a breach.
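
    The canary idea can be sketched in a few lines of Python. This is only one possible approach, not a standard recipe: it derives each canary string from a private secret and a label using HMAC, so the string looks random to outsiders but can be regenerated whenever you need it. The names `make_canary` and `find_canaries`, the `canary-` prefix, and the labels are all assumptions made for the example.

```python
import hashlib
import hmac


def make_canary(secret: bytes, label: str) -> str:
    """Derive a unique, reproducible marker string for one planted location."""
    digest = hmac.new(secret, label.encode("utf-8"), hashlib.sha256).hexdigest()
    # 32 hex characters are long enough to make accidental matches implausible.
    return f"canary-{digest[:32]}"


def find_canaries(text: str, canaries: dict) -> list:
    """Return the labels of every known canary that appears in the text."""
    return [label for label, token in canaries.items() if token in text]


# Hypothetical setup: one canary per place where you plant it.
secret = b"keep-this-out-of-the-repository"
canaries = {
    label: make_canary(secret, label)
    for label in ("billing-db-row", "config-repo-readme")
}
```

    You would plant each generated string in its location, then feed the strings to your regular search. Because every location gets its own canary, a hit also tells you which system the leaked copy came from.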