github graphql github-api

How can I search a GitHub org for repos that contain a particular file (and sort the repos returned by the created date of that file)?


I want to search a GitHub org for repos that contain a particular file and return the repos sorted by the created date of that file.

An alternative would be to search GitHub for repos that contain a particular string - and then again return the repos sorted by the created date of the file containing that string.

I have tried GitHub search, GitHub advanced search, and GraphiQL. I have also tried asking ChatGPT, but I can't seem to get it working.

This is the closest I got in GraphiQL. It is supposed to return the text of the file if it finds the file and null if it doesn't. But this query just returns null for every repo.

{
  organization(login: "MyOrg") {
    repositories(first: 100, orderBy: {field: CREATED_AT, direction: DESC}) {
      nodes {
        name
        createdAt
        hasMkdocsYml: object(expression: "master:xyz.yaml") {
          ... on Blob {
            text
          }
        }
      }
    }
  }
}

Solution

  • I can only think of a way to do it in two passes: first find the matching files with gh search code (that's the legacy search API), then iterate over the results and retrieve each file's creation date from the oldest commit for that file path (commits API).

    Something like this:

    filename=somefile.md
    owner=username
    
    gh search code --owner "$owner" --filename "$filename" --json repository,path \
        --jq 'map([.repository.nameWithOwner, .path])[] | @tsv' \
        | while IFS=$'\t' read -r repo path; do
            repo=$repo gh api -X GET "repos/$repo/commits" -f path="$path" \
                --jq 'last | {repo: $ENV.repo, date: .commit.author.date}'
        done \
        | jq -n '[inputs] | sort_by(.date)'
    

    This produces a list of objects that looks like

    [
      {
        "date": "2023-12-22T11:01:53Z",
        "repo": "owner/repo1"
      },
      {
        "date": "2024-01-08T14:09:37Z",
        "repo": "owner/repo2"
      }
    ]
    

    sorted from oldest to newest.

    • gh search code also returns filenames that contain the provided one, e.g., longnamesomefile.md would also match somefile.md

    • The /commits endpoint returns 30 commits per page by default, so if more than 30 commits changed the file, "last" is not the oldest commit; you can increase the page size on that call with -f per_page=50 (maximum 100)

    • If there are more than 100 commits, you have to retrieve multiple pages, and it gets considerably more complicated

    • This probably doesn't handle file renames

    • If you want nothing but the repo names, you can modify the final jq command to something like

      jq -rn '[inputs] | sort_by(.date) | map(.repo)[]'
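
    The first three caveats above (substring filename matches and files with more commits than fit on one page) could be handled with two small helpers. This is only a sketch: the function names is_exact_filename and oldest_commit_date are made up for illustration, and it assumes that gh api --paginate applies the --jq filter to each page in turn and that the commits endpoint lists commits newest first, so the last line printed is the oldest commit.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds only if the last path component equals the
# wanted file name, so "docs/longnamesomefile.md" is rejected when looking
# for "somefile.md".
is_exact_filename() {
    local path="$1" filename="$2"
    [ "${path##*/}" = "$filename" ]
}

# Hypothetical helper: prints the author date of the oldest commit that
# touched path $2 in repo $1 (owner/name).
# --paginate walks through every page of results; the --jq filter emits one
# date per commit, newest first, so the last line is the oldest commit.
oldest_commit_date() {
    local repo="$1" path="$2"
    gh api -X GET "repos/$repo/commits" --paginate \
        -f path="$path" -f per_page=100 \
        --jq '.[].commit.author.date' | tail -n 1
}
```

    Inside the while loop, one could then skip inexact matches with is_exact_filename "$path" "$filename" || continue and build each output object from the date printed by oldest_commit_date "$repo" "$path". File renames are still not handled.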