Tags: github, powerbi, powerquery, github-api

Workaround for GitHub Rate Limit on Power BI Custom Query


I am a Power BI developer. My company has many organizations under its name, and each organization has many repos (around 300+).

I was able to write 4 tables in Power Query (Power BI). Table 1 gets all the organization and repo names.

I use Table 1 as a reference for the next 3 tables so I don't call the same code twice to get the repos.

Table 2 gets issues of all states for each repo, Table 3 gets pulls of all states for each repo, and Table 4 gets commits for each repo.

However, the data never fully loads because I always hit the 5,000-requests-per-hour rate limit, even though I limited the data to the last 2 months only.

Is there a workaround, like maybe a custom connector or a different approach? I am not abusing the API for web scraping; I want to analyse the performance of our developers. I prefer to keep things free.

I tried limiting dates, and I tried reading up on GraphQL, but I have no idea how to translate it to Power Query, and I am no computer science guru. I only know Python and Power Query M as programming languages.


Solution

  • There's no way to bypass the API rate limits, because if there were, people would simply use it and the limits would be pointless. GitHub, like every other major website, rate-limits its API to prevent excessive usage that might lead to an outage.

    I'm not familiar with Power BI and how it stores data, but you'll need to spread these operations over multiple hours because you're asking for a lot of data. You can first request the organization and repo names and store those. If you need to update them (or other data) later, use a conditional request, which lets you determine cheaply whether anything has changed without spending many requests (see the first sketch at the end of this answer).

    Once you have the repositories, you can get the issues and pull requests, but you should first decide why you need all of that data. Maybe you only need some of it and can request just those items through an appropriate API endpoint at the largest page size, which takes far fewer requests (see the pagination sketch below). Again, conditional requests are your friend if you ever need to refresh this data.

    In general, asking for all the commits of a reasonably active repo is a bad idea because there are going to be a lot of them. The git/git repository, which is smaller than many corporate repositories, has over 73,000 commits, and it's simply impractical to crawl all of those in any meaningful way via the API. If you really need information from the commits, clone the repositories and extract the information locally (see the clone sketch below); that will be substantially faster and more efficient than querying it via the API. If you only need information from a recent slice of commits, you can use a shallow clone, which is much cheaper as long as you don't fetch into it.

    At each stage, you'll need to save the data you've already got, honouring the rate limit (see the last sketch below), and not throw that data away, or you'll never manage to fetch all of it.
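
The sketches below use Python rather than M (the question mentions knowing Python), and every token, organization/repo name, and date in them is a placeholder rather than something taken from the question. First, a minimal conditional request against the REST API: keep the ETag you got with the data, send it back as If-None-Match on the next refresh, and a 304 reply means nothing has changed.

```python
import requests

TOKEN = "ghp_..."                                    # placeholder personal access token
URL = "https://api.github.com/repos/my-org/my-repo"  # placeholder repo

headers = {"Authorization": f"token {TOKEN}"}

# First request: save the ETag alongside the data itself.
first = requests.get(URL, headers=headers)
etag = first.headers.get("ETag")
repo_data = first.json()

# Later refresh: send the saved ETag back as If-None-Match.
# A 304 means nothing changed, so the saved copy is still valid
# (and GitHub documents that 304 responses are not charged to the rate limit).
refresh = requests.get(URL, headers={**headers, "If-None-Match": etag})
if refresh.status_code == 304:
    print("Unchanged - keep using the data saved earlier")
else:
    repo_data = refresh.json()
    etag = refresh.headers.get("ETag")
```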
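Next, a pagination sketch: listing issues for one repo with the maximum page size and a date cut-off, so a few hundred issues cost a handful of requests instead of dozens. It uses the REST issues endpoint, which also returns pull requests (they carry a pull_request key); the date is just an example of a "last 2 months" cut-off.

```python
import requests

TOKEN = "ghp_..."        # placeholder
REPO = "my-org/my-repo"  # placeholder
headers = {"Authorization": f"token {TOKEN}"}

# per_page=100 is the maximum, so 250 issues cost 3 requests instead of 9
# at the default page size of 30. "since" limits results to issues updated
# after that timestamp.
params = {"state": "all", "per_page": 100, "since": "2024-01-01T00:00:00Z"}

issues = []
url = f"https://api.github.com/repos/{REPO}/issues"
while url:
    resp = requests.get(url, headers=headers, params=params)
    resp.raise_for_status()
    issues.extend(resp.json())
    url = resp.links.get("next", {}).get("url")  # pagination via the Link header
    params = None  # the "next" URL already includes the query string
print(f"Fetched {len(issues)} issues (pull requests included)")
```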
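For commits, a shallow clone plus git log gets everything locally with two git commands and zero API requests. The cut-off date, repo URL, and output format are examples; the resulting pipe-delimited file can be loaded into Power BI like any other text source.

```python
import subprocess

REPO_URL = "https://github.com/my-org/my-repo.git"  # placeholder

# A shallow clone only downloads history after the given date, which keeps
# the download small; use --depth N instead if you want a fixed commit count.
subprocess.run(
    ["git", "clone", "--shallow-since=2024-01-01", REPO_URL, "my-repo"],
    check=True,
)

# One line per commit: hash, author, ISO date, subject. Adjust the format
# string to whatever your model needs.
log = subprocess.run(
    ["git", "-C", "my-repo", "log", "--pretty=format:%H|%an|%aI|%s"],
    capture_output=True, text=True, check=True,
)
with open("my-repo-commits.txt", "w", encoding="utf-8") as f:
    f.write(log.stdout)
```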
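Finally, a sketch of honouring the rate limit while saving each repo's results as soon as they arrive, so a pause (or a crash) never throws away work already done. It relies on the X-RateLimit-Remaining and X-RateLimit-Reset headers GitHub returns; the repo list is a placeholder.

```python
import json
import time
import requests

TOKEN = "ghp_..."  # placeholder
headers = {"Authorization": f"token {TOKEN}"}

def get_honouring_rate_limit(url, params=None):
    """GET a GitHub API URL, sleeping until the quota resets if it runs out."""
    while True:
        resp = requests.get(url, headers=headers, params=params)
        remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
        if resp.status_code == 403 and remaining == 0:
            reset = int(resp.headers["X-RateLimit-Reset"])  # Unix timestamp
            wait = max(reset - time.time(), 0) + 5
            print(f"Rate limit hit, sleeping {wait:.0f}s")
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp

repos = ["my-org/repo-a", "my-org/repo-b"]  # placeholders
for repo in repos:
    resp = get_honouring_rate_limit(
        f"https://api.github.com/repos/{repo}/issues",
        params={"state": "all", "per_page": 100},
    )
    # Write each result to disk immediately so nothing is lost between runs.
    with open(f"{repo.replace('/', '_')}_issues.json", "w", encoding="utf-8") as f:
        json.dump(resp.json(), f)
```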