I am trying to fetch a list of all public repositories from GitHub to do some analysis on them. I started the job with their v3 API, which is RESTful, and then, when I needed more info like star counts, migrated from v3 to v4, which is exposed as GraphQL. I am now requesting 100 records per call and repeating the request until all records are fetched.
The problem is with pagination. To make pagination work, I have to read the endCursor of each response and pass that value as the after argument of the next request. The problem is that the data is not paginated properly. For example:
The query that I am sending (from a Node.js app) is as follows:
{
  search(query: "is:public", type: REPOSITORY, first: 100, after: "Y3Vyc29yOjEwMA==") {
    repositoryCount
    userCount
    wikiCount
    pageInfo {
      startCursor
      endCursor
      hasNextPage
      hasPreviousPage
    }
    edges {
      node {
        ... on Repository {
          databaseId
          id
          name
          description
          forkCount
          isFork
          issues {
            totalCount
          }
          labels(first: 100) {
            nodes {
              name
            }
          }
          languages(first: 100) {
            nodes {
              name
            }
          }
          licenseInfo {
            name
          }
          nameWithOwner
          primaryLanguage {
            name
          }
          pullRequests {
            totalCount
          }
          watchers {
            totalCount
          }
          stargazers {
            totalCount
          }
        }
      }
    }
  }
}
As I said above, on the first request I omit the after argument from the search inputs, and from then on I use the endCursor of the previous response as the after argument of the next request.
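The loop I am running looks roughly like this. It is a minimal sketch: `runQuery` is a hypothetical helper that sends the GraphQL search query above with the given cursor (omitting `after` when the cursor is `null`) and resolves to the `search` object of the response.

```javascript
// Minimal sketch of the cursor-based pagination loop.
// `runQuery(after)` is a hypothetical helper: it sends the search query
// above (without the `after` argument when the cursor is null) and
// resolves to the `search` object of the response.
async function fetchAllRepositories(runQuery) {
  const repositories = [];
  let after = null;       // first request: no `after` argument
  let hasNextPage = true;
  while (hasNextPage) {
    const search = await runQuery(after);
    for (const edge of search.edges) {
      repositories.push(edge.node);
    }
    // Feed the endCursor of this response into the next request.
    after = search.pageInfo.endCursor;
    hasNextPage = search.pageInfo.hasNextPage;
  }
  return repositories;
}
```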
Am I misunderstanding the cursor's purpose and usage, or is this a bug (intended or not) on GitHub's side?
Fortunately, I have found a way that works for now. Many thanks to @Daniel Rearden for his very helpful tip. I tested many query strings and found that if I restrict the search to a specific creation date, the data is sorted by that field; in my tests the order then stays consistent and the cursor becomes meaningful.
The query is now this:
{
  search(query: "created:2008-02-08 is:public", type: REPOSITORY, first: 100) {
    repositoryCount
    userCount
    wikiCount
    pageInfo {
      startCursor
      endCursor
      hasNextPage
      hasPreviousPage
    }
    edges {
      node {
        ... on Repository {
          databaseId
          id
          name
          description
          forkCount
          isFork
          issues {
            totalCount
          }
          labels(first: 100) {
            nodes {
              name
            }
          }
          languages(first: 100) {
            nodes {
              name
            }
          }
          licenseInfo {
            name
          }
          nameWithOwner
          primaryLanguage {
            name
          }
          pullRequests {
            totalCount
          }
          watchers {
            totalCount
          }
          stargazers {
            totalCount
          }
          createdAt
          updatedAt
          diskUsage
        }
      }
    }
  }
}
Now the only thing left is to scroll over the days and, for each day, repeat this query as long as pageInfo.hasNextPage is true.
I have not yet tested this for all ~4000 days, so I cannot verify that the fetched result covers all the data in their DB, but it seems to be the best solution so far.
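The day-scrolling part can be sketched as below. This is a sketch under my own assumptions: `eachDay` and `searchStringFor` are hypothetical helpers I named for illustration, and the per-day fetch (the pagination loop) is left out.

```javascript
// Hypothetical sketch of scrolling over days: yield every ISO date between
// two days, then build a `created:<day> is:public` search string for each
// one. For each day, the search query is repeated (with `after`) as long as
// pageInfo.hasNextPage is true.
function* eachDay(fromISO, toISO) {
  const day = new Date(fromISO + 'T00:00:00Z');
  const end = new Date(toISO + 'T00:00:00Z');
  while (day <= end) {
    yield day.toISOString().slice(0, 10); // e.g. "2008-02-08"
    day.setUTCDate(day.getUTCDate() + 1);
  }
}

// One search string per day, to be plugged into the query above.
function searchStringFor(day) {
  return `created:${day} is:public`;
}
```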