The response to each GA request returns a limited number of rows (at most 10,000). If your first request defines a query that matches more than 10,000 rows (say 26,000 for this example), only the first 10,000 rows are returned. You then have to make another request with the same query, specifying that you want the next 10,000 rows starting at row 10,001, and then a third request for the rows starting at row 20,001.
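To make the arithmetic concrete, here is a minimal sketch of how the start index of each request could be worked out. The class and method names (`PagingPlan`, `startIndices`) are made up for illustration and are not part of the GA API or the Pentaho plugin; the only assumption is that row indices are 1-based, as in the example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of any GA or Pentaho library.
class PagingPlan {

    // Returns the 1-based start index of every request needed to read
    // totalRows rows when each response holds at most pageSize rows.
    static List<Integer> startIndices(int totalRows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<Integer> starts = new ArrayList<>();
        for (int start = 1; start <= totalRows; start += pageSize) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 26,000 rows at 10,000 rows per request needs three requests.
        System.out.println(startIndices(26000, 10000)); // prints [1, 10001, 20001]
    }
}
```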
My question is: does the Pentaho Google Analytics plugin do this under the hood? I cannot seem to find any meaningful documentation anywhere on the subject. Thanks in advance for any information you can provide.
According to Google, the default maxResults setting is 1,000. The GA PDI component is open source, so the code is easily accessible. After a quick scan of the Java code, it looks like the component internally uses the default maxResults per request (1,000) and then continues to page over the remaining result set in chunks of 1,000. This is what I had assumed, but it's good to be sure that the component will fetch all of your data even when a result set exceeds 10,000 rows. The only thing I'm not sure about is whether this will play well with Google's quota limit of 10 queries per second (QPS) per IP.
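As a rough illustration of the paging described above, and of one way to stay under a per-second quota, here is a minimal sketch. It is not the plugin's code: `ThrottledPager`, `PageFetcher` and `fetchAll` are hypothetical names, the fetcher stands in for whatever actually issues the GA request, and the loop stops on a short page rather than checking the feed's total result count. Spacing requests at least 100 ms apart keeps this loop at roughly 10 requests per second or fewer, assuming nothing else is querying from the same IP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; PageFetcher stands in for the real GA request.
class ThrottledPager {

    interface PageFetcher {
        // Returns up to maxResults rows starting at the 1-based startIndex.
        List<String> fetch(int startIndex, int maxResults);
    }

    // Reads every page, waiting so that consecutive requests are at least
    // minIntervalMillis apart.
    static List<String> fetchAll(PageFetcher fetcher, int pageSize, long minIntervalMillis) {
        List<String> rows = new ArrayList<>();
        int startIndex = 1;
        long lastRequestAt = 0L;
        boolean first = true;
        while (true) {
            if (!first) {
                long elapsed = System.currentTimeMillis() - lastRequestAt;
                if (elapsed < minIntervalMillis) {
                    try {
                        Thread.sleep(minIntervalMillis - elapsed);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while throttling", e);
                    }
                }
            }
            first = false;
            lastRequestAt = System.currentTimeMillis();
            List<String> page = fetcher.fetch(startIndex, pageSize);
            rows.addAll(page);
            // A page shorter than pageSize is treated as the last page. If the
            // total is an exact multiple of pageSize, one extra empty request is made.
            if (page.size() < pageSize) {
                break;
            }
            startIndex += page.size();
        }
        return rows;
    }

    public static void main(String[] args) {
        // Fake in-memory source with 2,500 rows, standing in for a GA result set.
        final int total = 2500;
        PageFetcher fake = (start, max) -> {
            List<String> page = new ArrayList<>();
            for (int i = start; i < start + max && i <= total; i++) {
                page.add("row" + i);
            }
            return page;
        };
        List<String> all = fetchAll(fake, 1000, 100);
        System.out.println(all.size()); // prints 2500
    }
}
```

With a page size of 1,000, the fake source above is read in three requests (start indices 1, 1001 and 2001).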
GAInputstep.java:
private DataEntry getNextDataEntry() throws KettleException {
    // no query prepared yet?
    if (data.query == null){
        data.query = getQuery();
        // use default max results for now
        //data.query.setMaxResults(10000);
        ...
    }
    // query is there, check whether we hit the last entry and requery as necessary
    else if (data.entryIndex >= data.feed.getEntries().size()){
        if (data.feed.getStartIndex()+data.entryIndex <= data.feed.getTotalResults()){
            // need to query for next page
            data.query.setStartIndex(data.feed.getStartIndex()+data.entryIndex);