google-analytics, google-analytics-api

Core Reporting API v3 - Data sampled from specific date, but not before that date


I have a Google Analytics account, with a view that was created on 2015-07-29.

When I make a request to the Core Reporting API with 2015-07-29 as the start-date:

https://www.googleapis.com/analytics/v3/data/ga?ids=<my-ga-id>&dimensions=ga:medium,ga:year,ga:month,ga:channelGrouping&metrics=ga:transactions&start-date=2015-07-29&end-date=2017-03-30&max-results=10000

I get the following response:

{
...
  "containsSampledData": true,
  "sampleSize": "498617",
  "sampleSpace": "1022430",
...
}

Which makes perfect sense - it is sampling the data, because of the number of sessions.
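The sampling fields in that response can be turned into a sampling fraction. The following is a minimal sketch that assumes the response body has already been parsed into a dict; the helper name `sampling_fraction` is made up for illustration and is not part of any Google client library.

```python
def sampling_fraction(response):
    """Return sampleSize / sampleSpace, or None if the response is unsampled.

    `response` is assumed to be the parsed JSON body of a Core Reporting
    API v3 reply. sampleSize and sampleSpace arrive as strings, so they
    are converted to int before dividing.
    """
    if not response.get("containsSampledData", False):
        return None
    return int(response["sampleSize"]) / int(response["sampleSpace"])


# Values copied from the two responses shown in the question.
sampled = {
    "containsSampledData": True,
    "sampleSize": "498617",
    "sampleSpace": "1022430",
}
unsampled = {"containsSampledData": False}

print(sampling_fraction(sampled))
print(sampling_fraction(unsampled))
```

For the numbers above, the fraction comes out a little under one half, so the result is based on slightly less than half of the sessions in the date range.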

However, if I change my request to the Core Reporting API so that 2015-07-28 is now the start-date:

https://www.googleapis.com/analytics/v3/data/ga?ids=<my-ga-id>&dimensions=ga:medium,ga:year,ga:month,ga:channelGrouping&metrics=ga:transactions&start-date=2015-07-28&end-date=2017-03-30&max-results=10000

I get the following response:

{
...
   "containsSampledData": false
...
}

The data is no longer sampled, and yields the correct values (compared to Google Analytics Web UI).

If I then add the metric ga:sessions to the request with start-date=2015-07-28, I get sampled data again.
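The three requests described above differ only in start-date and in the metric list, which makes them easy to generate side by side. This is a sketch only: `build_query_url` is a hypothetical helper, `ga:12345` is a placeholder view ID rather than a real one, and the authorization an actual request needs is left out.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/analytics/v3/data/ga"


def build_query_url(view_id, start_date, end_date, metrics=("ga:transactions",)):
    """Build a Core Reporting API v3 query URL with the question's dimensions."""
    params = {
        "ids": view_id,
        "dimensions": "ga:medium,ga:year,ga:month,ga:channelGrouping",
        "metrics": ",".join(metrics),
        "start-date": start_date,
        "end-date": end_date,
        "max-results": 10000,
    }
    # Keep ':' and ',' unescaped so the URL reads like the ones in the question.
    return BASE_URL + "?" + urlencode(params, safe=":,")


VIEW_ID = "ga:12345"  # placeholder, not a real view ID

# 1. Start on the day the view was created (reported as sampled).
on_creation = build_query_url(VIEW_ID, "2015-07-29", "2017-03-30")
# 2. Start one day earlier (reported as unsampled).
day_before = build_query_url(VIEW_ID, "2015-07-28", "2017-03-30")
# 3. One day earlier, with ga:sessions added (reported as sampled again).
with_sessions = build_query_url(
    VIEW_ID, "2015-07-28", "2017-03-30",
    metrics=("ga:transactions", "ga:sessions"),
)

for url in (on_creation, day_before, with_sessions):
    print(url)
```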

My question is:

Why is the data sampled when the start-date is on or after the date the Google Analytics view was created, but not sampled when the start-date is before that date? And why is it sampled again as soon as I add the metric ga:sessions?


Solution

  • In data analysis, sampling is the practice of analyzing a subset of all the data in order to uncover the meaningful information in the larger data set. For example, during an election cycle, you hear lots of news about what percent of voters prefer one candidate over another, or are for or against a certain issue. Because there can be tens to hundreds of millions of voters in an election, and because the companies conducting the surveys want to get their information out to the public as soon as possible, trying to question every voter for every new survey would be extraordinarily expensive and take too much time. To solve those problems, surveyors use what they conclude is a representative sample of the overall voter population, often just 1000 voters from the millions who are eligible.

    Basically, data is sampled when the amount of data returned is too large. How Google determines when a request should be sampled is something only Google can answer. I believe this question is primarily opinion-based, and this is my opinion.

    Google estimates the number of rows returned by your request and divides it by the number of days in the request, giving you Y. If Y is greater than some threshold X, they sample. By adding a date before you actually started recording any data, you are tricking the system into reducing the size of Y and thereby avoiding sampling.

    Again, this is a wild guess on my part. I may test it; it sounds like a fun way to trick the system.
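The guess above can be written down as a small sketch to show the mechanics. Everything here is an assumption made for illustration: the function names, the estimated row count, and the threshold X are invented, and nothing in this block describes how Google actually decides to sample.

```python
from datetime import date


def days_in_range(start, end):
    """Number of days in the request, counting both endpoints."""
    return (end - start).days + 1


def rows_per_day(estimated_rows, start, end):
    """Y in the guess above: estimated rows divided by days in the request."""
    return estimated_rows / days_in_range(start, end)


def would_sample(estimated_rows, start, end, threshold):
    """Hypothetical rule: sample when Y exceeds the threshold X."""
    return rows_per_day(estimated_rows, start, end) > threshold


END = date(2017, 3, 30)
VIEW_CREATED = date(2015, 7, 29)
DAY_BEFORE = date(2015, 7, 28)

ESTIMATED_ROWS = 611_000  # made-up estimate, not taken from any real response

y_on_creation = rows_per_day(ESTIMATED_ROWS, VIEW_CREATED, END)
y_day_before = rows_per_day(ESTIMATED_ROWS, DAY_BEFORE, END)

# Pick a made-up X that happens to fall between the two Y values.
THRESHOLD = (y_on_creation + y_day_before) / 2

print(would_sample(ESTIMATED_ROWS, VIEW_CREATED, END, THRESHOLD))
print(would_sample(ESTIMATED_ROWS, DAY_BEFORE, END, THRESHOLD))
```

Under these assumptions the extra empty day adds nothing to the row estimate but increases the day count, so Y drops slightly. That only changes the outcome if X happens to sit in the narrow gap between the two values.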