google-analytics, google-analytics-api

Google Analytics - Sampled Data presents more sessions than API query


I'm working on automating a Google Analytics report using the Core Reporting API V3.

When I request the data for a query that contains a segment I have previously defined, the following happens:

Metrics such as Sessions, Users and Pageviews returned by the API query are higher than the ones shown in the Google Analytics reports. I noticed that the reports presented by GA say that they are sampled. This confuses me, since I would expect sampling to produce lower numbers than a full count.

How does this make any sense? (Metrics in the non-sampled report being higher than the ones in the sampled report.)


Solution

  • Sampling just means that the data is less accurate: it is equally likely to be greater or less than the true value.

    By way of example, suppose that I work in a company with exactly 10,000 employees. The big cheeses want to perform a very detailed survey of their workforce, to make sure that everybody's happy, but think that losing 10,000 hours of work time just isn't OK. Instead, they randomly select 1,000 staff members. So long as the selection is truly random, that should be a representative sample, meaning that the gender balance, ethnicity, percentage with kids, average commute time etc. of this group will be roughly the same as the workforce as a whole.

    Similarly, if you ask Google Analytics to run a report that requires a lot of aggregation, it might decide to look at only half your data. Even the simplest requests often require a lot of computation; from their perspective it's much cheaper to randomly select only 40% or 50% of the sessions in that period, and scale the results up.

    They multiply the results afterwards to compensate, so the results that you see will be approximately equal to the true value. The biggest variation will come in things that don't happen very often; suppose you had an event for 'someone just spent £1,000' that's likely to take place once a year. If it happens to land in a 50% sample, scaling up makes it look like it happens twice a year. If it doesn't land in the sample, Google will think it never happens.
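
To see why a scaled-up sample can land above the true value as easily as below it, here is a minimal simulation sketch. It is not how Google Analytics samples internally; the data (10,000 made-up sessions with 1 to 5 pageviews each), the 50% rate and the helper name `sampled_estimate` are all assumptions for illustration.

```python
import random


def sampled_estimate(session_values, rate, rng):
    """Estimate the total from a random sample, scaled up by 1/rate.

    Hypothetical helper: each session is kept with probability `rate`.
    """
    sample = [v for v in session_values if rng.random() < rate]
    return sum(sample) / rate


rng = random.Random(42)  # fixed seed so the run is repeatable

# Made-up data: 10,000 sessions, each with 1-5 pageviews.
pageviews = [rng.randint(1, 5) for _ in range(10_000)]
true_total = sum(pageviews)

# Repeat the 50% sampling 200 times and compare each estimate to the truth.
estimates = [sampled_estimate(pageviews, 0.5, rng) for _ in range(200)]
above = sum(e > true_total for e in estimates)
below = sum(e < true_total for e in estimates)

print("true total:", true_total)
print("estimates above the true total:", above)
print("estimates below the true total:", below)
```

Some of the estimates come out above the true total and some below, which is the situation in the question: a sampled report can show either more or fewer sessions than the unsampled figure.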

    If you're facing heavy sampling, there are several ways to avoid it. I recommend the following:

    • Avoid the Users metric; it's one of the most time consuming to calculate.
    • Keep your time periods short.
    • Avoid using complicated segments.
    • Try not to use too many dimensions at once.
    • Try not to have so many hits! Do you have a ton of superfluous events? Are you using the same code on more than one site? Overusing Virtual Page Views?
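
As a rough sketch of the "keep your time periods short" advice, one option is to split a long date range into smaller chunks and issue one query per chunk. The helper name `split_date_range` and the seven-day default are assumptions for illustration; the actual API call is left out.

```python
from datetime import date, timedelta


def split_date_range(start, end, days_per_chunk=7):
    """Split [start, end] into consecutive (start, end) pairs of ISO date strings."""
    chunks = []
    current = start
    while current <= end:
        # Each chunk covers days_per_chunk days, except possibly the last one.
        chunk_end = min(current + timedelta(days=days_per_chunk - 1), end)
        chunks.append((current.isoformat(), chunk_end.isoformat()))
        current = chunk_end + timedelta(days=1)
    return chunks


for chunk_start, chunk_end in split_date_range(date(2015, 1, 1), date(2015, 1, 20)):
    # Each pair would be used as the start and end date of one query.
    print(chunk_start, chunk_end)
```

Additive metrics such as Sessions and Pageviews can be summed across the chunks. Users cannot, because the same user can appear in more than one chunk.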

    If you have Google Analytics Premium, you can request Unsampled reports, although you should watch out for the exported totals given for the Users metric; they still screw this up.

    Sampling can happen at any rate; in extreme situations they might cut you down to less than 1% of sessions. You should take any sampled stats with a pinch of salt, but also understand that they know what they're doing. If your sample size is 50% or more, you're fine. Any less than 40% and you should start to be worried. If you're getting less than about 1% you're really stretching Google Analytics beyond its breaking point, so don't be surprised if it's not doing its best to help you.
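
To check how heavily a given API result was sampled, something like the sketch below may help. It assumes the Core Reporting API v3 response has been parsed into a dict and carries the `containsSampledData`, `sampleSize` and `sampleSpace` fields; verify that against the response you actually get. The function names and the "borderline" label for the 40% to 50% gap are my own additions.

```python
def sample_rate(response):
    """Return the fraction of sessions used for the report (1.0 if unsampled)."""
    if not response.get("containsSampledData"):
        return 1.0
    # Assumption: both fields arrive as numeric strings.
    return int(response["sampleSize"]) / int(response["sampleSpace"])


def describe_sample_rate(rate):
    """Map a sample rate onto the rough thresholds given above."""
    if rate >= 0.5:
        return "fine"
    if rate >= 0.4:
        return "borderline"  # not covered by the thresholds above; an assumption
    if rate >= 0.01:
        return "worrying"
    return "extreme"


# Made-up response fragment for illustration.
response = {
    "containsSampledData": True,
    "sampleSize": "250000",
    "sampleSpace": "1000000",
}
rate = sample_rate(response)
print(rate, describe_sample_rate(rate))
```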