Search code examples
google-bigquery

getting trimmed mean with no categorization in big query


I have a query result named 'uniqueClientsWithLifeTime' that includes a clientid and their average lifetime in months, no categories

Iḿ trying to get the trimmed mean (removing outliers) for the lifetime column

this is my query so far

#upper bound and lower bound
WITH
bounds AS (
  SELECT
    PERCENTILE_CONT(ClientLifeTime, .25) OVER() AS lower_bound,
    PERCENTILE_CONT(ClientLifeTime, .75) OVER() AS upper_bound
  FROM 
    uniqueClientsWithLifeTime
)

SELECT
  AVG(ClientLifeTime) as trimmed_mean,
FROM 
  uniqueClientsWithLifeTime
WHERE 
  ClientLifeTime > (SELECT bounds.lower_bound FROM bounds)
  AND 
  ClientLifeTime < (SELECT bounds.upper_bound FROM bounds) 

but I get an error that says:

Scalar subquery produced more than one element and If I do SELECT bounds.lower_bound FROM bounds or the upper_bound, I do get a value although Iḿ not sure if it is the correct value.

Things I've looked into:

  1. I check this article on how to do this but it seems to provide a solution to if you have data that can be splitted on categories although Iḿ not sure how to implement the function there in BigQuery https://tech.aaronteoh.com/bigquery-parameter-estimation/#trimmed_mean

  2. thereś also another function called approx_quantiles which seems to be a more recent approach on how to get percentiles, but Iḿ not sure how to implement it https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#approx_quantiles

Any help is appreciated


Solution

  • The problem with current solution is that windowing function is not the same as aggregate function - former is not reducing ("aggregating") number of rows, while the latter does. Windowing function might serve same purpose, but it won't change number of rows at all...

    It's pretty fundamental concept for sql-based platforms, so I'd really recommend to go back to the basics on this one: https://www.google.com/search?q=window+functions+and+aggregate+functions

    Solution-wise your second idea is the right one:

    #upper bound and lower bound
    WITH
    bounds AS (
      SELECT
        APPROX_QUANTILES(ClientLifeTime, 100)[OFFSET(25)] AS lower_bound,
        APPROX_QUANTILES(ClientLifeTime, 100)[OFFSET(75)] AS upper_bound
      FROM 
        uniqueClientsWithLifeTime
    )
    
    SELECT
      AVG(ClientLifeTime) as trimmed_mean,
    FROM 
      uniqueClientsWithLifeTime
    WHERE 
      ClientLifeTime > (SELECT bounds.lower_bound FROM bounds)
      AND 
      ClientLifeTime < (SELECT bounds.upper_bound FROM bounds) 
    

    100 will translate quantiles into percentile (for 100 quantiles 1 quantile = 1 percentile).

    more details about the function you can find here.