I have a query result named 'uniqueClientsWithLifeTime' that includes a clientid and their average lifetime in months, no categories
Iḿ trying to get the trimmed mean (removing outliers) for the lifetime column
this is my query so far
#upper bound and lower bound
WITH
bounds AS (
SELECT
PERCENTILE_CONT(ClientLifeTime, .25) OVER() AS lower_bound,
PERCENTILE_CONT(ClientLifeTime, .75) OVER() AS upper_bound
FROM
uniqueClientsWithLifeTime
)
SELECT
AVG(ClientLifeTime) as trimmed_mean,
FROM
uniqueClientsWithLifeTime
WHERE
ClientLifeTime > (SELECT bounds.lower_bound FROM bounds)
AND
ClientLifeTime < (SELECT bounds.upper_bound FROM bounds)
but I get an error that says:
Scalar subquery produced more than one element and If I do SELECT bounds.lower_bound FROM bounds or the upper_bound, I do get a value although Iḿ not sure if it is the correct value.
Things I've looked into:
I check this article on how to do this but it seems to provide a solution to if you have data that can be splitted on categories although Iḿ not sure how to implement the function there in BigQuery https://tech.aaronteoh.com/bigquery-parameter-estimation/#trimmed_mean
thereś also another function called approx_quantiles which seems to be a more recent approach on how to get percentiles, but Iḿ not sure how to implement it https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#approx_quantiles
Any help is appreciated
The problem with current solution is that windowing function is not the same as aggregate function - former is not reducing ("aggregating") number of rows, while the latter does. Windowing function might serve same purpose, but it won't change number of rows at all...
It's pretty fundamental concept for sql-based platforms, so I'd really recommend to go back to the basics on this one: https://www.google.com/search?q=window+functions+and+aggregate+functions
Solution-wise your second idea is the right one:
#upper bound and lower bound
WITH
bounds AS (
SELECT
APPROX_QUANTILES(ClientLifeTime, 100)[OFFSET(25)] AS lower_bound,
APPROX_QUANTILES(ClientLifeTime, 100)[OFFSET(75)] AS upper_bound
FROM
uniqueClientsWithLifeTime
)
SELECT
AVG(ClientLifeTime) as trimmed_mean,
FROM
uniqueClientsWithLifeTime
WHERE
ClientLifeTime > (SELECT bounds.lower_bound FROM bounds)
AND
ClientLifeTime < (SELECT bounds.upper_bound FROM bounds)
100
will translate quantiles into percentile (for 100 quantiles 1 quantile = 1 percentile).
more details about the function you can find here.