Search code examples
google-bigqueryentropy

BigQuery: compute entropy of a column


I have a suggestion for the BQ folks: I think it would be very useful if there was a built-in function that would return the entropy of a column. A column of discrete categories or values would be relatively easy. Thoughts? Does this already exist but I didn't find it?


Solution

  • The simple solution is below - it computes number of distinct values in the column, and then takes logarithm on base 2 - this gives the number of bits needed to encode all different values, which is the column entropy.

    SELECT LOG2(COUNT(DISTINCT column)) FROM Table
    

    However, this doesn't take into account the fact that different values have different probabilities. Shannon entropy formula is -SUM(P(xi)*log(P(xi)) where P(xi) is probability of value xi. Here is an example how to compute that in BigQuery, Shannon entropy for column year in natality table:

    select -sum(p*log2(p)) from (
    select ratio_to_report(c) over() p from (
    select year, count(*) c from publicdata:samples.natality group by 1))
    

    UPDATE In case the column variable is not discrete type (i.e. FLOAT), it is possible to discretize the values. Example below shows one approach - first it finds the maximum and minimum values, computes the range, and then puts all the FLOAT values (weight_pound column in natality table) into 100 buckets. After that - problem is reduced to the entropy of INTEGER values.

    select discrete_weight, count(*) from (
    select 
      cast((weight_pounds - min_weight) * 100 / range_weight as integer)
        as discrete_weight 
    from [publicdata:samples.natality] a cross join 
    (select 
      min(weight_pounds) as min_weight, 
      max(weight_pounds) - min(weight_pounds) as range_weight 
    from [publicdata:samples.natality]) b) group by 1