Search code examples
pythonsnowflake-cloud-data-platformhistogram

Creating a histogram using the Snowpark Python API


In Snowflake SQL, there is the WIDTH_BUCKET function which can be used to create a histogram:

with hist as (
  select
    width_bucket(
      x,
      min(x) over (partition by null),
      max(x) over (partition by null),
      10
    ) as hist_bin
  from mydata
)

select hist_bin, count(*) as hist_count
from hist
group by 1
order by 1

It's tedious, but it works.

However, I don't see an equivalent width_bucket function in the Snowpark Python API.

Is there a straightforward equivalent in Snowpark?

Or do I also need to construct the buckets manually with a big ugly case expression?


Solution

  • You can access any built-in SQL function with snowflake.snowpark.functions.call_function

    From the documentation examples:

    >>> df = session.create_dataframe([1, 2, 3, 4], schema=["a"])  # a single column with 4 rows
    >>> df.select(call_function("avg", col("a"))).show()
    
    "AVG(""A"")"
    2.500000

    To access a user-defined function, use snowflake.snowpark.functions.call_udf