Manually calculate the boxplot whiskers in altair


I'm looking for a way to compute the values of the boxplot whiskers manually within Altair itself, that is, without adding an extra column to the DataFrame.

Below is the plot I'm trying to create. The orange area should align with the boxplot whiskers, i.e. it should run from the smallest to the largest value of 'x' that still lies within 1.5 * IQR of the quartiles.

I have been experimenting with Vega expressions (inrange, clampRange, ...) but could not find a way to do this.


import altair as alt
import pandas as pd

values =  [0, 3, 4.4, 4.5, 4.6, 5, 7]
df = pd.DataFrame({'x': values})

points = alt.Chart(df).mark_circle(color='black', size=120).encode(
    x=alt.X('x:Q', scale=alt.Scale(zero=False)),
)

boxplot = alt.Chart(df).mark_boxplot(ticks=True, extent=1.5, outliers=True).encode(
    x='x:Q',
)

iqr = alt.Chart(df).mark_rect(color='lime').encode(
    x='q1(x):Q',
    x2='q3(x):Q'
)

whiskers = alt.Chart(df).mark_rect(color='orange').transform_aggregate(
    q1='q1(x)',
    q3='q3(x)',
).transform_calculate(
    iqr=alt.datum.q3 - alt.datum.q1,
    q0=alt.datum.q1 - (alt.datum.iqr * 1.5),
    q100=alt.datum.q3 + (alt.datum.iqr * 1.5),
).encode(
    x='q0:Q',
    x2='q100:Q',
)

minmax = alt.Chart(df).mark_rect(color='red').transform_aggregate(
    xmin='min(x)',
    xmax='max(x)'
).encode(
    x='xmin:Q',
    x2='xmax:Q',
).properties(width=1000)


((boxplot + points) & (minmax + whiskers + iqr + points)).resolve_scale(x='shared')

Solution

  • The key is to use the aggregated quartile values to filter the original data. transform_aggregate reduces the data to only the aggregate values you define, so the individual x values are no longer available afterwards. transform_joinaggregate instead joins the aggregate values onto every row of the original data, so you can follow it with transform_filter to keep only the points that lie between q1 - 1.5 * IQR and q3 + 1.5 * IQR. The min and max of the remaining points are the whisker ends:

    whiskers = alt.Chart(df).mark_rect(color='orange').transform_joinaggregate(
        q1='q1(x)',
        q3='q3(x)',
    ).transform_calculate(
        iqr='datum.q3 - datum.q1'
    ).transform_filter(
        # Python concatenates these adjacent string literals, so the
        # expression can be split over two lines for readability
        'datum.x < (datum.q3 + datum.iqr * 1.5)'
        '&& datum.x > (datum.q1 - datum.iqr * 1.5)'
    ).encode(
        x='min(x)',
        x2='max(x)',
    )
    

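As a sanity check on the numbers, the same joinaggregate, calculate, filter, min/max sequence can be mirrored in plain Python. This is only a sketch: `whisker_ends` is a hypothetical helper name, and it assumes the quartiles are computed by linear interpolation between data points (the "inclusive" method of `statistics.quantiles`), which may not match Vega's q1/q3 exactly for every dataset.

```python
from statistics import quantiles

values = [0, 3, 4.4, 4.5, 4.6, 5, 7]


def whisker_ends(xs, extent=1.5):
    """Return (low, high): the extreme data points inside the whisker fences.

    Hypothetical helper for illustration, not part of Altair.
    """
    # Assumption: linear interpolation between data points for q1/q3.
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    # joinaggregate step: every row keeps its x and gains the aggregates.
    rows = [{"x": x, "q1": q1, "q3": q3, "iqr": iqr} for x in xs]
    # filter step: same strict comparisons as the transform_filter expression.
    kept = [
        r["x"]
        for r in rows
        if r["q1"] - extent * r["iqr"] < r["x"] < r["q3"] + extent * r["iqr"]
    ]
    # encode step: min(x) and max(x) of the surviving rows.
    return min(kept), max(kept)


low, high = whisker_ends(values)
print(low, high)  # prints: 3 5
```

For the sample data the fences fall at roughly 2.05 and 6.45, so 0 and 7 are dropped as outliers and the orange rectangle should span 3 to 5.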