I'm looking for a way to compute the boxplot whisker values manually within Altair itself (that is, without adding an extra column to the dataframe).
Below is the plot I'm trying to create. The orange area should align with the boxplot whiskers, i.e. it should end at the outermost values of 'x' that still lie within 1.5 * IQR of the quartiles.
I have been playing around with Vega expressions (inrange, clampRange, ...) but could not find a way to do this.
import altair as alt
import pandas as pd
values = [0, 3, 4.4, 4.5, 4.6, 5, 7]
df = pd.DataFrame({'x': values})
points = alt.Chart(df).mark_circle(color='black', size=120).encode(
    x=alt.X('x:Q', scale=alt.Scale(zero=False)),
)
boxplot = alt.Chart(df).mark_boxplot(ticks=True, extent=1.5, outliers=True).encode(
    x='x:Q',
)
iqr = alt.Chart(df).mark_rect(color='lime').encode(
    x='q1(x):Q',
    x2='q3(x):Q'
)
whiskers = alt.Chart(df).mark_rect(color='orange').transform_aggregate(
    q1='q1(x)',
    q3='q3(x)',
).transform_calculate(
    iqr=alt.datum.q3 - alt.datum.q1,
    q0=alt.datum.q1 - (alt.datum.iqr * 1.5),
    q100=alt.datum.q3 + (alt.datum.iqr * 1.5),
).encode(
    x='q0:Q',
    x2='q100:Q',
)
minmax = alt.Chart(df).mark_rect(color='red').transform_aggregate(
    xmin='min(x)',
    xmax='max(x)'
).encode(
    x='xmin:Q',
    x2='xmax:Q',
).properties(width=1000)
((boxplot + points) & (minmax + whiskers + iqr + points)).resolve_scale(x='shared')
The key here is to use the aggregated quartile values to filter the original data. When you use transform_aggregate, you reduce the data to only the aggregate values you create. If you instead use transform_joinaggregate, the aggregate values are joined onto every row of the original data, which means you can then use transform_filter to keep only the points that lie between q1 - 1.5 * IQR and q3 + 1.5 * IQR, and encode the min and max of the remaining points:
whiskers = alt.Chart(df).mark_rect(color='orange').transform_joinaggregate(
    q1='q1(x)',
    q3='q3(x)',
).transform_calculate(
    iqr='datum.q3 - datum.q1'
).transform_filter(
    # Python concatenates adjacent string literals, so the
    # expression can be split over two lines for readability
    'datum.x < (datum.q3 + datum.iqr * 1.5)'
    '&& datum.x > (datum.q1 - datum.iqr * 1.5)'
).encode(
    x='min(x)',
    x2='max(x)',
)
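As a sanity check, the same filter logic can be sketched in plain Python to see where the orange area should end for the sample data. This is only an illustration: the helper name whisker_bounds is made up for this example, and it assumes quartiles are computed by linear interpolation (the "inclusive" method of the standard library), which I believe matches what Vega does for q1 and q3 but have not verified against its source.

```python
from statistics import quantiles


def whisker_bounds(values, extent=1.5):
    """Return the outermost data points that lie inside the whisker fences.

    Hypothetical helper for illustration; mirrors the strict inequalities
    used in the transform_filter expression above.
    """
    # Quartiles by linear interpolation (assumed to match Vega's q1/q3)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - extent * iqr
    upper_fence = q3 + extent * iqr
    # Keep only the original points strictly between the two fences
    inside = [v for v in values if lower_fence < v < upper_fence]
    return min(inside), max(inside)


values = [0, 3, 4.4, 4.5, 4.6, 5, 7]
print(whisker_bounds(values))  # (3, 5)
```

For this data the quartiles come out at roughly 3.7 and 4.8, so the fences sit near 2.05 and 6.45; the points 0 and 7 fall outside and the whiskers end at 3 and 5, which is where the orange rectangle should stop.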