Search code examples
pythonvisualizationaltairvega-lite

Find the last values of a variable in Altair


I'm trying to create a bar chart in Altair showing the final values of a graph that shows accumulated sums of data per class.

Starting with creating a graphs that shows the accumulating sum:

df = pd.DataFrame({"time":[1,2,3,1,2,3], "value": [1,5,2,2,3,7], "class": ["A","A","A","B","B","B"]})

base = alt.Chart(df).transform_window(
    cumulative = "sum(value)",
    groupby = ["class"],
    frame = [None, 0],
)

base.mark_line().encode(
    x = "time:Q",
    y = "cumulative:Q",
    # detail = "class",
    color = "class"
)

enter image description here

Now I want to create a bar graph, that shows "A" and "B" on the x-axis and the final value of the lines above (i.e. 8 for "A" and 12 for "B") on the y-axis.

I thought I should do that like this, but it doesn't seem to work:

base2 = base.transform_window(
    final = "last_value(cumulative)",
    groupby = ["class"],
    sort = [alt.SortField(field = "time")],
    frame = [None, None],
)

base2.mark_bar().encode(
    x = "class",
    y = "final:Q",
)

It gives me a value of 24 for "A" and 36 for "B". Any ideas on how to do this?


Solution

  • For this particular use case, rather than using a window transform, it seems like you could use a simple aggregate (link) to compute the final values for the bar chart based on the original dataframe (df):

    alt.Chart(df).mark_bar().encode(
        y='final:Q',
        x='class:O'
    ).transform_aggregate(
        final='sum(value)',
        groupby=["class"]
    )
    

    In some ways, this side-steps the question you asked (computing the last value), but given the use case of a cumulative aggregate, this more straightforward solution seems to address the key problem posed here.

    Looking at the code and partial results you shared, I wonder if your solution is actually close to working. It looks like the two values you are getting are each 3x the value you want (e.g., 83=24, 123=36). Given that you have three data points for each class, I think what is happening is that you are computing the last_value from your cumulative result, but when you create the bars it is summing them because you still have six data points in your dataframe, with each data point having the last_value column set as expected. To get only one data point with the last value, an option might be to then produce an aggregate using min or max on the last_value, grouped by "class" (the one you use does not really matter, given that each data point has the same value for the particular class.