Search code examples
pythondata-visualizationaltair

Cumulative count with altair


I'm trying to plot data with a date as the X axis and several cumulative counts as Ys.

I have a set of items such as:

id1 date1 user1
id2 date2 user1
id3 date3 user2

With this example, I'd want the plot to have 2 lines, the X axis would have three entries (date1, date2, date3) and user1 would have a Y-value of 1 at date1, 2 at date2, 2 at date3; user2 would have 0 at date1, 0 at date2, 1 at date3.

Directly making a Chart, I can't see what I should use to have this cumulative count. E.g.

Chart(data).mark_line().encode(x='date:T', y='count(*)', color='username')

obviously creates a chart where most values are 0 (few entries have exactly the same date).

Ideally,

Chart(data).mark_line().encode(x='date:T', y='cumcount(*)', color='username')

would work, but there doesn't seem to be an equivalent in the documentation.

In my real case, I have ~10 users and a few thousand entries over a few months.


Solution

  • I don't think cumulative count aggregation is yet available in Altair. In the meanwhile, one can do the corresponding manipulation in Pandas. Here is one such way. I am sure there can be more efficient ways of doing so.

    import pandas as pd
    import numpy as np
    np.random.seed(0)
    user_list = ['user1', 'user2']
    df = pd.DataFrame({'date':range(2000, 2010),
                      'username':np.random.choice(user_list, 10)})
    

    This is how df looks like.

        date    username
    0   2000    user1
    1   2001    user2
    2   2002    user2
    3   2003    user1
    4   2004    user2
    5   2005    user2
    6   2006    user2
    7   2007    user2
    8   2008    user2
    9   2009    user2
    
    Cross-tabulation
    d = pd.crosstab(df.date, columns=df.username).cumsum()
    d = d.stack().reset_index()
    d = d.rename(columns={0:'CummulativeCount'})
    

    This is the output of d.head().

    date    username    CummulativeCount
    0   2000    user1   1
    1   2000    user2   0
    2   2001    user1   1
    3   2001    user2   1
    4   2002    user1   1
    

    Now, we can use Altair without bothering about any aggregation.

    from altair import Chart
    c = Chart(d)
    c.mark_line().encode(x='date:T', y='CummulativeCount:Q', color='username')
    

    enter image description here