I'm trying to plot data with a date as the X axis and several cumulative counts as Ys.
I have a set of items such as:
id1 date1 user1
id2 date2 user1
id3 date3 user2
With this example, I'd want the plot to have 2 lines, the X axis would have three entries (date1, date2, date3) and user1 would have a Y-value of 1 at date1, 2 at date2, 2 at date3; user2 would have 0 at date1, 0 at date2, 1 at date3.
Directly making a Chart, I can't see what I should use to have this cumulative count. E.g.
Chart(data).mark_line().encode(x='date:T', y='count(*)', color='username')
obviously creates a chart where most values are 0 (few entries have exactly the same date).
Ideally,
Chart(data).mark_line().encode(x='date:T', y='cumcount(*)', color='username')
would work, but there doesn't seem to be an equivalent in the documentation.
In my real case, I have ~10 users and a few thousand entries over a few months.
I don't think cumulative count aggregation is yet available in Altair. In the meanwhile, one can do the corresponding manipulation in Pandas. Here is one such way. I am sure there can be more efficient ways of doing so.
import pandas as pd
import numpy as np
np.random.seed(0)
user_list = ['user1', 'user2']
df = pd.DataFrame({'date':range(2000, 2010),
'username':np.random.choice(user_list, 10)})
This is how df
looks like.
date username
0 2000 user1
1 2001 user2
2 2002 user2
3 2003 user1
4 2004 user2
5 2005 user2
6 2006 user2
7 2007 user2
8 2008 user2
9 2009 user2
Cross-tabulation
d = pd.crosstab(df.date, columns=df.username).cumsum()
d = d.stack().reset_index()
d = d.rename(columns={0:'CummulativeCount'})
This is the output of d.head()
.
date username CummulativeCount
0 2000 user1 1
1 2000 user2 0
2 2001 user1 1
3 2001 user2 1
4 2002 user1 1
Now, we can use Altair without bothering about any aggregation.
from altair import Chart
c = Chart(d)
c.mark_line().encode(x='date:T', y='CummulativeCount:Q', color='username')