Tags: python, pandas, data-analysis, python-ggplot

Plotting event density in Python with ggplot and pandas


I am trying to visualize data of this form:

  timestamp               senderId
0     735217  106758968942084595234
1     735217  114647222927547413607
2     735217  106758968942084595234
3     735217  106758968942084595234
4     735217  114647222927547413607
5     etc...

geom_density works if I don't separate the senderIds:

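import pandas as pd
from ggplot import ggplot, aes, geom_density

df = pd.read_pickle('data.pkl')
df.columns = ['timestamp', 'senderId']
# One density curve over all events, regardless of sender
plot = ggplot(aes(x='timestamp'), data=df) + geom_density()
print(plot)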

The result looks as expected:

[Image: density plot]

However, if I try to show each senderId separately, as is done in the documentation, it fails:

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
ValueError: `dataset` input should have multiple elements.

With a larger dataset (~40K events), I get a different error:

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
numpy.linalg.linalg.LinAlgError: singular matrix

Any ideas? There are some answers on SO about these errors, but none of them seem relevant.

This is the kind of graph I want (from ggplot's doc):

[Image: density plot from ggplot's documentation]


Solution

  • With the smaller dataset:

    > plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
    ValueError: `dataset` input should have multiple elements.

    This was because some senderIds had only one row, and a density cannot be estimated from a single point.

  • With the bigger dataset:

    > plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
    numpy.linalg.linalg.LinAlgError: singular matrix

    This was because for some senderIds I had multiple rows at the exact same timestamp, which ggplot does not support. The likely trigger is a sender whose rows all share one timestamp: those values have zero variance, so the covariance matrix behind the density estimate is singular. I was able to solve it by using finer timestamps.
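
Both cases can be checked for before plotting. Below is a minimal sketch, not a definitive fix. It assumes the timestamps are day-level ordinals derived from datetimes, which the question does not confirm. The helper names `drop_degenerate_senders` and `fractional_ordinal` are hypothetical (they are not part of pandas or ggplot), and the toy data is made up for illustration.

```python
from datetime import datetime

import pandas as pd


def drop_degenerate_senders(df, group_col="senderId", value_col="timestamp"):
    """Keep only groups that have at least two distinct values in value_col.

    Groups with a single row, or with every row at the same timestamp,
    are the ones a per-group density estimate cannot handle.
    """
    distinct = df.groupby(group_col)[value_col].transform("nunique")
    return df[distinct >= 2]


def fractional_ordinal(dt):
    """Day ordinal plus the elapsed fraction of the day (a finer timestamp)."""
    seconds = dt.hour * 3600 + dt.minute * 60 + dt.second + dt.microsecond / 1e6
    return dt.toordinal() + seconds / 86400.0


# Hypothetical toy data: "a" is fine, "b" repeats one timestamp, "c" has one row.
events = pd.DataFrame({
    "timestamp": [735217.0, 735217.5, 735218.0, 735217.0, 735217.0, 735219.0],
    "senderId": ["a", "a", "a", "b", "b", "c"],
})
plottable = drop_degenerate_senders(events)
print(sorted(plottable["senderId"].unique()))  # ['a']

# Two events on the same day no longer collapse to the same value.
morning = fractional_ordinal(datetime(2013, 12, 15, 6, 0))
evening = fractional_ordinal(datetime(2013, 12, 15, 18, 0))
print(morning, evening)  # 735217.25 735217.75
```

Filtering drops the problematic senders from the plot entirely, while finer timestamps keep them as long as their events really happened at different times. A sender with a single event still needs to be filtered out either way.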