Say I have a data frame that records a customer (denoted by a letter) and the date that they arrived in a store.
customer date
A 2010-01-01
B 2010-01-01
A 2010-01-02
C 2010-01-02
D 2010-01-03
D 2010-01-03
I want to count the number of unique customers that come into the store and display the cumulative number of unique customers each day. So I don't want to double count a customer who has already been in the store in the past.
The desired output would be something like this:
date cumulative unique customers to date
2010-01-01 2 # A, B
2010-01-02 3 # A, B, C
2010-01-03 4 # A, B, C, D
So after 3 days, we have 4 distinct customers that have arrived.
So far, the closest I've come is using the groupby()
function:
df.groupby('date')['customer'].nunique()
which gives me
date unique customers per date
2010-01-01 2 # A, B
2010-01-02 2 # A, C
2010-01-03 1 # D
However, this is just the number of unique customers per day.
I'm thinking I need a new column and a function that goes through each row, placing a 1 in the new column if it's that customer's first visit and 0 otherwise. Then I could apply cumsum() to the daily totals of this new column to get the desired outcome.
I'm having problems coming up with code that takes a value in the 'customer' column and compares it to every other value in the column to determine if it's the first of its kind - or is this even necessary?
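For what it's worth, the flag-column idea in the question can be done without any explicit comparison loop: pandas' `duplicated()` already marks every repeat occurrence of a value, so its negation flags first visits. A minimal sketch (the column name `first_visit` is made up for illustration):

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({
    "customer": ["A", "B", "A", "C", "D", "D"],
    "date": ["2010-01-01", "2010-01-01", "2010-01-02",
             "2010-01-02", "2010-01-03", "2010-01-03"],
})

# duplicated() is True for every repeat of a customer after their
# first row, so ~duplicated() is 1 for a first visit, 0 otherwise
df["first_visit"] = (~df["customer"].duplicated()).astype(int)

# Sum the first visits per day, then take the running total
result = df.groupby("date")["first_visit"].sum().cumsum()
print(result)
# 2010-01-01    2
# 2010-01-02    3
# 2010-01-03    4
```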
Try drop_duplicates along with groupby:
>>> df.drop_duplicates(["customer"]).groupby("date")['customer'].nunique().cumsum()
date
2010-01-01 2
2010-01-02 3
2010-01-03 4
Name: customer, dtype: int64
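One caveat worth noting: because drop_duplicates removes every repeat visit first, a day on which only returning customers show up disappears from the result entirely. If you need every calendar day present, a reindex with forward-fill restores the missing dates (a sketch, assuming the dates are parsed as datetimes):

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "B", "A", "C", "D", "D"],
    "date": pd.to_datetime(["2010-01-01", "2010-01-01", "2010-01-02",
                            "2010-01-02", "2010-01-03", "2010-01-03"]),
})

out = (df.drop_duplicates(["customer"])
         .groupby("date")["customer"]
         .nunique()
         .cumsum())

# Reindex over the full date range; days with no new customers
# carry forward the previous cumulative count
full = out.reindex(pd.date_range(df["date"].min(), df["date"].max()),
                   method="ffill")
print(full)
```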