Search code examples
pythonpandasdataframeunique

How to count the number of unique elements in a column in a Pandas dataframe


Say I have a data frame that records a customer (denoted by a letter) and the date that they arrived in a store.

    customer       date
       A        2010-01-01
       B        2010-01-01
       A        2010-01-02
       C        2010-01-02
       D        2010-01-03
       D        2010-01-03

I want to count the number of unique customers that come in to the store and display the cumulative amount of unique customers each day. So I don't want to double count a customer that's already been in the store in the past.

The desired output would be something like this:

     date         cumulative unique customers to date
  2010-01-01                     2   # A, B
  2010-01-02                     3   # A, B, C
  2010-01-03                     4   # A, B, C, D

So after 3 days, we have 4 distinct customers that have arrived

So far, the closest I've come is using the groupby() function:

df.groupby('Date')['Customer'].nunique() which gives me

     date                 unique customers per date
  2010-01-01                     2   # A, B
  2010-01-02                     2   # A, C
  2010-01-03                     1   # D

However, this is just the number of unique customers per day.

I'm thinking I need to create a new column and a function that goes through each row and places a 1 into the new column if it's the first visit of the customer in that row, 0 otherwise. Then apply the cumsum() method on this new column (for each day) to get the desired outcome.

I'm having problems coming up with code that takes a value in the 'customer' column and compares it to every other value in the column to determine if it's the first of its kind - or is this even necessary?


Solution

  • try this, drop_duplicates along with groupby

    >>> df.drop_duplicates(["customer"]).groupby("date")['customer'].nunique().cumsum()
    date
    2010-01-01    2
    2010-01-02    3
    2010-01-03    4
    Name: customer, dtype: int64