Search code examples
rdataframedplyrdata-analysis

Count the number of records per month in R


Given Merchant ID, Card Number and Date, I need to count the number of Card Numbers associated to each Merchant ID per month. So far I can only count the number of Card Numbers per Merchant ID but I cannot count it yet per Month using this code:

HitsCounter <- summarise(group_by(df, MerchantID), HitsTotal = n_distinct(CardNum))

Reproducible Data:

MerchantID <- c('1234, '1234', '1234', '1234')
CardNum <- c('1abc1', '1abc1', '2xyz2', '3ijk3')
Date <- c('2020-05-07', '2020-05-07', '2019-10-12', '2019-10-25')
df <- data.frame(MerchantID, CardNum, Date)

What I get when I merge HitsCounter to my dataframe df:

MerchantID  CardNum     Date        HitsTotal
  1234       1abc1   2020-05-07         3
  1234       1abc1   2020-05-07         3
  1234       2xyz2   2019-10-12         3
  1234       3ijk3   2019-10-25         3
 

Expected Output Data:

MerchantID  CardNum     Date        HitsPerMonth
  1234       1abc1   2020-05-07          2
  1234       1abc1   2020-02-07          1
  1234       2xyz2   2020-05-12          2
  1234       3ijk3   2019-10-25          1
  1234       3ijk3   2019-10-01          1      

Take note that the logic should completely disregard any duplicate CardNumber even with a different month since a "Hit" is counted per unique CardNumber.

To put it simply, it should answer the question:

How many Unique Card Numbers transacted with the same Merchant ID in the same month?


Solution

  • You can extract year-month value from date and count the unique CardNum values for each MerchantID and month.

    library(dplyr)
    
    df %>%
      mutate(Date = as.Date(Date), 
             month = format(Date, "%Y-%m")) %>%
      group_by(MerchantID, month) %>%
      mutate(HitsPerMonth = n_distinct(CardNum))
    

    If a CardNum has to be calculated only once for each MerchantID irrespective of the month then you can do :

    df %>%
      mutate(Date = as.Date(Date), 
             month = format(Date, "%Y-%m")) %>%
      group_by(MerchantID) %>%
      mutate(CardNum = replace(CardNum, duplicated(CardNum), NA)) %>%
      group_by(month, .add  =TRUE) %>%
      mutate(HitsPerMonth = n_distinct(na.omit(CardNum)))