Search code examples
runiquecounting

Count flags across multiple rows depending on key


I have a dataset that consists of customers and accounts where a customer can have multiple accounts. The dataset has several 'flags' on each account.

I'm trying to get a count of 'unique' hits on these flags per customer, i.e. if 3 accounts have flag1 I want this to count as 1 hit, but if just one of the accounts have flag2 too I want this to count as 2. Essentially, I want to see how many flags each customer hits across all of their accounts.

Example Input data frame:
    cust  acct flag1 flag2 flag3
    a     123    0    1      0
    a     456    1    1      0
    b     789    1    1      1
    c     428    0    1      0
    c     247    0    1      0
    c     483    0    1      1
Example Output dataframe:
    cust  acct flag1 flag2 flag3 UniqueSum
    a     123    0    1      0      2
    a     456    1    1      0      2
    b     789    1    1      1      3
    c     428    0    1      0      2
    c     247    0    1      0      2
    c     483    0    1      1      2

I've tried to use the following:

fSumData <- ddply(fData, "cust", numcolwise(sum, c(flag1,flag2,flag3))

but this sums the acct column too giving me one row per customer where I'd like to have the same amount of rows as the customer has accounts.


Solution

  • One way that comes to my mind, is to colSum for each cust and check which are greater than 0. For example,

    > tab
      cust acct flag1 flag2 flag3
    1    a  123     0     1     0
    2    a  456     1     1     0
    3    b  789     1     1     1
    4    c  428     0     1     0
    5    c  247     0     1     0
    6    c  483     0     1     1
    > uniqueSums <- sapply(tab$cust, function(cust) length(which(colSums(tab[tab$cust == cust,3:5]) > 0)))
    > cbind(tab, uniqueSums = uniqueSums)
      cust acct flag1 flag2 flag3 uniqueSums
    1    a  123     0     1     0          2
    2    a  456     1     1     0          2
    3    b  789     1     1     1          3
    4    c  428     0     1     0          2
    5    c  247     0     1     0          2
    6    c  483     0     1     1          2
    

    For each value of cust, the function in sapply finds the rows, does a vectorized sum and checks for values that are greater than 0.