I have a dataset that consists of customers and accounts where a customer can have multiple accounts. The dataset has several 'flags' on each account.
I'm trying to get a count of 'unique' hits on these flags per customer, i.e. if 3 accounts have flag1 I want this to count as 1 hit, but if just one of the accounts have flag2 too I want this to count as 2. Essentially, I want to see how many flags each customer hits across all of their accounts.
Example Input data frame:
cust acct flag1 flag2 flag3
a 123 0 1 0
a 456 1 1 0
b 789 1 1 1
c 428 0 1 0
c 247 0 1 0
c 483 0 1 1
Example Output dataframe:
cust acct flag1 flag2 flag3 UniqueSum
a 123 0 1 0 2
a 456 1 1 0 2
b 789 1 1 1 3
c 428 0 1 0 2
c 247 0 1 0 2
c 483 0 1 1 2
I've tried to use the following:
fSumData <- ddply(fData, "cust", numcolwise(sum, c(flag1,flag2,flag3))
but this sums the acct
column too giving me one row per customer where I'd like to have the same amount of rows as the customer has accounts.
One way that comes to my mind, is to colSum
for each cust
and check which are greater than 0. For example,
> tab
cust acct flag1 flag2 flag3
1 a 123 0 1 0
2 a 456 1 1 0
3 b 789 1 1 1
4 c 428 0 1 0
5 c 247 0 1 0
6 c 483 0 1 1
> uniqueSums <- sapply(tab$cust, function(cust) length(which(colSums(tab[tab$cust == cust,3:5]) > 0)))
> cbind(tab, uniqueSums = uniqueSums)
cust acct flag1 flag2 flag3 uniqueSums
1 a 123 0 1 0 2
2 a 456 1 1 0 2
3 b 789 1 1 1 3
4 c 428 0 1 0 2
5 c 247 0 1 0 2
6 c 483 0 1 1 2
For each value of cust
, the function in sapply
finds the rows, does a vectorized sum and checks for values that are greater than 0.