Search code examples
rdataframedplyr

How do I group rows based on 3 columns according to the following in R


Account Tier 1 Tier 2 Tier 3
Account 1 1 21 55
Account 2 1 21 55
Account 3 2 52 72
Account 4 2 52 66
Account 5 3 52 66
Account 6 3 52 99
Account 7 4 63 43
Account 8 4 63 49
Account 9 5 36 94
Account 10 5 42 55

How do I group the accounts based on all 3 tiers according to the following?

If Account A & B are in the same group in Tier 1 and if Account 1, 2 & 10 are in the same group in Tier 3, then Account 1, 2, 9 & 10 need to be classified in the same final group. Account 9 was included because it groups with Account 10 in Tier 1.

Desired output:

Account Tier 1 Tier 2 Tier 3 Final Group
Account 1 1 21 55 Group 1
Account 2 1 21 55 Group 1
Account 3 2 52 72 Group 2
Account 4 2 52 66 Group 2
Account 5 3 52 66 Group 2
Account 6 3 52 99 Group 2
Account 7 4 63 43 Group 3
Account 8 4 63 49 Group 3
Account 9 5 36 94 Group 1
Account 10 5 42 55 Group 1

This is a simplified version of a bigger problem which has 7+ tiers across 500k data points.


Solution

  • Another example where graph theory and the igraph package comes in handy, to find the connected components:

    ## make a two-column 'edgelist' showing the account<-->tier relationships
    datlong <- cbind(dat[1], stack(dat[-1]))
    
    library(igraph)
    ## make a graph object
    g <- graph_from_data_frame(datlong[1:2])
    ## extract the connected components
    cmp <- components(g)
    ## assign the membership of the accounts based on the components
    dat$final_group <- cmp$membership[match(dat$Account, names(cmp$membership))]
    
    dat
    #     Account Tier1 Tier2 Tier3 final_group
    #1   Account1     1    21    55           1
    #2   Account2     1    21    55           1
    #3   Account3     2    52    72           2
    #4   Account4     2    52    66           2
    #5   Account5     3    52    66           2
    #6   Account6     3    52    99           2
    #7   Account7     4    63    43           3
    #8   Account8     4    63    49           3
    #9   Account9     5    36    94           1
    #10 Account10     5    42    55           1
    

    Where dat was:

    dat <- read.table(text="Account  Tier1  Tier2  Tier3
    Account1  1  21  55
    Account2  1  21  55
    Account3  2  52  72
    Account4  2  52  66
    Account5  3  52  66
    Account6  3  52  99
    Account7  4  63  43
    Account8  4  63  49
    Account9  5  36  94
    Account10  5  42  55", header=TRUE)