r, string, dataframe, similarity

How to find similarity between different variables according to consequent string values in R?


I have a DF structured like this:

 X  Y  Z 
 D  E  1
 D  F  2
 D  G  3
 L  E  1
 L  F  2
 L  G  3
 M  N  4
 M  O  5
 S  N  4
 S  O  5

I want to obtain two different clusters ("L - D", "M - S") according to the second-column values that the X values have in common. The output would be structured like this:

 Clust.1   Clust.2
    L         M
    D         S

How can I do this?

Thank you for your suggestions!


Solution

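  • Here is one idea using the tidyverse (dplyr and tidyr). It groups on Z, which in the example data corresponds one-to-one to Y:

    library(dplyr)
    library(tidyr)

    df %>% 
     group_by(X) %>% 
     summarise(Z = toString(Z)) %>%   # one signature string per X, e.g. "1, 2, 3"
     group_by(Z) %>% 
     mutate(new = seq(n())) %>%       # row index within each signature group
     spread(Z, X)                     # one column per signature (cluster)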
    

    which gives,

    # A tibble: 2 x 3
        new `1, 2, 3` `4, 5`
    * <int>    <fctr> <fctr>
    1     1         D      M
    2     2         L      S
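
For comparison, here is a minimal base R sketch of the same grouping idea, keyed on Y (the second column) as the question describes rather than on Z. It assumes the data frame is called `df` and is rebuilt here from the question's table; the names `y_by_x`, `signature`, and `clusters` are illustrative only. It also assumes that two X values belong to the same cluster only when their sets of Y values match exactly.

```r
# Rebuild the example data from the question (assumed to be named df)
df <- data.frame(
  X = c("D", "D", "D", "L", "L", "L", "M", "M", "S", "S"),
  Y = c("E", "F", "G", "E", "F", "G", "N", "O", "N", "O"),
  Z = c(1, 2, 3, 1, 2, 3, 4, 5, 4, 5),
  stringsAsFactors = FALSE
)

# Collect the Y values observed for each X
y_by_x <- split(df$Y, df$X)

# Build one signature string per X from its sorted, unique Y values
signature <- vapply(
  y_by_x,
  function(y) paste(sort(unique(y)), collapse = ", "),
  character(1)
)

# X values sharing the same signature form one cluster
clusters <- split(names(signature), signature)
names(clusters) <- paste0("Clust.", seq_along(clusters))

clusters
```

With the example data, `clusters$Clust.1` holds D and L, and `clusters$Clust.2` holds M and S. Because both groups have the same size here, `as.data.frame(clusters)` would give the two-column layout shown in the question; with unequal group sizes the list form is the safer thing to keep.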