Tags: r, associations, text-mining

Finding associations in a dataset with a list of strings in each cell in R


I am looking for a method to find associations between words in a table (or list). Each cell of the table holds several words separated by ";".

Let's say I have the table below; some words, such as 'af' and 'aa', belong to the same cell.

df<-read.table(text="
A           B            C           D
af;aa;az    bf;bb        c;cc       df;dd
aa;az       bf;bc        c          dc;dd
ah;al;aa    bb           c;cd       dd
af;aa       bf           cc         dd",header=T,stringsAsFactors = F)

I want to find associations between all words in the entire dataset, across cells (I am not interested in within-cell associations). For example, how many times do aa and dd appear in the same row, or which words have the highest association (e.g. aa with bb, aa with dd, ...)?

Expected output (the numbers may be inaccurate, and the association does not have to be written with '--'):

2 pairs association (numbers can be counts, probability or normalized association)
association    number of associations 
aa--dd          3
aa--c           3
bb--dd          2
...
3 pairs association
aa--bb--dd      3
aa--bb--c       3
...

4 pairs association
aa--bb--c--dd   2
aa--bf--c--dd   2
...

Can you help me implement this in R? Thanks.


Solution

  • I am not sure whether you have something like the approach below in mind. It is a custom function used inside a nested purrr::map call. The outer call loops over the number of words per combination (2, 3, 4). The inner call uses combn to generate every combination of columns of that size and passes each one to the custom function, which counts how often each word combination occurs.

    library(tidyverse)
    
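    # x: integer vector of column positions to combine
    count_pairs <- function(x) {
      s <- seq_along(x)
      df[, x] %>% 
        # split each selected column on ";" so every word gets its own row
        reduce(s, separate_rows, .init = ., sep = ";") %>% 
        # count each distinct word combination
        group_by(across(everything())) %>% 
        count() %>% 
        # rename the word columns to 1, 2, ... so the results can be row-bound
        rename(set_names(s))
    }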
    
    map(2:4,
        ~ map_dfr(combn(1:4, .x, simplify = FALSE),
                        count_pairs) %>% arrange(-n))
    #> [[1]]
    #> # A tibble: 50 x 3
    #> # Groups:   1, 2 [50]
    #>    `1`   `2`       n
    #>    <chr> <chr> <int>
    #>  1 aa    dd        4
    #>  2 aa    bf        3
    #>  3 aa    c         3
    #>  4 bf    dd        3
    #>  5 c     dd        3
    #>  6 aa    bb        2
    #>  7 af    bf        2
    #>  8 az    bf        2
    #>  9 aa    cc        2
    #> 10 af    cc        2
    #> # ... with 40 more rows
    #> 
    #> [[2]]
    #> # A tibble: 70 x 4
    #> # Groups:   1, 2, 3 [70]
    #>    `1`   `2`   `3`       n
    #>    <chr> <chr> <chr> <int>
    #>  1 aa    bf    dd        3
    #>  2 aa    c     dd        3
    #>  3 aa    bb    c         2
    #>  4 aa    bf    c         2
    #>  5 aa    bf    cc        2
    #>  6 af    bf    cc        2
    #>  7 az    bf    c         2
    #>  8 aa    bb    dd        2
    #>  9 af    bf    dd        2
    #> 10 az    bf    dd        2
    #> # ... with 60 more rows
    #> 
    #> [[3]]
    #> # A tibble: 35 x 5
    #> # Groups:   1, 2, 3, 4 [35]
    #>    `1`   `2`   `3`   `4`       n
    #>    <chr> <chr> <chr> <chr> <int>
    #>  1 aa    bb    c     dd        2
    #>  2 aa    bf    c     dd        2
    #>  3 aa    bf    cc    dd        2
    #>  4 af    bf    cc    dd        2
    #>  5 az    bf    c     dd        2
    #>  6 aa    bb    c     df        1
    #>  7 aa    bb    cc    dd        1
    #>  8 aa    bb    cc    df        1
    #>  9 aa    bb    cd    dd        1
    #> 10 aa    bc    c     dc        1
    #> # ... with 25 more rows
    
    # the data (define this before running the code above)
    df<-read.table(text="
    A           B            C           D
    af;aa;az    bf;bb        c;cc       df;dd
    aa;az       bf;bc        c          dc;dd
    ah;al;aa    bb           c;cd       dd
    af;aa       bf           cc         dd",header=T,stringsAsFactors = F)
    
    

    Created on 2021-08-11 by the reprex package (v2.0.1)
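
    For comparison, the same counting idea can be sketched in base R without the tidyverse. This is only a rough sketch, not the method above: count_sets is a hypothetical helper name, the "--" separator just mirrors the format in the question, and it assumes that no word is repeated within a single cell (a repeated word would be counted twice).

```r
# Example data from the question
df <- read.table(text = "
A           B            C           D
af;aa;az    bf;bb        c;cc       df;dd
aa;az       bf;bc        c          dc;dd
ah;al;aa    bb           c;cd       dd
af;aa       bf           cc         dd", header = TRUE, stringsAsFactors = FALSE)

# count_sets() is a hypothetical helper, not part of any package.
# For every choice of k columns it expands each row into all cross-column
# word combinations, then tabulates how often each combination occurs.
count_sets <- function(data, k, sep = ";") {
  # cells[[j]][[i]] holds the words in column j, row i
  cells <- lapply(data, strsplit, split = sep, fixed = TRUE)
  col_sets <- combn(seq_along(data), k, simplify = FALSE)
  keys <- unlist(lapply(col_sets, function(cols) {
    unlist(lapply(seq_len(nrow(data)), function(i) {
      words <- lapply(cols, function(j) cells[[j]][[i]])
      # all combinations of one word per selected column, joined with "--"
      grid <- expand.grid(words, stringsAsFactors = FALSE)
      do.call(paste, c(grid, sep = "--"))
    }))
  }))
  sort(table(keys), decreasing = TRUE)
}

pair_counts <- count_sets(df, 2)
quad_counts <- count_sets(df, 4)
head(pair_counts)
```

    On the sample data, pair_counts["aa--dd"] comes to 4 and quad_counts["aa--bb--c--dd"] to 2, which agrees with the tibbles above. Dividing the counts by nrow(df) would give a simple per-row proportion if a normalized measure is preferred.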