I am looking for a method to find associations between words in a table (or list). Each cell of the table contains several words separated by ";".
Let's say I have a table like the one below, where several words, such as 'af' and 'aa', belong to one cell.
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
I want to find associations between all words in the entire dataset, across cells (I am not interested in within-cell associations). For example, how many times do aa and dd appear in the same row? Or, which words have the highest association (e.g. aa with bb, aa with dd, ...)?
Expected output (the numbers may be inaccurate, and the association does not have to be shown with '--'):
2 pairs association (numbers can be counts, probability or normalized association)
association number of associations
aa--dd 3
aa--c 3
bb--dd 2
...
3 pairs association
aa--bb--dd 3
aa--bb--c 3
...
4 pairs association
aa--bb--c--dd 2
aa--bf--c--dd 2
...
Can you help me implement this in R? Thanks.
I am not sure whether you have something like the approach below in mind. It is basically a custom function that we use in a nested purrr::map call. The outer call loops over the number of pairs (2, 3, 4), and the inner call uses combn to create all possible column combinations as input and applies the custom function to each one to create the desired output.
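To make the two building blocks concrete, here is a small sketch that is separate from the answer's own code. It assumes tidyr is installed, and the one-row data frame one_row is made up purely for illustration.

```r
library(tidyr)

# combn() with simplify = FALSE returns a list of column-index vectors
idx <- combn(1:4, 2, simplify = FALSE)
length(idx)  # 6 column pairs: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)

# separate_rows() turns one row with multi-word cells into one row per word
one_row <- data.frame(A = "af;aa", D = "df;dd", stringsAsFactors = FALSE)
expanded <- separate_rows(one_row, A, sep = ";")
expanded <- separate_rows(expanded, D, sep = ";")
nrow(expanded)  # 4 rows: every word of A paired with every word of D
```

Applying separate_rows to each selected column in turn is what produces the cross-cell word combinations that are counted afterwards.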
library(tidyverse)
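# x: vector of column indices, e.g. c(1, 4) for columns A and D
count_pairs <- function(x) {
  s <- seq_along(x)
  df[, x] %>%
    # split each selected column on ";" so every word gets its own row
    reduce(s, separate_rows, .init = ., sep = ";") %>%
    # count each distinct word combination
    group_by(across()) %>%
    count() %>%
    # rename the columns to 1, 2, ... so results can be row-bound
    rename(set_names(s))
}

# outer map: size of the combination; inner map_dfr: every column combination
map(2:4,
    ~ map_dfr(combn(1:4, .x, simplify = FALSE),
              count_pairs) %>% arrange(-n))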
#> [[1]]
#> # A tibble: 50 x 3
#> # Groups: 1, 2 [50]
#> `1` `2` n
#> <chr> <chr> <int>
#> 1 aa dd 4
#> 2 aa bf 3
#> 3 aa c 3
#> 4 bf dd 3
#> 5 c dd 3
#> 6 aa bb 2
#> 7 af bf 2
#> 8 az bf 2
#> 9 aa cc 2
#> 10 af cc 2
#> # ... with 40 more rows
#>
#> [[2]]
#> # A tibble: 70 x 4
#> # Groups: 1, 2, 3 [70]
#> `1` `2` `3` n
#> <chr> <chr> <chr> <int>
#> 1 aa bf dd 3
#> 2 aa c dd 3
#> 3 aa bb c 2
#> 4 aa bf c 2
#> 5 aa bf cc 2
#> 6 af bf cc 2
#> 7 az bf c 2
#> 8 aa bb dd 2
#> 9 af bf dd 2
#> 10 az bf dd 2
#> # ... with 60 more rows
#>
#> [[3]]
#> # A tibble: 35 x 5
#> # Groups: 1, 2, 3, 4 [35]
#> `1` `2` `3` `4` n
#> <chr> <chr> <chr> <chr> <int>
#> 1 aa bb c dd 2
#> 2 aa bf c dd 2
#> 3 aa bf cc dd 2
#> 4 af bf cc dd 2
#> 5 az bf c dd 2
#> 6 aa bb c df 1
#> 7 aa bb cc dd 1
#> 8 aa bb cc df 1
#> 9 aa bb cd dd 1
#> 10 aa bc c dc 1
#> # ... with 25 more rows
# the data
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
Created on 2021-08-11 by the reprex package (v2.0.1)
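If labels in the "aa--dd" style from the question are wanted, the following base R sketch is one possible way to get them. It is an alternative to, not a restatement of, the tidyverse code above: count_labelled is a hypothetical helper introduced here, and the proportion at the end assumes that no word is repeated within a single cell.

```r
# the data, as in the question
df <- read.table(text = "
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: for the columns in `cols`, build every cross-cell
# word combination per row and count how often each label occurs.
count_labelled <- function(df, cols) {
  per_row <- lapply(seq_len(nrow(df)), function(i) {
    # split each selected cell of row i into its words
    words <- lapply(df[i, cols], function(cell) strsplit(cell, ";", fixed = TRUE)[[1]])
    # one row per combination that takes one word from each cell
    combos <- expand.grid(words, stringsAsFactors = FALSE)
    do.call(paste, c(unname(as.list(combos)), sep = "--"))
  })
  sort(table(unlist(per_row)), decreasing = TRUE)
}

ad <- count_labelled(df, c("A", "D"))
ad[["aa--dd"]]             # 4: aa and dd share a row in all four rows
ad[["aa--dd"]] / nrow(df)  # the same association as a proportion of rows

# all four columns at once
abcd <- count_labelled(df, c("A", "B", "C", "D"))
head(abcd)
```

Looping this helper over combn(names(df), k, simplify = FALSE) for k = 2, 3, 4 would cover the same column combinations as the nested map call.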