r matrix text text-mining adjacency-matrix

R: Creating a co-occurrence matrix


My question is about text mining and text processing. I would like to build a co-occurrence matrix from my data, which looks like this:

dat <- read.table(text="id_reférence id_paper
        621107   621100
        621100   621101
        621107   621102
        621109   621103
        621105   621104
        621103   621105
        621109   621106
        621106   621107
        621107   621108
        621106   621109", header=TRUE)

expected <- matrix(0,10,10)
### Article 1 has been cited by article 2
expected[2, 1] <- 1

Thanks in advance :)


Solution

  • Here is another approach using data.table. One limitation is that the approach below does not produce a sparseMatrix. Depending on the size of your data set, it might be worth looking at an approach that yields a sparse data object.

    library(data.table)
    setDT(dat)
    # split the id_reférence column into multiple rows by comma
    # code for this step taken from: https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows
    dat = dat[, strsplit(as.character(id_reférence), ",", fixed=TRUE),
       by = .(id_paper, id_reférence)][, id_reférence := NULL][
        , setnames(.SD, "V1", "id_reférence")]
    # add a value column for casting
    dat[, cite := 1]
    # cast your data into wide format: one row per id_paper, one column per id_reférence
    dat = dcast(dat, id_paper ~ id_reférence, value.var = "cite", fill = 0)[, id_paper := NULL]
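
If a sparse object is preferred, the following is a minimal sketch of one possible way to get there with `Matrix::sparseMatrix`, not a tested drop-in for the code above. It assumes every row of the data holds exactly one reference id (as in the sample), and it follows the `expected[2, 1]` convention from the question, with rows as citing papers and columns as cited references. The names `edges`, `ids` and `adj` are illustrative, and the column is spelled `id_reference` (ASCII) here only to sidestep encoding issues.

```r
library(Matrix)

# Same edge list as in the question, rebuilt with an ASCII column name
edges <- data.frame(
  id_reference = c(621107, 621100, 621107, 621109, 621105,
                   621103, 621109, 621106, 621107, 621106),
  id_paper     = 621100:621109
)

# One shared, sorted id set so rows and columns use the same ordering
ids <- sort(unique(c(edges$id_reference, edges$id_paper)))

# Row = citing paper, column = cited reference
adj <- sparseMatrix(
  i = match(edges$id_paper, ids),
  j = match(edges$id_reference, ids),
  x = rep(1, nrow(edges)),
  dims = c(length(ids), length(ids)),
  dimnames = list(as.character(ids), as.character(ids))
)

# Article 2 (621101) cites article 1 (621100)
adj["621101", "621100"]
```

Using one shared id vector for both `i` and `j` keeps the matrix square even when some ids appear only as a paper or only as a reference, which the wide table from `dcast` does not guarantee.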