
R tm TermDocumentMatrix based on a sparse matrix


I have a collection of books in txt format and want to apply some procedures from the R tm package to them. However, I prefer to clean the texts in bash rather than in R because it is much faster.

Suppose I can produce with bash a table that loads into R as a data.frame such as:

book term frequency
--------------------
1     the      10
1     zoo      2
2     animal   2
2     car      3
2     the      20

I know that a TermDocumentMatrix (TDM) is really a sparse matrix with metadata. In fact, I can create a sparse matrix from a TDM by passing the TDM's i, j and v components as the i, j and x arguments of the sparseMatrix function. Please help me if you know how to do the inverse, that is, how to construct a TDM from the three columns of the data.frame above. Thanks!
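For reference, the TDM-to-sparse-matrix direction described above can be sketched as follows. This is only an illustration under assumptions: the two toy documents and the object names `tdm` and `sm` are made up here, and the Matrix package is assumed to be the source of `sparseMatrix`.

```r
library(tm)
library(Matrix)

# Toy corpus (hypothetical data, only there to have a TDM to convert)
corp <- VCorpus(VectorSource(c("the zoo", "animal car the")))
tdm <- TermDocumentMatrix(corp)

# A TDM stores its non-zero cells as triplets: row index i, column index j, value v
sm <- sparseMatrix(
  i = tdm$i,
  j = tdm$j,
  x = as.numeric(tdm$v),
  dims = c(tdm$nrow, tdm$ncol),
  dimnames = list(Terms(tdm), Docs(tdm))
)
sm
```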


Solution

  • You could try

    library(tm)
    library(reshape2)

    # Inline copy of the table from the question (separator row omitted)
    txt <- "book term frequency
    1     the      10
    1     zoo      2
    2     animal   2
    2     car      3
    2     the      20"
    df <- read.table(header = TRUE, text = txt, stringsAsFactors = FALSE)

    # Reshape to wide format: one row per book, one column per term
    dfwide <- dcast(data = df, book ~ term, value.var = "frequency", fill = 0)
    mat <- as.matrix(dfwide[, -1])
    rownames(mat) <- dfwide$book  # label rows by book id rather than row position

    # Transpose so that terms are rows and documents are columns
    (tdm <- as.TermDocumentMatrix(t(mat), weighting = weightTf))
    # <<TermDocumentMatrix (terms: 4, documents: 2)>>
    #   Non-/sparse entries: 5/3
    # Sparsity           : 38%
    # Maximal term length: 6
    # Weighting          : term frequency (tf)
    
    as.matrix(tdm)
    #        Docs
    # Terms     1  2
    # animal    0  2
    # car       0  3
    # the      10 20
    # zoo       2  0
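
If the collection is large, the dense intermediate matrix above may become expensive. A possible alternative, sketched here under assumptions, is to build the triplet representation directly with `simple_triplet_matrix` from the slam package (the sparse format tm uses internally) and pass it to `as.TermDocumentMatrix`. The names `terms`, `docs`, `stm` and `tdm2` are illustrative, and the sketch assumes each (book, term) pair occurs only once in the data.frame.

```r
library(tm)
library(slam)

# Same data as in the question
df <- data.frame(
  book = c(1, 1, 2, 2, 2),
  term = c("the", "zoo", "animal", "car", "the"),
  frequency = c(10, 2, 2, 3, 20),
  stringsAsFactors = FALSE
)

# Row and column labels of the matrix
terms <- sort(unique(df$term))
docs <- sort(unique(df$book))

# Each data.frame row becomes one (i, j, v) triplet: term index, book index, count
stm <- simple_triplet_matrix(
  i = match(df$term, terms),
  j = match(df$book, docs),
  v = df$frequency,
  nrow = length(terms),
  ncol = length(docs),
  dimnames = list(Terms = terms, Docs = as.character(docs))
)

tdm2 <- as.TermDocumentMatrix(stm, weighting = weightTf)
as.matrix(tdm2)
```

No dense matrix is created until the final `as.matrix` call, which is only there to inspect the result.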