I have a collection of books in txt format and want to apply some procedures of the tm
R library to them. However, I prefer to clean the texts in bash rather than in R because it is much faster.
Suppose I am able to get from bash a data.frame such as:
book term frequency
--------------------
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20
I know that TermDocumentMatrices are actually sparse matrices with metadata. In fact, I can create a sparse matrix from the TDM using the TDM's i, j and v entries for the i, j and x ones of the sparseMatrix function. Please help me if you know how to do the inverse, or in this case, how to construct a TDM by using the three columns in the data.frame above. Thanks!
You could try
library(tm)
library(reshape2)
txt <- readLines(n = 7)
book term frequency
--------------------
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20
df <- read.table(header=T, text=txt[-2])
dfwide <- dcast(data = df, book ~ term, value.var = "frequency", fill = 0)
mat <- as.matrix(dfwide[, -1])
dimnames(mat) <- setNames(dimnames(dfwide[-1]), names(df[, 1:2]))
(tdm <- as.TermDocumentMatrix(t(mat), weighting = weightTf))
# <<TermDocumentMatrix (terms: 4, documents: 2)>>
# Non-/sparse entries: 5/3
# Sparsity : 38%
# Maximal term length: 6
# Weighting : term frequency (tf)
as.matrix(tdm)
# Docs
# Terms 1 2
# animal 0 2
# car 0 3
# the 10 20
# zoo 2 0