I have my documents as:
doc1 = very good, very bad, you are great
doc2 = very bad, good restaurent, nice place to visit
I want to make my corpus separated with ,
so that my final DocumentTermMatrix
becomes:
terms
docs very good very bad you are great good restaurent nice place to visit
doc1 tf-idf tf-idf tf-idf 0 0
doc2 0 tf-idf 0 tf-idf tf-idf
I know, how to calculate DocumentTermMatrix
of individual words but don't know how to make the corpus separated for each phrase
in R. A solution in R
is preferred but solution in Python
is also welcomed.
What I have tried is:
> library(tm)
> library(RWeka)
> BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
> options(mc.cores=1)
> texts <- c("very good, very bad, you are great","very bad, good restaurent, nice place to visit")
> corpus <- Corpus(VectorSource(texts))
> a <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
> as.matrix(a)
I am getting:
Docs
Terms 1 2
bad good restaurent 0 1
bad you are 1 0
good restaurent nice 0 1
good very bad 1 0
nice place to 0 1
place to visit 0 1
restaurent nice place 0 1
very bad good 0 1
very bad you 1 0
very good very 1 0
you are great 1 0
What I want is not combination of words but only the phrases that I showed in my matrix.
Here's one approach using qdap
+ tm
packages:
library(qdap); library(tm); library(qdapTools)
dat <- list2df(list(doc1 = "very good, very bad, you are great",
doc2 = "very bad, good restaurent, nice place to visit"), "text", "docs")
x <- sub_holder(", ", dat$text)
m <- dtm(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs) )
weightTfIdf(m)
inspect(weightTfIdf(m))
## A document-term matrix (2 documents, 5 terms)
##
## Non-/sparse entries: 4/6
## Sparsity : 60%
## Maximal term length: 19
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
##
## Terms
## Docs good restaurent nice place to visit very bad very good you are great
## doc1 0.0000000 0.0000000 0 0.3333333 0.3333333
## doc2 0.3333333 0.3333333 0 0.0000000 0.0000000
You could also do one fell swoop and return a DocumentTermMatrix
but this may be harder to understand:
x <- sub_holder(", ", dat$text)
apply_as_tm(t(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs)),
weightTfIdf, to.qdap=FALSE)