r  text-mining  corpus  term-document-matrix

Document Term Matrix will not maintain decimal places of numbers


Before I updated my version of RStudio, everything worked great. With the update, something has changed with DocumentTermMatrix in the 'tm' package. I want to create a document-term matrix whose terms are numbers, decimal places included. For instance, if I have a .csv with one column as shown below:

x
1.01
11.21
123.35
212.11

I want the column names in my term matrix to look like this:

1.01 11.21 123.35 212.11
1    0     0      0
0    1     0      0
0    0     1      0
0    0     0      1

But instead it looks like this:

123 212
0   0
0   0
1   0
0   1

Here's the code that used to work:

corpus = Corpus(VectorSource(x))
dtm = DocumentTermMatrix(corpus)
dtm_df = as.data.frame(as.matrix(dtm))

Thanks in advance


Solution

  • From the 'tm' package maintainer Ingo Feinerer:

    Here's the code that used to work:

    corpus = Corpus(VectorSource(x))

    Try VCorpus() instead of Corpus().

    dtm = DocumentTermMatrix(corpus)
    dtm_df = as.data.frame(as.matrix(dtm))

    That is highly inefficient (since as.matrix() generates a dense representation from the sparse term-document matrix).

    Best regards, Ingo
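
Putting both suggestions together, a minimal sketch might look like the following. It assumes the values from the question are available as a character vector (a numeric column read from a .csv would need as.character() first), and the use of the Matrix package for a sparse result is my own assumption, not something the maintainer recommended by name.

```r
library(tm)
library(Matrix)  # assumption: used here only to hold the result sparsely

# Example values from the question, stored as character strings
x <- c("1.01", "11.21", "123.35", "212.11")

# VCorpus() instead of Corpus(), as suggested above
corpus <- VCorpus(VectorSource(x))
dtm <- DocumentTermMatrix(corpus)

# Check which terms were kept, without building a dense matrix
Terms(dtm)
inspect(dtm)

# A DocumentTermMatrix stores its entries as triplets (i, j, v),
# so it can be copied into a sparse matrix without going dense
dtm_sparse <- sparseMatrix(
  i = dtm$i, j = dtm$j, x = dtm$v,
  dims = dim(dtm), dimnames = dimnames(dtm)
)
```

If a data frame is really needed downstream, as.data.frame(as.matrix(dtm)) still works on small inputs like this one; the cost the maintainer mentions only becomes a concern once the number of documents and terms grows.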