Using tm in R, I made a DocumentTermMatrix (dtm). If I understand correctly, this matrix shows, for each document, how often each term occurs. When I inspect this matrix, I get:
Docs   can design door easy finish include light provide use water
176004   1      2   11    8      0       3     3       4   4     4
181288   1      2   11    8      0       2     3       4   4     4
182465   4      4    0    2      0       0    42      13   6     0
How can I now retrieve the vector for (for example) document 181288? I would like to get something like:
1 2 11 8 0 2 3 4 4 4 ………
Also, it says my dtm's sparsity is 100%. Does that mean it is (approximately) 100% empty?
You can retrieve your vector in several ways.
The simple way, not recommended except for a quick test:
my_doc <- inspect(dtm[dtm$dimnames$Docs == "181288",])
Doing it like this limits you to what inspect() does, and inspect() only shows a maximum of 10 documents.
A better way: create a selection vector if you want to, and filter the dtm with it. This keeps the sparse matrix format. You can then convert just the part you need into a data.frame for further manipulation.
my_selection <- c("181288", "182465")
# selection in case of dtm
my_dtm_selection <- dtm[dtm$dimnames$Docs %in% my_selection, ]
# selection in case of tdm
my_tdm_selection <- tdm[, tdm$dimnames$Docs %in% my_selection]
# create data.frame with document names as first column, followed by the terms
my_df_selection <- data.frame(docs = Docs(my_dtm_selection), as.matrix(my_dtm_selection))
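If you want a plain named vector for a single document rather than a data.frame, something like the following should work. This is a minimal sketch on a made-up toy corpus: the three example texts are invented, and the document ids are assigned by hand only so they match the ones in your question.

```r
library(tm)

# Toy corpus (made-up texts, for illustration only)
texts <- c("door light water", "door door design", "light water use")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Assumption: assign the ids from the question by hand so the example matches
dtm$dimnames$Docs <- c("176004", "181288", "182465")

# Filter the sparse dtm first, then convert only that one row to a dense matrix
one_doc <- dtm[dtm$dimnames$Docs == "181288", ]

# Take the single row as a named numeric vector (names are the terms)
doc_vec <- as.matrix(one_doc)[1, ]
print(doc_vec)
```

Converting only the filtered row keeps memory use low, because the full dtm is never expanded into a dense matrix.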
As for your second question: yes, it is almost empty. More precisely, sparsity is the share of cells that are zero, so nearly all cells in your dtm are zero. But you might still have more data than you think if you have a lot of documents and terms.
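To see the exact figure instead of the printed percentage, you can compute the sparsity yourself. The sketch below assumes the dtm stores only its non-zero entries in dtm$v (tm builds on the simple_triplet_matrix format from the slam package), and it uses an invented toy corpus. As far as I know, the printed value is rounded to a whole percent, which is why a very sparse matrix shows up as 100%.

```r
library(tm)

# Toy corpus (made-up texts, for illustration only)
texts <- c("door light water", "door door design", "light water use")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Number of stored (non-zero) entries versus total number of cells
n_nonzero <- length(dtm$v)
n_cells <- prod(dim(dtm))

# Sparsity is the fraction of cells that are zero
sparsity <- 1 - n_nonzero / n_cells
print(sparsity)

# The absolute count of non-zero cells can still be large for a big dtm
print(n_nonzero)
```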