I have a document-term matrix from the R package tm, which I have coerced to a plain matrix with as.matrix. MWE here:
> inspect(dtm[1:ncorpus, intersect(colnames(dtm), thai_list)])
<<DocumentTermMatrix (documents: 15, terms: 4)>>
Non-/sparse entries: 17/43
Sparsity : 72%
Maximal term length: 12
Weighting : term frequency (tf)
Terms
Docs toyota_suv gmotors_suv ford_suv nissan_suv
1 0 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 2 0 0
5 0 4 0 0
6 1 1 0 0
7 1 1 0 0
8 0 1 0 0
9 0 1 0 0
10 0 1 0 0
I need to subset as.matrix(dtm) so that I get only the documents (rows) that refer to toyota_suv but to no other vehicle. I can get a subset for one term (toyota_suv) using
dmat <- as.matrix(dtm[1:ncorpus, intersect(colnames(dtm), "toyota_suv")])
which works well. How do I set up a query for documents where toyota_suv is non-zero but the values in all non-toyota_suv columns are zero? I could test each column individually with == 0, but this matrix is generated dynamically: in some markets there may be four cars, in others ten, so I cannot specify the column names beforehand. How do I (dynamically) require all the non-toyota_suv columns to be zero, something like all_others == 0?
Any help will be much appreciated.
You can accomplish this by getting the index position of the toyota_suv column, then subsetting dtm to the rows where that column is non-zero and where all other columns, selected by negative indexing on the same index variable, are zero.
Here I modified your dtm slightly so that the two rows where toyota_suv is non-zero meet the criteria you are looking for (none of the rows in your example did):
dtm <- read.table(textConnection("
toyota_suv gmotors_suv ford_suv nissan_suv
0 1 0 0
0 1 0 0
0 1 0 0
0 2 0 0
0 4 0 0
1 0 0 0
1 0 0 0
0 1 0 0
0 1 0 0
0 1 0 0"), header = TRUE)
Then it works:
# get the index of the toyota_suv column
index_toyota_suv <- which(colnames(dtm) == "toyota_suv")
# select only cases where toyota_suv is non-zero and others are zero
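# drop = FALSE keeps the "other" columns two-dimensional even if only one remains
dtm[dtm[, "toyota_suv"] > 0 & rowSums(dtm[, -index_toyota_suv, drop = FALSE]) == 0, ]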
## toyota_suv gmotors_suv ford_suv nissan_suv
## 6 1 0 0 0
## 7 1 0 0 0
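The same idea can be sketched directly against a plain numeric matrix, which is what as.matrix(dtm) returns. This is only a minimal sketch: dmat below is a hand-built stand-in for the coerced matrix (using the modified values above), and only_term() is a hypothetical helper name, not something provided by tm. It also assumes the entries are non-negative counts, so a row sum of zero means every other column is zero.

```r
# Hypothetical stand-in for as.matrix(dtm): documents in rows, terms in columns.
dmat <- matrix(
  c(0, 1, 0, 0,
    0, 1, 0, 0,
    0, 1, 0, 0,
    0, 2, 0, 0,
    0, 4, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 0,
    0, 1, 0, 0,
    0, 1, 0, 0,
    0, 1, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(as.character(1:10),
                  c("toyota_suv", "gmotors_suv", "ford_suv", "nissan_suv"))
)

# only_term() is an illustrative helper (an assumption, not part of tm).
# It keeps rows where `term` is non-zero and every other column is zero.
only_term <- function(m, term) {
  # All column names except the target, computed at run time
  others <- setdiff(colnames(m), term)
  # Assumes non-negative counts, so rowSums == 0 means all other columns are zero.
  # drop = FALSE keeps a matrix even when only one other column exists.
  keep <- m[, term] > 0 & rowSums(m[, others, drop = FALSE]) == 0
  m[keep, , drop = FALSE]
}

result <- only_term(dmat, "toyota_suv")
```

Because the other columns are derived with setdiff() rather than named explicitly, the same call should work whether a market has four vehicle columns or ten.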
Note: This is not really a text analysis question, but rather one about how to subset matrix objects.