Subsetting a matrix, addressing colnames


I have a document-term matrix from the R package tm, which I have coerced with as.matrix. MWE here:

> inspect(dtm[1:ncorpus, intersect(colnames(dtm), thai_list)])
<<DocumentTermMatrix (documents: 15, terms: 4)>>
Non-/sparse entries: 17/43
Sparsity           : 72%
Maximal term length: 12
Weighting          : term frequency (tf)

    Terms
Docs toyota_suv gmotors_suv ford_suv nissan_suv
  1           0           1        0          0
  2           0           1        0          0
  3           0           1        0          0
  4           0           2        0          0
  5           0           4        0          0
  6           1           1        0          0
  7           1           1        0          0
  8           0           1        0          0
  9           0           1        0          0
  10          0           1        0          0

I need to subset this as.matrix(dtm) so that I keep only the documents (rows) that refer to toyota_suv and to no other vehicle. I can get a subset for one term (toyota_suv) using dmat <- as.matrix(dtm[1:ncorpus, intersect(colnames(dtm), "toyota_suv")]), which works well. How do I set up the query: documents where toyota_suv is non-zero but all non-toyota_suv columns are zero? I could specify ==0 column by column, but this matrix is generated dynamically: some markets have four cars, others have ten, so I cannot specify the colnames beforehand. How do I (dynamically) require all the non-toyota_suv columns to be zero, something like all_others==0? Any help will be much appreciated.


Solution

  • You can accomplish this by getting the index position of the toyota_suv column, requiring that column to be non-zero, and using negative indexing with the same index variable to require that all other columns are zero.

    Here I modified your dtm slightly so that the two cases where toyota_suv is non-zero meet the criteria you are looking for (since none in your example met them):

    dtm <- read.table(textConnection("
    toyota_suv gmotors_suv ford_suv nissan_suv
          0       1       0            0
          0       1       0            0
          0       1       0            0
          0       2       0            0
          0       4       0            0
          1       0       0            0
          1       0       0            0
          0       1       0            0
          0       1       0            0
          0       1       0            0"), header = TRUE)
    

    Then it works:

    # get the index of the toyota_suv column
    index_toyota_suv <- which(colnames(dtm) == "toyota_suv")
    
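    # select only cases where toyota_suv is non-zero and others are zero;
    # drop = FALSE keeps the result two-dimensional even if only one other column remains
    dtm[dtm[, "toyota_suv"] > 0 & rowSums(dtm[, -index_toyota_suv, drop = FALSE]) == 0, ]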
    ##   toyota_suv gmotors_suv ford_suv nissan_suv
    ## 6          1           0        0          0
    ## 7          1           0        0          0
    

    Note: This is not really a text analysis question at all, but rather one about how to subset matrix objects.
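
Since the question works with as.matrix(dtm) and a column set that changes per market, the same idea can be sketched as a small helper that selects the "other" columns by name. This is only a sketch under assumptions: the function name only_term and the toy counts are made up for illustration, and it counts non-zero cells rather than summing them, which gives the same answer for non-negative term frequencies.

```r
# Hypothetical helper (not part of tm): keep rows where `term` is non-zero
# and every other column is zero, whatever the other columns are called.
only_term <- function(m, term) {
  stopifnot(is.matrix(m), term %in% colnames(m))
  others <- setdiff(colnames(m), term)
  # drop = FALSE keeps a one-column (or zero-column) selection as a matrix,
  # so rowSums() still works when a market has very few terms
  n_other_nonzero <- rowSums(m[, others, drop = FALSE] != 0)
  m[m[, term] > 0 & n_other_nonzero == 0, , drop = FALSE]
}

# Toy counts in the spirit of the modified example above (assumed data)
m <- matrix(
  c(0, 1, 0, 0,
    0, 2, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 0,
    1, 1, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(as.character(1:5),
                  c("toyota_suv", "gmotors_suv", "ford_suv", "nissan_suv"))
)

result <- only_term(m, "toyota_suv")
print(result)
```

With the real data this would be called as only_term(as.matrix(dtm), "toyota_suv"); selecting by name with setdiff() avoids having to know how many other vehicle columns a given market produces.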