I'm analyzing a data frame of product reviews that contains some empty entries and some text written in a foreign language. The data also contain some customer attributes that can be used as "features" in later analysis.
To begin with, I will convert the reviews column into a DocumentTermMatrix, convert that to lda format, and then pass the documents and vocab objects generated by the lda process, along with selected columns from the original data frame, into stm's prepDocuments() function (see the sketch after the sample data below). That way I can leverage that package's more versatile estimation functions, using customer attributes as features to predict topic salience.
However, empty cells, punctuation, and foreign characters may be removed during pre-processing, creating character(0) rows in the lda documents object and leaving those reviews unable to match their corresponding rows in the original data frame. Ultimately, this will prevent me from generating the desired stm object from prepDocuments().
Methods to remove empty documents certainly exist (such as the methods recommended in this previous thread), but I am wondering whether there are also ways to remove the rows corresponding to the empty documents from the original data frame, so that the number of lda documents and the row dimension of the data frame used as meta in the stm functions stay aligned. Will indexing help?
Part of my data is listed below.
df = data.frame(reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy", "@#haha", "phone verizon contract phone buyer beware", "这东西太棒了",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"),
rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
"amazon", "zappos", "newegg", "amazon", "amazon",
"amazon", "amazon", "amazon", "zappos", "amazon",
"amazon", "newegg", "amazon", "amazon", "amazon"))
This is a situation where embracing tidy data principles offers a nice solution. To start with, "annotate" the data frame you presented with a new column, doc_id, that keeps track of which document each word belongs to, and then use unnest_tokens() to transform it into a tidy data structure.
library(tidyverse)
library(tidytext)
library(stm)
df <- tibble(reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy", "@#haha", "phone verizon contract phone buyer beware", "这东西太棒了",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"),
rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
"amazon", "zappos", "newegg", "amazon", "amazon",
"amazon", "amazon", "amazon", "zappos", "amazon",
"amazon", "newegg", "amazon", "amazon", "amazon"))
tidy_df <- df %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, reviews)
tidy_df
#> # A tibble: 154 x 4
#> rating source doc_id word
#> <dbl> <chr> <int> <chr>
#> 1 4 amazon 1 buenisimoooooo
#> 2 4 bestbuy 2 excelente
#> 3 4 amazon 3 excelent
#> 4 4 newegg 4 awesome
#> 5 4 newegg 4 phone
#> 6 4 newegg 4 awesome
#> 7 4 newegg 4 price
#> 8 4 newegg 4 almost
#> 9 4 newegg 4 month
#> 10 4 newegg 4 issue
#> # … with 144 more rows
Notice that all the information you had before is still there; it is just arranged in a different structure. You can fine-tune the tokenization process to fit your particular analysis needs, perhaps handling non-English text however you need, or keeping/removing punctuation, and so on. This is where empty documents get dropped, if that is appropriate for your analysis.
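For example, here is a minimal sketch of one possible extra filtering step (the regular expression and the stop word list are just illustrations, not requirements; assign the result back to tidy_df if you want the filtering to carry through the rest of the pipeline):
tidy_df %>%
  # drop tokens that are nothing but digits or punctuation (e.g. "1111111")
  filter(!str_detect(word, "^[0-9[:punct:]]+$")) %>%
  # drop common English stop words using tidytext's stop_words lexicon
  anti_join(stop_words, by = "word")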
Next, transform this tidy data structure into a sparse matrix, for use in topic modeling. The columns correspond to the words and the rows correspond to the documents.
sparse_reviews <- tidy_df %>%
  count(doc_id, word) %>%
  cast_sparse(doc_id, word, n)
colnames(sparse_reviews) %>% head()
#> [1] "buenisimoooooo" "excelente" "excelent" "almost"
#> [5] "awesome" "blu"
rownames(sparse_reviews) %>% head()
#> [1] "1" "2" "3" "4" "5" "8"
Next, from the tidy dataset you already have, make a data frame of covariate (i.e., meta) information to use in topic modeling.
covariates <- tidy_df %>%
  distinct(doc_id, rating, source)
covariates
#> # A tibble: 18 x 3
#> doc_id rating source
#> <int> <dbl> <chr>
#> 1 1 4 amazon
#> 2 2 4 bestbuy
#> 3 3 4 amazon
#> 4 4 4 newegg
#> 5 5 4 amazon
#> 6 8 4 newegg
#> 7 9 1 amazon
#> 8 10 4 amazon
#> 9 11 3 amazon
#> 10 12 1 amazon
#> 11 13 4 amazon
#> 12 14 3 zappos
#> 13 15 1 amazon
#> 14 16 2 amazon
#> 15 17 4 newegg
#> 16 18 4 amazon
#> 17 19 1 amazon
#> 18 20 1 amazon
Now you can put this together into stm(). For example, if you want to train a topic model with document-level covariates, looking at whether topics change a) with source and b) smoothly with rating, you would do something like this:
topic_model <- stm(sparse_reviews, K = 0, init.type = "Spectral",
                   prevalence = ~source + s(rating),
                   data = covariates,
                   verbose = FALSE)
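A couple of notes: with K = 0 and spectral initialization, stm chooses the number of topics for you. Once the model has finished fitting, you can explore it with stm's own helpers, for example (a quick sketch; see the stm documentation for details):
# top words for each topic
labelTopics(topic_model)

# how topic prevalence varies with source and rating
effects <- estimateEffect(~ source + s(rating), topic_model,
                          metadata = covariates)
summary(effects)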
Created on 2019-08-03 by the reprex package (v0.3.0)