Tags: r, nlp, text-mining, data-cleaning, tm

Dealing with several text columns in a labeled data set while running NLP in R


I hope you are all healthy and well. I am new to NLP, so I apologize in advance if this question is basic. I would like to run a text mining predictive model on labeled text data. I have four text columns that can be used as predictors, and a labeled column that is my class variable. The following gives a glimpse of the data set:

 var1    var2  var3    var4      class_var
  NA     text  text     NA          0
  text   text   NA     text         1
  text    NA    NA     text         1
  NA      NA    NA     text         0
  NA     text  text    text         1  

As shown, some cells contain no text (I marked these as NA), while other columns in the same row do contain text. My question is whether I should combine all the text columns into one. If so, what would be an appropriate method for handling this?

I truly appreciate your help.

Many thanks!


Solution

  • There are many options here, but since your data is already split into four columns, you could start by replacing each text cell with 1 if text is present and 0 if it is NA, then see how well a simple logistic regression predicts class_var from those indicators. From there, you could move on to tokenizers and other text features.
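
The presence/absence baseline described above could be sketched roughly as follows in base R. The data frame `df` is a made-up stand-in that mirrors the layout in the question, and the names `flags`, `has_*`, and `all_text` are hypothetical, chosen only for illustration. The last step, pasting the non-missing columns into one string, is just one possible way to combine the columns before tokenizing, not the only one.

```r
# Toy data mirroring the table in the question; the "text" values are placeholders.
df <- data.frame(
  var1 = c(NA, "text", "text", NA, NA),
  var2 = c("text", "text", NA, NA, "text"),
  var3 = c("text", NA, NA, NA, "text"),
  var4 = c(NA, "text", "text", "text", "text"),
  class_var = c(0, 1, 1, 0, 1),
  stringsAsFactors = FALSE
)

text_cols <- c("var1", "var2", "var3", "var4")

# Indicator per column: 1 if the cell holds text, 0 if it is NA.
flags <- as.data.frame(lapply(df[text_cols], function(x) as.integer(!is.na(x))))
names(flags) <- paste0("has_", text_cols)
flags$class_var <- df$class_var

# Baseline logistic regression on the indicators.
# With only five toy rows the fit is degenerate and R will emit warnings;
# on a real data set this gives a first impression of how much signal
# the mere presence of text carries.
fit <- glm(class_var ~ ., data = flags, family = binomial)

# One possible way to combine the columns (an assumption, not the only option):
# paste the non-missing cells of each row into a single string.
df$all_text <- apply(df[text_cols], 1, function(r) paste(r[!is.na(r)], collapse = " "))
```

If the indicators alone already predict `class_var` reasonably well, they can be kept as extra features alongside whatever is later derived from the tokenized `all_text` column.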