Hope all of you guys are healthy and well. I am new to the world of NLP and my question may sound stupid, so I apologize in advance.I would like to perform NLP on some text data which is labeled and run a text mining predictive model. I have four text columns that can be used as predictors and my labeled column is my class variable. Perhaps, the following can give you a glimpse of the data set
var1 var2 var3 var4 class_var
NA text text NA 0
text text NA text 1
text NA NA text 1
NA NA NA text 0
NA text text text 1
As shown, in some columns there are no texts ( I put NAs
) I have texts in other columns.
That being said, my question whether I should combine all text columns into one?
if so, what would be an appropriate method for dealing with this issue?
I truly appreciated your help guys.
Many thanks!
There are way too many options here but seeing as your data is already split into four columns, maybe you can first just replace the texts with a 1 if text is present or 0 for NA and see how well you can predict the class_var with a simple logistic regression as a start. From there, you could go into tokenizers etc.