In R, I can extract the rows (documents) that contain a particular term, say 'toyota', by intersecting the column names of a document-term matrix (dtm) with the required term, like so:
dtm <- DocumentTermMatrix(mycorpus, control = list(tokenize = TrigramTokenizer))
x.df<-as.matrix(dtm[1:ncorpus, intersect(colnames(dtm), "toyota"),drop=FALSE])
The problem is that I can't find an equivalent method in Python's sklearn package, so I go about it in a roundabout way.
Minimal example:
rows_to_keep=tfidf_df[tfidf_df.toyota.notnull()].index
data=my_df.loc[rows_to_keep,:]
print(data.shape)
This works. The problem is: how do I use a loop variable in place of the hard-coded column name in this statement?
car_make=['toyota','ford','nissan','gmotor','honda','suzuki']
Then this loop:
for zentity in car_make:
    rows_to_keep=tfidf_df[tfidf_df.zentity.notnull()].index
does not work. It raises:
AttributeError: 'SparseDataFrame' object has no attribute 'zentity'
I deliberately chose the name zentity so that it would not clash with any column name in tfidf_df.
Is there a clean way to make the intersection and extract only the rows where the column is not null (NaN)? Any help will be appreciated.
Rather than
rows_to_keep=tfidf_df[tfidf_df.zentity.notnull()].index
you should index the column with brackets, something like
rows_to_keep=tfidf_df[tfidf_df[zentity].notnull()].index
Attribute access with a variable like zentity will always fail, even if the variable stores a string. Writing tfidf_df.zentity looks for an attribute literally named "zentity" on the DataFrame and never reads the value stored in the variable. Bracket indexing, tfidf_df[zentity], evaluates the variable first and then looks up the column whose name is that string.
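To make this concrete, here is a minimal sketch of the loop using bracket indexing. The small tfidf_df and my_df frames below are made-up stand-ins for yours (plain DataFrames with NaN for missing terms, not the SparseDataFrame from the question), and the names rows_by_make, present and any_make are just illustrative. It also skips makes that have no column, which would otherwise raise a KeyError, and shows one possible way to do the "intersection" in a single step.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tfidf_df: NaN marks a term absent from a document.
tfidf_df = pd.DataFrame(
    {
        "toyota": [0.5, np.nan, np.nan, 0.2],
        "ford": [np.nan, 0.7, np.nan, np.nan],
        "honda": [np.nan, np.nan, np.nan, 0.9],
    }
)
# Hypothetical stand-in for my_df; it shares the same index as tfidf_df.
my_df = pd.DataFrame({"text": ["doc0", "doc1", "doc2", "doc3"]})

car_make = ["toyota", "ford", "nissan", "gmotor", "honda", "suzuki"]

# Loop version: bracket indexing uses the *value* of zentity as the column name.
rows_by_make = {}
for zentity in car_make:
    if zentity not in tfidf_df.columns:
        continue  # no such term in the vocabulary; tfidf_df[zentity] would raise KeyError
    rows_to_keep = tfidf_df[tfidf_df[zentity].notnull()].index
    rows_by_make[zentity] = my_df.loc[rows_to_keep, :]

# If attribute-style access is really wanted, getattr takes the name as a string.
same = getattr(tfidf_df, "toyota").equals(tfidf_df["toyota"])

# One-step version: intersect the column names with the list of makes,
# then keep the rows where at least one of those columns is not null.
present = tfidf_df.columns.intersection(car_make)
any_make = my_df.loc[tfidf_df[present].notnull().any(axis=1)]

for make, frame in rows_by_make.items():
    print(make, frame.shape)
print(any_make.shape)
```

Bracket indexing is usually the safer habit anyway, since attribute access also breaks for column names that contain spaces or collide with existing DataFrame methods.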