python-3.x  scikit-learn  tf-idf  sklearn-pandas

sklearn tfidfvectorizer: how to intersect a tfidf frame on a column?


In R, I can extract the rows (documents) that contain a particular term, say 'toyota', by intersecting a document-term matrix (dtm) with the required column name, like so:

dtm <- DocumentTermMatrix(mycorpus, control = list(tokenize = TrigramTokenizer))
x.df<-as.matrix(dtm[1:ncorpus, intersect(colnames(dtm), "toyota"),drop=FALSE])

The problem is that I can't find an equivalent method in Python's sklearn package, so I go about it in a roundabout way:

  1. First I get the index values of the rows where the relevant column ("toyota") in the tfidf frame is not null; the column names are the feature names.
  2. Then I slice the main pandas dataframe on the identified row indices.
  3. Now I have a dataframe where each row contains "toyota".

Minimal example:

rows_to_keep=tfidf_df[tfidf_df.toyota.notnull()].index
data=my_df.loc[rows_to_keep,:]
print(data.shape)

This works. The problem is: how do I run this statement in a loop over a list of terms?

car_make=['toyota','ford','nissan','gmotor','honda','suzuki']

Then the following loop does not work:

for zentity in car_make:
    rows_to_keep=tfidf_df[tfidf_df.zentity.notnull()].index

It fails with:

AttributeError: 'SparseDataFrame' object has no attribute 'zentity'

I deliberately chose the name zentity so that it would not clash with any column name in the tfidf frame.

Is there a clean way to make the intersection and extract only the rows where the column is not null (NaN)? Any help will be appreciated.


Solution

  • Rather than rows_to_keep=tfidf_df[tfidf_df.zentity.notnull()].index

    you should use rows_to_keep=tfidf_df[tfidf_df[zentity].notnull()].index

    Attribute access takes the name after the dot literally: tfidf_df.zentity looks for a column (or attribute) that is itself called "zentity" and never consults the variable zentity, which is why it raises the AttributeError. Bracket indexing evaluates the expression inside the brackets, so tfidf_df[zentity] looks up whichever string the variable currently holds. This is how attribute access works for any Python object; it is not specific to DataFrames.
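As a rough illustration of the loop with bracket indexing, here is a minimal sketch. The two small frames are made-up stand-ins for my_df and tfidf_df from the question, built as ordinary DataFrames with NaN marking an absent term (an assumption that mirrors the notnull() test above); the membership check and the rows_by_make dictionary are illustrative additions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's frames:
# my_df holds the documents, tfidf_df has one column per term,
# with NaN wherever a term does not occur in a document.
my_df = pd.DataFrame(
    {"text": ["toyota corolla", "ford focus", "toyota and ford", "bmw"]}
)
tfidf_df = pd.DataFrame(
    {
        "toyota": [0.7, np.nan, 0.5, np.nan],
        "ford": [np.nan, 0.8, 0.5, np.nan],
        "bmw": [np.nan, np.nan, np.nan, 1.0],
    }
)

car_make = ["toyota", "ford", "nissan"]

rows_by_make = {}  # hypothetical container: term -> matching rows of my_df
for zentity in car_make:
    # Skip terms that have no column, similar in spirit to
    # R's intersect(colnames(dtm), term).
    if zentity not in tfidf_df.columns:
        continue
    # Bracket indexing uses the string stored in zentity as the column label.
    rows_to_keep = tfidf_df[tfidf_df[zentity].notnull()].index
    rows_by_make[zentity] = my_df.loc[rows_to_keep, :]

for make, data in rows_by_make.items():
    print(make, data.shape)
```

If the tfidf frame stores absent terms as 0 rather than NaN, notnull() would keep every row, and a comparison such as tfidf_df[zentity] != 0 would be needed instead.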