How To Use TfidfVectorizer With PySpark


I am very new to PySpark and am having trouble with PySpark DataFrames.

I'm trying to implement TF-IDF. I have done it before with a pandas DataFrame, but now that I have switched to PySpark I can't select a column with dataframe['ColumnName'] and pass it to scikit-learn the way I did with pandas. When I run the code, it fails with "Column is not iterable". I have not been able to solve this yet. The current problem is shown below:

With Pandas:


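import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# vocabulary is a list of terms built earlier; pandasDF['name'] is a Series of strings
tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(pandasDF['name'])
tfidf_tran = tfidf.transform(pandasDF['name'])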

With PySpark:

tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
tfidf.fit(SparkDF['name'])
tfidf_tran = tfidf.transform(SparkDF['name'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19992/3734911517.py in <module>
     13 vocabulary = list(vocabulary)
     14 tfidf = TfidfVectorizer(vocabulary=vocabulary, dtype=np.float32)
---> 15 tfidf.fit(dataframe['name'])
     16 tfidf_tran = tfidf.transform(dataframe['name'])
     17 

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
   1821         self._check_params()
   1822         self._warn_for_unused_params()
-> 1823         X = super().fit_transform(raw_documents)
   1824         self._tfidf.fit(X)
   1825         return self

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1200         max_features = self.max_features
   1201 
-> 1202         vocabulary, X = self._count_vocab(raw_documents,
   1203                                           self.fixed_vocabulary_)
   1204 

E:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1110         values = _make_int_array()
   1111         indptr.append(0)
-> 1112         for doc in raw_documents:
   1113             feature_counter = {}
   1114             for feature in analyze(doc):

E:\Anaconda\lib\site-packages\pyspark\sql\column.py in __iter__(self)
    458 
    459     def __iter__(self):
--> 460         raise TypeError("Column is not iterable")
    461 
    462     # string methods

TypeError: Column is not iterable

Solution

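  • TF-IDF is the term frequency multiplied by the inverse document frequency. SparkDF['name'] is a Column expression, not an iterable of strings, which is why scikit-learn's TfidfVectorizer cannot loop over it. PySpark's DataFrame-based MLlib API (pyspark.ml.feature) has no single TF-IDF vectorizer, but it has two stages that get you there. HashingTF turns each tokenized document into a vector of term frequencies, and IDF is fitted on those vectors to learn the inverse document frequencies. The fitted IDF model's transform then scales the term-frequency vectors by the IDF weights, so the multiplication is done for you. The result is a TF-IDF matrix comparable to what the TfidfVectorizer in the question produces, though not numerically identical: HashingTF hashes terms into buckets instead of using the supplied vocabulary, and Spark's IDF does not apply scikit-learn's default L2 normalization.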