python scikit-learn tfidfvectorizer

Why does TfidfVectorizer.fit_transform() change the number of samples and labels for my text data?


I have a data set with 3 columns and 310 rows. The columns are all text. One column is the text a user entered into an inquiry form, and a second column holds the label (one of six) that says which inquiry category the input falls into.

>>> data.shape
(310, 3)

I preprocess the data as follows before running it through the KMeans algorithm from sklearn.cluster:

v = TfidfVectorizer()
vectorized = v.fit_transform(data)

Now,

>>> vectorized.shape
(3,4)

From where I'm looking, I seem to have lost data: I no longer have my 310 samples. I believe the shape of vectorized refers to [n_samples, n_features].

Why do the numbers of samples and features change? I would expect the number of samples to be 310 and the number of features to be 6 (the number of unique groupings in my labeled data).


Solution

  • The problem is that TfidfVectorizer() cannot be applied to a three-column DataFrame in a single call; it expects one iterable of documents.

    According to the documentation:

    fit_transform(self, raw_documents, y=None)

    Learn vocabulary and idf, return term-document matrix.

    This is equivalent to fit followed by transform, but more efficiently implemented.

    Parameters: raw_documents : iterable
    an iterable which yields either str, unicode or file objects

    Returns: X : sparse matrix, [n_samples, n_features]
    Tf-idf-weighted document-term matrix.

    Hence, it should be applied to a single column of text data only. Iterating over a DataFrame yields its column names, not its rows, so your code treated the three column names as three documents and built the transform from them.

    An example that shows what is happening:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    data = pd.DataFrame({'col1':['this is first sentence','this one is the second sentence'],
                        'col2':['this is first sentence','this one is the second sentence'],
                        'col3':['this is first sentence','this one is the second sentence'] })
    vec = TfidfVectorizer()
    vec.fit_transform(data).todense()
    
    # 
    # matrix([[1., 0., 0.],
    #         [0., 1., 0.],
    #         [0., 0., 1.]])
    
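    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
    vec.get_feature_names_out().tolist()
    
    # ['col1', 'col2', 'col3']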
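
The reason the column names end up as the "documents" can be checked with pandas alone. A minimal sketch, reusing the toy DataFrame from the example above: iterating a DataFrame walks its column labels, and that iteration is all the vectorizer sees.

```python
import pandas as pd

data = pd.DataFrame({'col1': ['this is first sentence', 'this one is the second sentence'],
                     'col2': ['this is first sentence', 'this one is the second sentence'],
                     'col3': ['this is first sentence', 'this one is the second sentence']})

# Iterating over a DataFrame yields column labels, not rows.
docs_seen = list(data)
print(docs_seen)
# ['col1', 'col2', 'col3']

# Iterating over a single column yields the actual text values.
texts = list(data['col1'])
print(len(texts))
# 2
```

So the number of rows in the vectorized output equals the number of columns in the DataFrame, which matches the 3 rows seen in the question.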
    

    The solution is either to join all three columns into a single text column, or to apply a separate vectorizer to each column and then stack the resulting matrices side by side.

    Approach 1

    data.loc[:,'full_text'] = data.apply(lambda x: ' '.join(x), axis=1)
    vec = TfidfVectorizer()
    X = vec.fit_transform(data['full_text']).todense()
    print(X.shape)
    # (2, 7)
    
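    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
    print(vec.get_feature_names_out().tolist())
    # ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']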
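
For the data set in the question, the label column probably should not be vectorized at all. The sketch below is an assumption about how that might look: the column names `inquiry` and `label` and the four sample rows are hypothetical stand-ins for the real 310-row data. It also shows that n_features is the vocabulary size, not the number of labels; the number of categories is passed to KMeans as n_clusters instead.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real data; column names are assumed.
data = pd.DataFrame({
    "inquiry": [
        "how do I reset my password",
        "I forgot my password and cannot log in",
        "where is my refund",
        "I was charged twice and want a refund",
    ],
    "label": ["account", "account", "billing", "billing"],
})

vec = TfidfVectorizer()
X = vec.fit_transform(data["inquiry"])  # pass the text column only

# One row per sample, one column per vocabulary term.
n_samples, n_features = X.shape
print(n_samples == len(data), n_features == len(vec.vocabulary_))

# The number of categories goes to KMeans, not to the vectorizer.
n_clusters = data["label"].nunique()
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
clusters = km.fit_predict(X)
```

With the real data, X would have 310 rows and n_clusters would be 6.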
    

    Approach 2

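    from scipy.sparse import hstack
    
    vec = {}  # one fitted vectorizer per column
    X = []    # one sparse tf-idf matrix per column
    for col in ['col1', 'col2', 'col3']:
        vec[col] = TfidfVectorizer()
        # collect the matrices in a plain list; hstack joins them column-wise
        X.append(vec[col].fit_transform(data[col]))
    
    stacked_X = hstack(X).todense()
    stacked_X.shape
    # (2, 21)
    
    for col, v in vec.items():
        print(col)
        # get_feature_names() was removed in scikit-learn 1.2
        print(v.get_feature_names_out().tolist())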
    
    # col1
    # ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
    # col2
    # ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
    # col3
    # ['first', 'is', 'one', 'second', 'sentence', 'the', 'this']
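
As an alternative to the manual loop in Approach 2, scikit-learn's ColumnTransformer can fit one vectorizer per column and stack the results in one step. This is a sketch of a swapped-in technique, not part of the original answer, and it reuses the same toy DataFrame. Each column is given as a plain string rather than a list, so that the vectorizer receives a 1-D sequence of documents.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

data = pd.DataFrame({'col1': ['this is first sentence', 'this one is the second sentence'],
                     'col2': ['this is first sentence', 'this one is the second sentence'],
                     'col3': ['this is first sentence', 'this one is the second sentence']})

text_cols = ['col1', 'col2', 'col3']

# One (name, transformer, column) triple per text column.
# A scalar column name passes a 1-D Series to TfidfVectorizer.
ct = ColumnTransformer([(col, TfidfVectorizer(), col) for col in text_cols])

stacked = ct.fit_transform(data)
print(stacked.shape)
# (2, 21)
```

The fitted per-column vectorizers remain available through `ct.named_transformers_`, for example `ct.named_transformers_['col1']`.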