python scikit-learn vectorization series

'Series' object has no attribute 'lower' tfidf


I tried to use TF-IDF to prepare my data, but I keep getting the same error.

X = df['Description'], df['Type']
y =df['Description'], df['Type']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state=42)


df['Description']=[" ".join(Description) for Description in df['Description'].values]

tfidf = TfidfVectorizer(stop_words='english')
t_x_train = tfidf.fit_transform(X_train)
t_x_test = tfidf.transform(y_test)

When I run it, I get: AttributeError: 'Series' object has no attribute 'lower'


Solution

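  • TfidfVectorizer lowercases each document by calling .lower() on it, so every element it iterates over must be a plain Python string. The error means that at least one element it received was a pandas Series instead. In the question, X = df['Description'], df['Type'] builds a tuple of two Series rather than a single text column, which is the most likely source of those Series elements.

    Please check:

    1. the datatypes using y_test.dtypes, or convert to string as shown below
    2. whether y_test should be replaced with X_test in the call to tfidf.transform, since the vectorizer should transform the test texts, not the labels
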
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    corpus = [
     ('This is the first document.',4),
     ('This document is the second document.',3),
     ('And this is the third one.',2),
     ('Is this the first document?',1)
    ]
    
    df = pd.DataFrame(corpus, columns=['Description', 'Type'])
    
    
    X = df['Description']
    # make sure your target is also a series of strings if not already
    y = df['Type'].astype('str')
    
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state=42)
    # df['Description']=[" ".join(Description) for Description in df['Description'].values]
    
    tfidf = TfidfVectorizer(stop_words='english')
    t_x_train = tfidf.fit_transform(X_train)
    # transform the held-out texts (X_test), not the labels (y_test)
    t_x_test = tfidf.transform(X_test)
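
To make the cause easier to see, here is a minimal sketch that reproduces the error and then avoids it. The sample texts and the names bad_input, good_input, error_message and matrix are made up for illustration and are not from the question. It assumes the error comes from handing the vectorizer a tuple of Series, as in X = df['Description'], df['Type'].

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data (hypothetical, not the asker's dataset)
df = pd.DataFrame({
    "Description": ["red apple pie", "green pear tart", "sweet apple tart"],
    "Type": [4, 3, 2],
})

tfidf = TfidfVectorizer(stop_words="english")

# A tuple of two Series: the vectorizer iterates over the tuple, so each
# "document" it sees is a whole Series, and calling .lower() on it fails.
bad_input = (df["Description"], df["Type"])
try:
    tfidf.fit_transform(bad_input)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# A single column of strings: each element is a str, so .lower() works.
good_input = df["Description"].astype(str)
matrix = tfidf.fit_transform(good_input)
print(matrix.shape[0])  # one row per document
```

If both columns are meant to be features, they would need to be combined into one string per row before vectorizing, rather than passed side by side.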