python nlp tfidfvectorizer

Python: list object has no attribute 'lower' - but corpus is already in lower case


My corpus is a series of documents with Twitter data, and it has been cleaned and pre-processed to the best of my knowledge (even including emoji). Example below:

    0         [national, interest, think, worth, holding, ta...
    1         [must, accurate, diane, abbott, done, calculat...

I then instantiate TFIDF:

    # Instantiate vectoriser
    vect = TfidfVectorizer()

    # Fit
    vect = TfidfVectorizer(min_df=10, ngram_range = (1,3)).fit(text)

When I try to fit this, I get:

    AttributeError: 'list' object has no attribute 'lower'

But I've already converted everything to lower case. Is this something to do with the fact that it's a series?


Solution

  • TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.

    You are passing in a Series whose elements are lists rather than strings. Your situation can be replicated like this:

    from sklearn.feature_extraction.text import TfidfVectorizer
    import pandas as pd
    
    l1 = 'national, interest, think, worth, holding,'.split(',')
    l2 = 'must, accurate, diane, abbott, done'.split(',')
    
    df = pd.DataFrame([[l1],[l2]])
    
    text = df[0]
    

    which returns your text parameter as:

    0    [national,  interest,  think,  worth,  holding, ]
    1            [must,  accurate,  diane,  abbott,  done]
    Name: 0, dtype: object
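
    A minimal sketch of where the error comes from, assuming default settings: with lowercase=True the vectorizer calls .lower() on each document, so a document that is a list rather than a string raises the AttributeError from the question. The toy data is illustrative only, not the real tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy stand-in for the tokenized corpus (illustrative data)
text = pd.Series([["national", "interest", "think"], ["must", "accurate", "diane"]])

# Each element of the Series is a list, not a string
element_types = {type(doc).__name__ for doc in text}
print(element_types)

# Fitting fails because the default preprocessor calls doc.lower() on every document
try:
    TfidfVectorizer().fit(text)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```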
    

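    Passing this Series to TfidfVectorizer will not work, because it expects an iterable of strings, one string per document. By default it calls .lower() on each document, which is why a list raises the AttributeError even though the tokens themselves are already lower case. Although it is slightly counter-intuitive given that your text is already tokenized, you need to join each list back into a single string: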

    corpus = text.apply(lambda x: ','.join(x)).to_list() # converts your series into a list of strings
    
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(vectorizer.get_feature_names_out().tolist())
    
    ['abbott', 'accurate', 'diane', 'done', 'holding', 'interest', 'must', 'national', 'think', 'worth']