Argument 'string' has incorrect type (expected str, got list) Spacy NLP


I want to calculate cosine similarity, but after converting the DataFrame column to a list I get this error: Argument 'string' has incorrect type (expected str, got list).

import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")

df = [['24, Single, Consultant, Canada, I am interested in visiting Isreal again'],
      ['18, Single, Student, I want to go back Costa Rica again'],
      ['45,Married, Unemployed, I want to take my family to Florida for the summer vacation']]
df = pd.DataFrame(df, columns=['Free Text'])
df["N_Application"] = range(0, len(df))

# convert the DataFrame column to a list
data = df['Free Text'].tolist()
df_spacy = nlp(data)  # the error is raised here

I would appreciate help fixing this. Thank you.


Solution

  • nlp() expects a single string, so passing it a whole list raises this error. To run a function on every element of a pd.Series, use .apply(). Calls to .apply() can also be chained.

    Example:

    # use a flat list of strings instead of a nested list
    l = ['24, Single, Consultant, Canada, I am interested in visiting Isreal again', 
         '18, Single, Student, I want to go back Costa Rica again', 
         '45,Married, Unemployed, I want to take my family to Florida for the summer vacation']
    df = pd.DataFrame(l, columns=['Free Text'])
    
    # remove stop words and punctuation for later similarity calculations
    df_spacy = df['Free Text'].apply(nlp)\
                              .apply(lambda doc: nlp(' '.join(str(t) 
                                                     for t in doc 
                                                     if not t.is_stop 
                                                     and not t.is_punct)))
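
As an alternative to .apply(), spaCy's nlp.pipe() accepts an iterable of strings and yields one Doc per input, which avoids passing a list to nlp() directly. The following is a minimal sketch, not the only way to do it. It assumes a blank English pipeline (spacy.blank("en"), tokenizer only) in place of en_core_web_sm so that it runs without a model download, and the sample texts are shortened placeholders:

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for
# en_core_web_sm so the sketch runs without downloading a model.
nlp = spacy.blank("en")

# Placeholder texts; in the question these would come from
# df['Free Text'].tolist().
data = [
    "I am interested in visiting Canada again",
    "I want to go back to Costa Rica again",
]

# nlp() takes one string per call; nlp.pipe() takes an iterable of
# strings and yields one Doc per input, in the same order.
docs = list(nlp.pipe(data))

# Equivalent result with an explicit loop (usually slower on large inputs).
docs_loop = [nlp(text) for text in data]
```

With the DataFrame from the question, the list passed to nlp.pipe() would be df['Free Text'].tolist().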
    

    Edit: per your comment, here is a similarity calculation between each row and all other rows:

    df_spacy.apply(lambda row: df_spacy\
            .apply(lambda doc: row.similarity(doc) if row != doc else None))
    

    Resulting similarity matrix:

              0         1         2
    0       NaN  0.776098  0.716560
    1  0.776098       NaN  0.705024
    2  0.716560  0.705024       NaN
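
By default, Doc.similarity() returns the cosine similarity of the two documents' vectors. The calculation itself can be sketched in plain Python. The vectors below are made-up placeholders rather than spaCy output, and cosine_similarity is a helper name chosen here for illustration:

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; return 0.0 instead of dividing by zero.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional document vectors (not produced by spaCy).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]  # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Small models such as en_core_web_sm do not ship with word vectors, and spaCy warns that similarity scores computed with them may be less reliable. A model that includes vectors, such as en_core_web_md, is likely to give more meaningful values.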