python pandas spacy text-classification

Use spaCy with Pandas


I'm trying to build a multi-class text classifier using spaCy. I have built the model, but I'm facing a problem applying it to my full dataset. The model I have built so far is in the screenshot:

Screenshot

Below is the code I used to apply the model to my full dataset using Pandas:


Messages = pd.read_csv('Messages.csv', encoding='cp1252')
    
Messages['Body'] = Messages['Body'].astype(str)

Messages['NLP_Result'] = nlp(Messages['Body'])._.cats

But it gives me the error:

ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'pandas.core.series.Series'>

The reason I want to use Pandas here is that the dataset has 2 columns: ID and Body. I want to apply the NLP model only to the Body column, but I want the final dataset to have 3 columns: ID, Body and the NLP result, like in the screenshot above.

Thanks so much

I tried the Pandas apply method too, but had no luck. The code I used:

Messages['NLP_Result'] = Messages['Body'].apply(nlp)._.cats

The error I got: AttributeError: 'Series' object has no attribute '_'

The expected result is a dataset with the 3 columns described above.


Solution

  • Pass a callable to Series.apply, so that nlp runs on each value and ._.cats is read from each result rather than from the Series:

    Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
    

    Here, each value in the Body column is passed to the lambda as x, one row at a time.

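    nlp(x) creates a spaCy Doc for that single string, and the Doc carries the properties you want to access. nlp(x)._.cats then returns the expected value for that row, and apply collects these values into the new NLP_Result column. The complete script then looks like this: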

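    import spacy
    import classy_classification  # importing this registers the pipeline component added below
    import pandas as pd
    
    # Training examples: one text file per category, one example per line
    with open('Deliveries.txt', 'r') as d:
        Deliveries = d.read().splitlines()
    with open("Not Spam.txt", "r") as n:
        Not_Spam = n.read().splitlines()
    
    # Map each category label to its list of example texts
    data = {}
    data["Deliveries"] = Deliveries
    data["Not_Spam"] = Not_Spam
    
    # NLP model
    nlp = spacy.blank("en")
    nlp.add_pipe("text_categorizer",
        config={
            "data": data,
            "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
            "device": "gpu"
        }
    )
    
    # Load the full dataset (as in the question) and make sure Body holds strings
    Messages = pd.read_csv('Messages.csv', encoding='cp1252')
    Messages['Body'] = Messages['Body'].astype(str)
    
    # Classify each message body, one row at a time
    Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)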
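To try the apply pattern without a trained model or the CSV file, here is a minimal sketch that swaps the real pipeline for a stand-in. The function fake_nlp, its keyword rule, the scores, and the two sample rows are all hypothetical and exist only for illustration; the only part taken from the answer is the `.apply(lambda x: ...)` call.

```python
import pandas as pd
from types import SimpleNamespace


def fake_nlp(text):
    """Hypothetical stand-in for the spaCy pipeline.

    Returns an object exposing ._.cats, mimicking how the answer reads
    the category scores from a Doc. The scores are made up.
    """
    if "parcel" in text.lower():
        cats = {"Deliveries": 0.9, "Not_Spam": 0.1}
    else:
        cats = {"Deliveries": 0.1, "Not_Spam": 0.9}
    return SimpleNamespace(_=SimpleNamespace(cats=cats))


# Small made-up dataset with the same two columns as in the question
Messages = pd.DataFrame({
    "ID": [1, 2],
    "Body": ["Your parcel arrives today", "Lunch at noon?"],
})

# apply calls the lambda once per row, so ._.cats is read from each
# per-row result and never from the Series itself
Messages["NLP_Result"] = Messages["Body"].apply(lambda x: fake_nlp(x)._.cats)

print(Messages)
```

The resulting frame keeps ID and Body and gains NLP_Result, with one dict per row. Replacing fake_nlp with the real nlp object should give the same shape, assuming the pipeline sets ._.cats on each Doc as it does in the answer.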