
How can I iterate on a column with spacy to get named entities?


I have a dataframe with a column named "categories". Some of the data in this column looks like this: {[], [], [amazon], [clothes], [telecommunication], [], ...}. Every row holds exactly one of these values. My task is to get the named entities for these values. I tried a lot, but it didn't go well. This was my first attempt:

import spacy
nlp = spacy.load("de_core_news_sm")
doc=list(nlp.pipe(df.categories))
print([(X.text, X.label_) for X in doc.ents])

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> 1 print([(X.text, X.label_) for X in doc.ents])
AttributeError: 'list' object has no attribute 'ents'

My second attempt:

for token in doc:
    print(token.doc, token.pos_, token.dep_)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
      1 for token in doc:
----> 2     print(token.doc, token.pos_, token.dep_)
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'pos_'

Third attempt:

docs = df["categories"].apply(nlp)
for token in docs:
    print(token.docs, token.pos_, token.dep_)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
      1 docs = df["categories"].apply(nlp)
      2 for token in docs:
----> 3     print(token.docs, token.pos_, token.dep_)
AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'docs'

I just want to run spaCy over this column so that each value gets its entity. For the empty values it should give me no entity. The column contains strings. Thanks for the help.


Solution

  • list(nlp.pipe(...)) gives you a list of many Doc objects, not a single Doc, so you need an extra for-loop to work with every doc separately.

    docs = list(nlp.pipe(df.categories))   # variable `docs` instead of `doc`
    
    for doc in docs:   
        print([(X.text, X.label_) for X in doc.ents])
    

    and, to iterate over the tokens of every doc:

    docs = list(nlp.pipe(df.categories))   # variable `docs` instead of `doc`
    
    for doc in docs:   
        for token in doc:
            print(token.text, token.pos_, token.dep_)
    

    The spaCy documentation on Language Processing Pipelines shows it like this:

    for doc in nlp.pipe(df.categories):   
        print([(X.text, X.label_) for X in doc.ents])
        for token in doc:
            print(token.text, token.pos_, token.dep_)
    

    The same problem occurs with apply(nlp): it returns a Series holding one Doc per row, so you need the same nested loop:

    docs = df["categories"].apply(nlp)
    
    for doc in docs:
        for token in doc:
            print(token.text, token.pos_, token.dep_)
    

    Full working example:

    import spacy
    import pandas as pd
    
    df = pd.DataFrame({
        'categories': ['amazon', 'clothes', 'telecommunication']
    })
    
    nlp = spacy.load("de_core_news_sm")
    
    print('\n--- version 1 ---\n')
    
    docs = list(nlp.pipe(df.categories))
    
    for doc in docs:
        print([(X.text, X.label_) for X in doc.ents])
        
        for token in doc:
            print(token.text, token.pos_, token.dep_)
    
    print('\n--- version 2 ---\n')
    
    docs = df["categories"].apply(nlp)
    
    for doc in docs:
        for token in doc:
            print(token.text, token.pos_, token.dep_)
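
The question also asks for the entities per row, with nothing for empty values. A minimal sketch of one way to do that is to collect the entities of every doc into a new column. To keep it runnable without downloading a model, this sketch assumes a blank German pipeline with a rule-based entity_ruler in place of de_core_news_sm; the pattern, its ORG label, and the "entities" column name are made up for illustration, and a trained model may label these words differently or not at all.

```python
import pandas as pd
import spacy

# Assumption: a blank pipeline plus entity_ruler stands in for the trained
# model; swap in spacy.load("de_core_news_sm") to use real NER predictions.
nlp = spacy.blank("de")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "amazon"},  # hypothetical pattern and label
])

# Empty strings model the rows that have no category.
df = pd.DataFrame({"categories": ["", "amazon", "clothes", ""]})

# nlp.pipe yields one Doc per row; build one list of (text, label) per Doc.
df["entities"] = [
    [(ent.text, ent.label_) for ent in doc.ents]
    for doc in nlp.pipe(df["categories"])
]

print(df)
```

Rows whose text is empty, or in which nothing is recognised, end up with an empty list, so every row still has a value in the new column.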