nlp, spacy, lemmatization

spacy lemmatization of nouns and noun chunks


I am trying to create a corpus of documents that consists of lemmatized nouns and noun chunks. I am using this code:

import spacy
nlp = spacy.load('en_core_web_sm')

def lemmatizer(doc, allowed_postags=['NOUN']):                                                     
    doc = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
    doc = u' '.join(doc)
    return nlp.make_doc(doc)


nlp.add_pipe(nlp.create_pipe('merge_noun_chunks'), after='ner')
nlp.add_pipe(lemmatizer, name='lemm', after='merge_noun_chunks')

doc_list = []                                                                                      
for doc in data:                                                                                    
    pr = nlp(doc)
    doc_list.append(pr) 

   

For the sentence 'the euro area has advanced a long way as a monetary union', this pipeline identifies the noun chunks ['the euro area', 'advanced', 'long', 'way', 'a monetary union'] and, after lemmatization, produces ['euro', 'area', 'way', 'monetary', 'union']. Is there a way to keep the words of the identified noun chunks together, so that the output looks like ['the euro area', 'way', 'a monetary union'] or ['the_euro_area', 'way', 'a_monetary_union']?

Thanks for your help!


Solution

  • I don't think your problem is about lemmatization. This method works for your example.

    # merge noun phrases and entities into single tokens
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def merge_noun_phrase(doc):
        spans = list(doc.ents) + list(doc.noun_chunks)
        spans = spacy.util.filter_spans(spans)

        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
        return doc

    sentence = "the euro area has advanced a long way as a monetary union"
    doc = nlp(sentence)
    doc2 = merge_noun_phrase(doc)
    for token in doc2:
        print(token)
        # merged noun chunks such as 'the euro area' and 'a monetary union'
        # now print as single tokens
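    If you want the underscore-joined form from your question, you can post-process the merged tokens. This is just a sketch on top of doc2 above; the POS filter and the replace call are my additions, not part of the merge method (a merged token should take the POS of its chunk's root):

    # keep noun-like tokens and underscore-join the merged multi-word ones
    tokens = [
        token.text.replace(" ", "_")
        for token in doc2
        if token.pos_ in ("NOUN", "PROPN")   # POS filter is my addition
    ]
    print(tokens)
    # something like ['the_euro_area', 'a_long_way', 'a_monetary_union']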
    

    Note that I'm using spaCy 2.3.5; spacy.util.filter_spans may have changed or been deprecated in newer versions. If you run into that, this answer should help you. :)

    Module 'spacy.util' has no attribute 'filter_spans'

    And if you still want to lemmatize noun chunks, you can do it as follows:

    doc = nlp("the euro area has advanced a long way as a monetary union")
    for chunk in doc.noun_chunks:
        print(chunk.lemma_)
        # prints the lemma of each noun chunk, e.g. 'the euro area', 'a monetary union'
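    And if you want the underscore form here as well, you can join the words of each chunk lemma yourself (just a small sketch, not a spaCy built-in):

    for chunk in doc.noun_chunks:
        print("_".join(chunk.lemma_.split()))
        # e.g. 'the_euro_area', 'a_monetary_union'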
    

    According to the answer in What is the lemma for 'two pets', "looking at the lemma on the span level is probably not very useful and it makes more sense to work on the token level."
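    To tie this back to your original pipeline, here is a minimal sketch of how your custom component could look in spaCy 2.x, combining the built-in merge_noun_chunks pipe with your POS filter. The component name noun_lemmatizer, the underscore joining, and keeping PROPN as well as NOUN are my assumptions; also note that 'a long way' is itself a noun chunk, so it comes out merged rather than as the bare 'way' from your example:

    import spacy

    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(nlp.create_pipe('merge_noun_chunks'), after='ner')

    def noun_lemmatizer(doc, allowed_postags=('NOUN', 'PROPN')):
        # merged noun chunks are single tokens whose text still contains spaces;
        # join those with underscores and lemmatize single-word nouns as before
        words = []
        for token in doc:
            if token.pos_ not in allowed_postags:
                continue
            if ' ' in token.text:
                words.append(token.text.replace(' ', '_'))
            else:
                words.append(token.lemma_)
        return nlp.make_doc(' '.join(words))

    nlp.add_pipe(noun_lemmatizer, name='lemm', after='merge_noun_chunks')

    doc = nlp('the euro area has advanced a long way as a monetary union')
    print([token.text for token in doc])
    # something like ['the_euro_area', 'a_long_way', 'a_monetary_union']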