Tags: python, nlp, spacy, chunks, pos-tagger

Filtering SpaCy noun_chunks by pos_tag


As the title says, I'm trying to filter the elements of noun_chunks by their individual POS tags. However, the items in noun_chunks do not seem to expose the POS tags of the tokens they contain.

To demonstrate the issue:


[i.pos_ for i in nlp("Great coffee at a place with a great view!").noun_chunks]
>>> 
AttributeError: 'spacy.tokens.span.Span' object has no attribute 'pos_'

Here is my inefficient solution:

def parse(text):
    doc = nlp(text.lower())
    tags = [(idx,i.text,i.pos_) for idx,i in enumerate(doc)]

    chunks = [i for i in doc.noun_chunks]

    indices = []
    for c in chunks:
        indices.extend(j for j in range(c.start_char,c.end_char))
    non_chunks = [w for w in ''.join([i for idx,i in enumerate(text) if idx not in indices]).split(' ') 
                  if w != '']

    # These are the POS tags I wanted to filter out from the beginning!
    chunk_words = [tup[1] for tup in tags
                   if tup[1] not in non_chunks and tup[2] not in ['DET','VERB','SYM','NUM']]

    new_chunks = []
    for c in chunks:
        new_words = [w for w in str(c).split(' ') if w in chunk_words]
        if len(new_words) > 1:
            new_chunk = ' '.join(new_words)
            new_chunks.append(new_chunk)
    return new_chunks

parse(
"""
I may be biased about Counter Coffee since I live in town, but this is a great place that makes a great cup of coffee. I have been coming here for about 2 years and wish I would have found it sooner. It is located right in the heart of Forest Park and there is a ton of street parking. The coffee here is great....many other words could describe it, but that sums it up perfectly. You can by coffee by the pound, order a hot drink, and they also have food. On the weekend, there are donuts brought in from Do-Rite Donuts which have almost a cult like following. The food is a little on the high end price wise, but totally worth it. I am a self admitted latte snob and they make an amazing latte here. You can add skim, whole, almond or oat milk and they will make it happen. I always order easy foam and they always make it perfectly. My girlfriend loves the Chai Latte with Oat Milk and I will admit it is pretty good. Give them a try.
""")

>>>
['counter coffee',
 'great place',
 'great cup',
 'forest park',
 'street parking',
 'many other words',
 'hot drink',
 'almost cult',
 'high end price',
 'latte snob',
 'amazing latte',
 'oat milk',
 'easy foam',
 'chai latte',
 'oat milk']

Any quicker approaches to the same solution would be welcomed!


Solution

  • This doesn't work:

    [i.pos_ for i in nlp("Great coffee at a place with a great view!").noun_chunks]
    

    because noun_chunks yields Span objects, not Token objects, and a Span has no pos_ attribute.

    You can get to the POS tags within each noun chunk by iterating over the tokens in each Span:

    import spacy

    nlp = spacy.load("en_core_web_md")
    # Each noun chunk is a Span; iterating over it yields its Token objects.
    for i in nlp("Great coffee at a place with a great view!").noun_chunks:
        print(i, [t.pos_ for t in i])
    

    which will give you

    Great coffee ['ADJ', 'NOUN'] 
    a place ['DET', 'NOUN'] 
    a great view ['DET', 'ADJ', 'NOUN']
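
To tie this back to the question, the same per-token iteration can do the filtering directly, without the character-index bookkeeping. The following is only a minimal sketch: the helper name `filter_chunks`, its parameters, and the `EXCLUDED_POS` constant are hypothetical names chosen for illustration, and the excluded tags are the ones listed in the question. Unlike the question's code, it does not lowercase the text first.

```python
# Hypothetical helper, not part of spaCy: it only relies on each chunk being
# iterable and on each item exposing `.text` and `.pos_`, as spaCy Tokens do.
EXCLUDED_POS = {"DET", "VERB", "SYM", "NUM"}  # tags the question wanted dropped


def filter_chunks(chunks, excluded=EXCLUDED_POS, min_tokens=2):
    """Drop tokens with an excluded POS tag from each chunk.

    `chunks` is any iterable of token sequences, e.g. `doc.noun_chunks`.
    Chunks left with fewer than `min_tokens` words are discarded, which
    mirrors the `len(new_words) > 1` check in the question.
    """
    kept = []
    for chunk in chunks:
        words = [tok.text for tok in chunk if tok.pos_ not in excluded]
        if len(words) >= min_tokens:
            kept.append(" ".join(words))
    return kept
```

With a loaded pipeline, calling `filter_chunks(nlp("Great coffee at a place with a great view!").noun_chunks)` should, given the tags shown above, return `['Great coffee', 'great view']`, since "a place" is reduced to a single word and dropped.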