python, nlp, spacy, named-entity-recognition

NER - how to check if a common noun indicates a place (subcategorization)


I am looking for a way to tell whether a common noun in a sentence refers to a place. This is easy for proper nouns, but I haven't found a straightforward solution for common nouns.

For example, given the sentence "After a violent and brutal attack, a group of college students travel into the countryside to find refuge from the town they fled, but soon discover that the small village is also home to a coven of serial killers", I would like to mark the following nouns as referring to places: countryside, town, small village, home.

Here is the code I'm using:

import spacy
nlp = spacy.load('en_core_web_lg')

# Process whole documents
text = ("After a violent and brutal attack, a group of college students travel into the countryside to find refuge from the town they fled, but soon discover that the small village is also home to a coven of serial killers")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

This gives the following output:

Noun phrases: ['a violent and brutal attack', 'a group', 'college students', 'the countryside', 'refuge', 'the town', 'they', 'the small village', 'a coven', 'serial killers']
Verbs: ['travel', 'find', 'flee', 'discover']

Solution

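  • You can use WordNet (through NLTK) for this: treat a noun as a place if the WordNet synset for "location" is one of its hypernyms.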

    from nltk.corpus import wordnet as wn
    
    loc = wn.synsets("location")[0]
    
    def is_location(candidate):
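        for ss in wn.synsets(candidate):
            # only consider synsets named after the candidate itself,
            # e.g. "town.n.01" for "town", not synsets it is merely a lemma of
            name = ss.name().split(".", 1)[0]
            if name != candidate:
                continue
            # if the lowest common hypernym of the two is "location" itself,
            # then "location" is an ancestor of this synset
            hit = loc.lowest_common_hypernyms(ss)
            if hit and hit[0] == loc:
                return True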
        return False
    
    # true things
    for word in ("countryside", "town", "village", "home"):
        print(is_location(word), word, sep="\t")
    
    # false things
    for word in ("cat", "dog", "fish", "cabbage", "knife"):
        print(is_location(word), word, sep="\t")
    
    

    Note that WordNet synsets are sometimes wonky (a word can have senses you don't expect), so be sure to double-check the results on your own data.

    Also, for things like "small village", you'll have to pull out the head noun first. In English noun phrases it is usually the last word, and spaCy exposes it as chunk.root.
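
To tie the head-noun step back to the noun chunks from the question, here is a minimal sketch. The helper name location_phrases and the stand-in predicate toy_is_location are hypothetical and not part of spaCy or NLTK; the sketch assumes you hand it (phrase, head noun) pairs, which with spaCy could be built from doc.noun_chunks as shown in the docstring. The pairs below are typed in by hand so the example runs without any model installed.

```python
def location_phrases(chunks, is_location):
    """Keep the noun phrases whose head noun passes `is_location`.

    `chunks` is an iterable of (phrase_text, head_noun) pairs. With spaCy they
    could be built as: [(c.text, c.root.lemma_) for c in doc.noun_chunks]
    """
    return [text for text, head in chunks if is_location(head.lower())]


# Pairs written out by hand for three of the noun chunks in the question.
chunks = [
    ("the small village", "village"),
    ("a coven", "coven"),
    ("the town", "town"),
]


# Stand-in predicate (an assumption for illustration) so the sketch runs
# without WordNet; in practice pass the WordNet-based is_location instead.
def toy_is_location(word):
    return word in {"countryside", "town", "village", "home"}


places = location_phrases(chunks, toy_is_location)
print(places)  # ['the small village', 'the town']
```

Passing the predicate in as an argument keeps the filtering separate from the lookup, so you can swap the toy set for the WordNet check, or for any other word list, without touching the chunk handling.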