python nlp nltk spacy

How can I convert entities (a list) to a dictionary? My attempted code is commented out and does not work (NLP problem)


How can I convert entities (a list) to a dictionary? My attempt is commented out below and does not work. Alternatively, how can I build entities as a dictionary in the first place? I want a dictionary so that I can find the 5 most frequently named people in the first 500 sentences.

! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]

import spacy

# 'en' is a spaCy 2.x shortcut link; spaCy 3 needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
entities = [[(entity.text, entity.label_) for entity in nlp(sentence).ents] for sentence in documents[:50]]
entities


#I TRIED THIS BUT IS WRONG
#def Convert(lst): 
#    res_dct = {lst[i]: lst[i + 1] for i in range(0, len(lst), 2)} 
#    return res_dct
#print(Convert(ent)) 


Solution

  • The list stored in the variable entities has type list[list[tuple[str, str]]]: one inner list per sentence, where the first entry of each tuple is the entity text and the second is the entity type, e.g.:

    >>> from pprint import pprint
    >>> pprint(entities)
    [[],
     [('Ishmael', 'GPE')],
     [('Some years ago', 'DATE')],
     [],
     [('November', 'DATE')],
     [],
     [('Cato', 'ORG')],
     [],
     [],
     [('Manhattoes', 'ORG'), ('Indian', 'NORP')],
     [],
     [('a few hours', 'TIME')],
    ...
    

    You can then flatten the nested list and build a dict that maps each entity type to the entities of that type:

    >>> sum(filter(None, entities), [])
    [('Ishmael', 'GPE'), ('Some years ago', 'DATE'), ('November', 'DATE'), ('Cato', 'ORG'), ('Manhattoes', 'ORG'), ('Indian', 'NORP'), ('a few hours', 'TIME'), ('Sabbath afternoon', 'TIME'), ('Corlears Hook to Coenties Slip', 'WORK_OF_ART'), ('Whitehall', 'PERSON'), ('thousands upon thousands', 'CARDINAL'), ('China', 'GPE'), ('week days', 'DATE'), ('ten', 'CARDINAL'), ('American', 'NORP'), ('June', 'DATE'), ('one', 'CARDINAL'), ('Niagara', 'ORG'), ('thousand miles', 'QUANTITY'), ('Tennessee', 'GPE'), ('two', 'CARDINAL'), ('Rockaway Beach', 'GPE'), ('first', 'ORDINAL'), ('first', 'ORDINAL'), ('Persians', 'NORP')]
    >>> from collections import defaultdict
    >>> type2entities = defaultdict(list)
    >>> for entity, entity_type in sum(filter(None, entities), []):
    ...   type2entities[entity_type].append(entity)
    ...
    >>> from pprint import pprint
    >>> pprint(type2entities)
    defaultdict(<class 'list'>,
                {'CARDINAL': ['thousands upon thousands', 'ten', 'one', 'two'],
                 'DATE': ['Some years ago', 'November', 'week days', 'June'],
                 'GPE': ['Ishmael', 'China', 'Tennessee', 'Rockaway Beach'],
                 'NORP': ['Indian', 'American', 'Persians'],
                 'ORDINAL': ['first', 'first'],
                 'ORG': ['Cato', 'Manhattoes', 'Niagara'],
                 'PERSON': ['Whitehall'],
                 'QUANTITY': ['thousand miles'],
                 'TIME': ['a few hours', 'Sabbath afternoon'],
                 'WORK_OF_ART': ['Corlears Hook to Coenties Slip']})
    

    The dict stored in type2entities groups the entities by type, which is the dictionary form you asked for. To get the five most frequently mentioned people in the first 500 lines, together with their mention counts, count only the entities labeled PERSON:

    >>> from collections import Counter
    >>> entities = [[(entity.text, entity.label_) for entity in nlp(sentence).ents] for sentence in documents[:500]]
    >>> person_cnt = Counter()
    >>> for entity, entity_type in sum(filter(None, entities), []):
    ...   if entity_type == 'PERSON':
    ...     person_cnt[entity] += 1
    ...
    >>> person_cnt.most_common(5)
    [('Queequeg', 17), ('don', 4), ('Nantucket', 2), ('Jonah', 2), ('Sal', 2)]
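
A possible variation on the counting step, as a rough sketch: sum(filter(None, entities), []) builds a new list on every addition, which can get slow for long texts, so itertools.chain.from_iterable may be a better fit for flattening. The helper name top_entities and the sample data below are made up for illustration; they stand in for the real entities list so the snippet runs without spaCy.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample data in the same shape as `entities` above:
# one list of (text, label) tuples per sentence.
sample_entities = [
    [],
    [("Queequeg", "PERSON")],
    [("Nantucket", "GPE"), ("Queequeg", "PERSON")],
    [("Jonah", "PERSON"), ("first", "ORDINAL")],
]


def top_entities(nested, label, n=5):
    """Return the n most common entity texts that carry the given label."""
    # chain.from_iterable flattens one level lazily; empty inner lists
    # contribute nothing, so no filter(None, ...) step is needed.
    flat = chain.from_iterable(nested)
    counts = Counter(text for text, entity_label in flat if entity_label == label)
    return counts.most_common(n)


print(top_entities(sample_entities, "PERSON"))
```

With the real data, top_entities(entities, 'PERSON') should give the same kind of result as person_cnt.most_common(5) above, assuming entities was built from documents[:500].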