Extracting full names with ne_chunk


Newbie here. I'm trying to extract full names of people and organisations using the following code.

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(' '.join([token for token, pos in i.leaves()]))
            if current_chunk:
                named_entity = ' '.join(current_chunk)
                if named_entity not in continuous_chunk:
                    continuous_chunk.append(named_entity)
                    current_chunk = []
                else:
                    continue
                return continuous_chunk

            
>>> my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."
>>> get_continuous_chunks(my_sent)
['Toni']

As you can see, it returns only the first proper noun: not the full name, and none of the other proper nouns in the string.

What am I doing wrong?


Solution

  • Here is some working code.

    The best way to find the problem is to step through your code with print statements in several places. Below, I print the type() and the value of each item the loop iterates over; seeing them listed makes it easier to reason about the loops and conditionals. In your version, the return sits inside the for loop, so the function exits as soon as it has handled the first Tree, which is why you only get ['Toni'].

    Also, I inadvertently named the variables "contiguous" instead of "continuous", though "contiguous" is arguably the more accurate word.

    Code:

    from nltk import ne_chunk, pos_tag, word_tokenize
    from nltk.tree import Tree
    
    
    def get_continuous_chunks(text):
        chunked = ne_chunk(pos_tag(word_tokenize(text)))
        contiguous_chunk = []   # entity strings in the current run of adjacent Trees
        contiguous_chunks = []  # completed runs, joined into full names
    
        for i in chunked:
            print(f"{type(i)}: {i}")
            if isinstance(i, Tree):
                # "Toni" and "Morrison" arrive as two separate Trees,
                # but "Random House" and "New York City" are single Trees.
                contiguous_chunk.append(' '.join(token for token, pos in i.leaves()))
            elif contiguous_chunk:
                # A non-entity token ends the run; store it and start a new one.
                contiguous_chunks.append(' '.join(contiguous_chunk))
                contiguous_chunk = []
    
        # Flush a run that reaches the end of the text with no trailing token.
        if contiguous_chunk:
            contiguous_chunks.append(' '.join(contiguous_chunk))
    
        return contiguous_chunks
    
    my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."
    
    
    print()
    contig_chunks = get_continuous_chunks(my_sent)
    print(f"INPUT: My sentence: '{my_sent}'")
    print(f"ANSWER: My contiguous chunks: {contig_chunks}")
    

    Execution:

    (venv) [ttucker@zim stackoverflow]$ python contig.py 
    
    <class 'nltk.tree.Tree'>: (PERSON Toni/NNP)
    <class 'nltk.tree.Tree'>: (PERSON Morrison/NNP)
    <class 'tuple'>: ('was', 'VBD')
    <class 'tuple'>: ('the', 'DT')
    <class 'tuple'>: ('first', 'JJ')
    <class 'tuple'>: ('black', 'JJ')
    <class 'tuple'>: ('female', 'NN')
    <class 'tuple'>: ('editor', 'NN')
    <class 'tuple'>: ('in', 'IN')
    <class 'tuple'>: ('fiction', 'NN')
    <class 'tuple'>: ('at', 'IN')
    <class 'nltk.tree.Tree'>: (ORGANIZATION Random/NNP House/NNP)
    <class 'tuple'>: ('in', 'IN')
    <class 'nltk.tree.Tree'>: (GPE New/NNP York/NNP City/NNP)
    <class 'tuple'>: ('.', '.')
    INPUT: My sentence: 'Toni Morrison was the first black female editor in fiction at Random House in New York City.'
    ANSWER: My contiguous chunks: ['Toni Morrison', 'Random House', 'New York City']
    

    I am a little unclear on exactly what you were looking for, but from the description this seems to be it.
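Since the question asks specifically for people and organisations, here is a rough sketch of a variant that keeps each entity's label so the result can be filtered. It is not part of the answer above, and the names `labelled_chunks` and `StubTree` are made up for illustration. `StubTree` is a hypothetical stand-in that only mimics the `label()` and `leaves()` methods of `nltk.tree.Tree`, so the merging logic can be tried without downloading the NLTK models; the sample input is built by hand from the printed output shown earlier. The function should accept the result of `ne_chunk(...)` directly, though that is untested here.

```python
from itertools import groupby


class StubTree:
    """Hypothetical stand-in for nltk.tree.Tree, exposing only label() and leaves()."""

    def __init__(self, label, leaves):
        self._label = label
        self._leaves = leaves

    def label(self):
        return self._label

    def leaves(self):
        return list(self._leaves)


def labelled_chunks(chunked):
    """Merge adjacent entities that share a label into (label, text) pairs."""

    def entity_label(item):
        # Plain (token, pos) tuples have no label() method, so they map to None.
        return item.label() if hasattr(item, "label") else None

    results = []
    for label, run in groupby(chunked, key=entity_label):
        if label is None:
            continue  # a run of ordinary tokens separates entities
        words = [token for item in run for token, pos in item.leaves()]
        results.append((label, " ".join(words)))
    return results


# Hand-built input (assumption: copied from the printed output, not produced by NLTK here).
sample = [
    StubTree("PERSON", [("Toni", "NNP")]),
    StubTree("PERSON", [("Morrison", "NNP")]),
    ("was", "VBD"),
    ("at", "IN"),
    StubTree("ORGANIZATION", [("Random", "NNP"), ("House", "NNP")]),
    ("in", "IN"),
    StubTree("GPE", [("New", "NNP"), ("York", "NNP"), ("City", "NNP")]),
]

chunks = labelled_chunks(sample)
people_and_orgs = [text for label, text in chunks if label in {"PERSON", "ORGANIZATION"}]
print(chunks)
print(people_and_orgs)
```

One design difference to be aware of: this version only merges neighbouring entities when their labels match, so a PERSON immediately followed by an ORGANIZATION stays as two entries, whereas the code above joins any adjacent entities into one string.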