Tags: search, memory, solr, nlp, named-entity-recognition

List-based Named Entity Recognition for search engine: how to scale?


I'm working on a search engine for documents stored in Solr.

In the user query, I want to detect Named Entities (persons, organizations, cities...).

An example query is:

barack obama wife age

In this query, I want to detect that "barack obama" is a person.

Since queries are not well-formed sentences, it is difficult for classic NER tools (spaCy, Stanford NER...) to work properly on them. So I adopted this approach:

  • store all entities found in the documents in a dictionary (sorted by decreasing length)
  • loop over the dictionary to see if the user query contains any of the entities

    import re

    def find_entities(query, entities_dict):
        entities = []
        new_query = query.lower()

        for entity in entities_dict:
            # find_substring checks whether `entity` occurs in `new_query`
            # as a whole word (helper not shown)
            if find_substring(entity, new_query):
                entities.append({entity: entities_dict[entity]})
                # remove the matched entity so that shorter entities contained
                # in it cannot match again
                new_query = re.sub(r'\b{}\b'.format(re.escape(entity)), '', new_query)
        return new_query, entities
    

At the moment, I have about 200k entities in my Solr index: building the dictionary takes a few minutes, but once it is loaded this approach works well; it is fast and not very memory-hungry.

In the near future, I will have 50-100 million entities.

I think that it will be impossible to store these entities in memory.

How should I change my approach? I'm looking for advice on the algorithm, memory management, and the data structures to use.


Solution

  • One obvious brute-force solution is simply to distribute the search index: create e.g. 100 nodes with a dictionary of 1 million entities on each one, run the lookup on all of them in parallel, and merge the results (see the sketch right below).
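
    A minimal sketch of that fan-out-and-merge idea, assuming whole-word regex lookup per shard; the shard contents, the entity types, and the thread pool standing in for calls to separate nodes are all illustrative:

    import re
    from concurrent.futures import ThreadPoolExecutor

    def find_entities_in_shard(query, shard):
        """Whole-word lookup of one shard's entities in the query (one 'node')."""
        query = query.lower()
        return {entity: etype for entity, etype in shard.items()
                if re.search(r'\b{}\b'.format(re.escape(entity)), query)}

    def find_entities_distributed(query, shards):
        """Send the query to every shard in parallel and merge the partial results."""
        merged = {}
        with ThreadPoolExecutor() as pool:
            for partial in pool.map(lambda shard: find_entities_in_shard(query, shard), shards):
                merged.update(partial)
        return merged

    # toy shards standing in for e.g. 100 nodes with ~1M entities each
    shards = [
        {'barack obama': 'person', 'angela merkel': 'person'},
        {'new york': 'city', 'microsoft': 'organization'},
    ]
    print(find_entities_distributed('barack obama wife age', shards))
    # {'barack obama': 'person'}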

    Another solution (which may be complementary to splitting the index) is to keep your entities not in a simple list, but in a prefix tree (aka trie) or in an Aho-Corasick automaton. These data structures dramatically speed up substring search, because they match all the entities against the query in a single pass, exploiting the fact that many entities share common substrings.
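
    For the trie variant, here is a minimal token-level sketch; the helper names and the '$' end-of-entity marker are my own illustrative choices, not part of any library:

    def build_trie(entities):
        root = {}
        for entity, etype in entities.items():
            node = root
            for token in entity.lower().split():
                node = node.setdefault(token, {})
            node['$'] = etype  # '$' marks the end of a complete entity
        return root

    def find_longest_matches(query, root):
        tokens = query.lower().split()
        matches, i = [], 0
        while i < len(tokens):
            node, j, last = root, i, None
            # walk the trie as far as the query tokens allow, remembering
            # the longest complete entity seen so far
            while j < len(tokens) and tokens[j] in node:
                node = node[tokens[j]]
                j += 1
                if '$' in node:
                    last = (j, node['$'])
            if last:
                end, etype = last
                matches.append((' '.join(tokens[i:end]), etype))
                i = end  # skip past the matched entity
            else:
                i += 1
        return matches

    entities = {'barack obama': 'person', 'obama': 'person', 'new york': 'city'}
    trie = build_trie(entities)
    print(find_longest_matches('barack obama wife age', trie))
    # [('barack obama', 'person')]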

    In fact, I have used pyahocorasick to look up a few million entities (movies, songs, actors, etc.) in short queries, and it seemed to scale very well. Formally, the time complexity of Aho-Corasick does not depend on the total number of entities, only on the number of entities matched in the concrete query. Therefore, if the search gets slow (which is unlikely), it makes sense to check which entities generate lots of false-positive matches and remove them from the index (a small pruning sketch follows the timing comparison at the end). In my case, after removing very common entities such as "It" (it's a movie name!), the matcher sped up even more.

    Here is an example. First, we get the entities to search for (15K cities):

    pip install pyahocorasick
    wget https://simplemaps.com/static/data/world-cities/basic/simplemaps_worldcities_basicv1.6.zip
    unzip simplemaps_worldcities_basicv1.6.zip
    

    Then we create the automaton that can match entities (cities):

    import pandas as pd
    import re
    import ahocorasick
    cities = pd.read_csv('worldcities.csv')
    
    def preprocess(text):
        """
        Add underscores instead of non-alphanumeric characters 
        and on the word boundaries in order to tell words from word substrings.
        """
        return '_{}_'.format(re.sub('[^a-z0-9]', '_', text.lower()))
    
    index = ahocorasick.Automaton()
    for city in cities.city:
        index.add_word(preprocess(city), city)
    index.make_automaton()
    # this object can be pickled to disk and then loaded back
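
    For example, persisting the automaton with pickle and loading it back might look like this (the file name is an arbitrary choice):

    import pickle

    # save the built automaton so it does not have to be rebuilt on every start
    with open('cities_automaton.pkl', 'wb') as f:
        pickle.dump(index, f)

    # ...and load it back later instead of re-adding all entities
    with open('cities_automaton.pkl', 'rb') as f:
        index = pickle.load(f)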
    

    And now we actually apply this index to look up entities in the text:

    def find_cities(text, searcher):
        result = dict()
        for end_index, city_name in searcher.iter(preprocess(text)):
            # end_index is the position of the match's trailing underscore in the
            # preprocessed text; correcting for the prepended underscore yields a
            # half-open (start, end) span in the original text
            end = end_index - 1
            start = end - len(city_name)
            result[(start, end)] = city_name
        return result

    print(find_cities('Tver’ is somewhere between Moscow and Saint Petersburg', index))
    # {(0, 5): 'Tver’', (27, 33): 'Moscow', (38, 54): 'Saint Petersburg', (44, 54): 'Petersburg'}
    # the search takes about 0.1 ms
    

    Naive search gives the same results but takes about 10 ms:

    # `text` is the same example sentence as above
    for city in cities.city:
        idx = text.find(city)
        if idx >= 0:
            print(idx, city)
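
    As mentioned above, if certain entities fire on almost every query, it can pay off to prune them before rebuilding the automaton. A rough sketch of one way to spot them, reusing preprocess and the searcher from above (the sample-query list and the threshold are illustrative assumptions):

    from collections import Counter

    def find_noisy_entities(sample_queries, searcher, max_rate=0.05):
        """Return entities that match an unusually large share of sample queries."""
        hits = Counter()
        for query in sample_queries:
            for _, name in searcher.iter(preprocess(query)):
                hits[name] += 1
        return {name for name, count in hits.items()
                if count / len(sample_queries) > max_rate}

    # entities returned here would simply be excluded when rebuilding the automaton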
    

    Here is the notebook with my code.