Search code examples
pythonregexpython-3.xsearchtrie

Finding names inside a string


I'm trying to check an inputted string for names. The data I have available is every first and last name from Facebook.

What I want my program to do is take the input "johnsmith123" (for example) and return ['john', 'smith', '123']. If 'johns' and 'mith' are names in the list, I would want it to return ['john', 'smith', '123', 'johns', 'mith']. Basically: every possible combination from words in the list that can make up the entered phrase.

I know that regex tries are really, really fast for lookups. Using a tool called RegexFormat 7, I turned the wordlist into a 50mb regex trie.

Here is the code I am now trying to run, using that trie:

import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
        TempRegex = myfile.read()

regex = re.compile(TempRegex)

while True == True:
    Password = input("Enter a phrase to be split: ")

    Words = re.findall(regex, Password)

    print(Words)

The program never reaches the input part. I am assuming it will take very long to compile such a large regex trie.

What I need to know is if there is some way to do this compilation process once, save the regular expression object to my disk, and just load the pre-compiled object to be used into the module instead of having to compile every time?

It is the compiling that is taking up so much time. I know that the search would actually happen quite quickly. If I can do the compilation process once, I can just run the compile overnight ...

If this is not feasible, what else can I do? The data I have available is a 100mb word list of every first and last name from Facebook, and a regex trie derived from that wordlist


Solution

  • I'm skeptical that a single massive regular expression is the best solution here. A single hash table of all possible first names might be faster.

    all_first_names = set(['dan', 'bob', 'danny'])
    
    username = 'dannysmith123'
    
    # Get "extra" parts of the username if they exist
    m = re.match(r'^([a-zA-Z]+)(.*)$', username)
    name, extra = m.group(1), m.group(2)
    
    # Get a list of possible first/last name splits
    # [('d', 'annysmith'), ('da', 'nnysmith'), ...]
    name_splits = [(name[:i], name[i:]) for i in range(1, len(name)+1)]
    
    # Check each one of these splits to see if the first name
    # is present in the master first name list, if so, add it to
    # the list of valid matches.
    match_list = []
    for ns in name_splits:
        if ns[0] in all_first_names:
            match_list.extend(ns)
            if extra:
                match_list.append(extra)
                extra = None