I have a 50MB regex trie that I'm using to split phrases apart.
Here is the relevant code:
import io
import re
with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    regex = myfile.read()

while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
Since the regex is so large, this takes forever!
Here is the code I'm trying now, with re.compile(TempRegex):
import io
import re
with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    TempRegex = myfile.read()

regex = re.compile(TempRegex)

while True:
    Password = input("Enter a phrase to be split: ")
    Words = regex.findall(Password)
    print(Words)
What I'm trying to do is check whether an entered phrase is a combination of names. For example, the phrase "johnsmith123" should return ['john', 'smith', '123']. The regex file was created by a tool from a word list of every first and last name from Facebook. Essentially, I want to see if an entered phrase is a combination of words from that wordlist. If "johns" and "mith" are also names in the list, then I would want "johnsmith123" to return ['john', 'smith', '123', 'johns', 'mith'].
I don't think that regex is the way to go here. It seems to me that all you are really trying to do is find all of the substrings of a given string that happen to be names.
If the user's input is a password or passphrase, that implies a relatively short string. It's easy to break that string up into the set of possible substrings, and then test that set against another set containing the names.
The number of substrings in a string of length n is n(n+1)/2. Assuming that no one is going to enter more than, say, 40 characters, you are only looking at 820 substrings, many of which could be eliminated as being too short. Here is some code to do that:
def substrings(s, min_length=1):
    """Yield every substring of s that is at least min_length characters long."""
    for start in range(len(s)):
        for length in range(min_length, len(s) - start + 1):
            yield s[start:start+length]
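As a quick sanity check, here is what the generator yields for a short string with min_length left at 1:

>>> list(substrings('abc'))
['a', 'ab', 'abc', 'b', 'bc', 'c']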
So the remaining problem is loading the names into a suitable data structure. Your regex file is 50MB, but judging by the snippet that you showed in one of your comments, the amount of actual data is a lot smaller than that, due to the overhead of the regex syntax.
If you instead used a plain text file with one name per line, you could do this:

with open('names.txt') as f:
    names = set(word.strip().lower() for word in f)

def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s) - start + 1):
            yield s[start:start+length]

s = 'johnsmith123'
print(sorted(names.intersection(substrings(s))))
That might give output like:
['jo', 'john', 'johns', 'mi', 'smith']
I doubt that there will be memory issues given the likely small data set, but if you find that there's not enough memory to load the full data set at once, you could look at using the sqlite3 module with a simple table to store the names. This will be slower to query, but the names will live on disk rather than having to fit in memory.
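As a rough sketch of that approach, assuming the same names.txt file and the substrings generator from above (the names.db filename and table layout are just placeholders):

import sqlite3

# One-time setup: load the names into an on-disk table.
conn = sqlite3.connect('names.db')
conn.execute('CREATE TABLE IF NOT EXISTS names (name TEXT PRIMARY KEY)')
with open('names.txt') as f:
    conn.executemany('INSERT OR IGNORE INTO names VALUES (?)',
                     ((word.strip().lower(),) for word in f))
conn.commit()

# Look up each candidate substring; only the matches survive.
s = 'johnsmith123'
found = [sub for sub in set(substrings(s))
         if conn.execute('SELECT 1 FROM names WHERE name = ?', (sub,)).fetchone()]
print(sorted(found))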
Another way could be to use the shelve module to create a persistent dictionary with names as keys.
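Again as a minimal sketch, assuming names.txt and the substrings generator from earlier (the shelf filename is arbitrary):

import shelve

# One-time setup: store each name as a key in a disk-backed dictionary.
with shelve.open('names_shelf') as db:
    with open('names.txt') as f:
        for word in f:
            db[word.strip().lower()] = True

# Membership tests then read from disk instead of a large in-memory set.
with shelve.open('names_shelf') as db:
    s = 'johnsmith123'
    print(sorted(sub for sub in set(substrings(s)) if sub in db))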