How to tokenize strings based on a word list


In Python 3.6, I want to convert variable names to business-friendly names based on a list of known words.

My list of known words looks like this, where the first element on each line is the known word and the second is its friendly name:

Acct,Account
Account,Account
Num,Number
Number,Number
Payee,Payee
Pymt,Payment
Type,Type

And my variables look like this:

ACCOUNTNUM
ACCT_NUM
ACCTNUM
PAYEETYPE
PAYEE_TYP
PYMT_DT

I want the output for the above variables list to be like this:

Account Number
Account Number
Account Number
Payee Type
Payee Typ
Payment Dt

How can I do this? There are about 10,000 variable names to convert and about 400,000 known words. Both lists are available in files.


Solution

  • You can create a translation mapping from the known words, then use re.split to split each variable name on the known words, replace each match with its mapped word, and collapse repeated spaces with another regex substitution:

    import re
    known_words = '''Acct,Account
    Account,Account
    Num,Number
    Number,Number
    Payee,Payee
    Pymt,Payment
    Type,Type'''
    variables = '''ACCOUNTNUM
    ACCT_NUM
    ACCTNUM
    PAYEETYPE
    PAYEE_TYP
    PYMT_DT'''
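    # Map each upper-cased known word to its friendly name.
    m = {k.upper(): v for k, v in (line.split(',') for line in known_words.splitlines())}
    # Try longer words first; otherwise "Num" would match before "Number" can.
    pattern = re.compile('(%s)' % '|'.join(map(re.escape, sorted(m, key=len, reverse=True))), re.IGNORECASE)

    def friendly(name):
        # The capture group keeps the matched known words in the split result.
        tokens = (m.get(t.upper(), t).replace('_', '').title() for t in pattern.split(name) if t)
        return re.sub(' +', ' ', ' '.join(tokens)).strip()

    print([friendly(v) for v in variables.splitlines()])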
    

    This outputs:

    ['Account Number', 'Account Number', 'Account Number', 'Payee Type', 'Payee Typ', 'Payment Dt']
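
With around 400,000 known words, one large regex alternation may be slow to build and to match. A possible alternative, sketched below, is a greedy longest-match scan that relies only on dictionary lookups, so the work per variable name depends on the name's length and the longest known word rather than on the size of the word list. The names `load_known_words` and `friendly_name` are hypothetical, the sample data stands in for the real files, and the sketch assumes underscores act purely as separators.

```python
import csv
import io


def load_known_words(lines):
    """Build {UPPER-CASED WORD: friendly name} from 'word,friendly' lines."""
    return {row[0].strip().upper(): row[1].strip()
            for row in csv.reader(lines) if len(row) == 2}


def friendly_name(variable, mapping, max_len):
    """Tokenize by always taking the longest known word at each position."""
    name = variable.upper()
    words, unknown, i = [], [], 0

    def flush_unknown():
        # Emit accumulated unmatched characters as one title-cased token.
        if unknown:
            words.append(''.join(unknown).title())
            unknown.clear()

    while i < len(name):
        # Longest candidate first, so "NUMBER" wins over "NUM".
        for length in range(min(max_len, len(name) - i), 0, -1):
            friendly = mapping.get(name[i:i + length])
            if friendly is not None:
                flush_unknown()
                words.append(friendly)
                i += length
                break
        else:
            # No known word starts here. Assumption: '_' only separates
            # tokens; any other character joins the current unknown token.
            if name[i] == '_':
                flush_unknown()
            else:
                unknown.append(name[i])
            i += 1
    flush_unknown()
    return ' '.join(words)


# Stand-in for the real files; any iterable of lines works, for example
# an open file object (opened with newline='' for the csv module).
known_words_file = io.StringIO(
    "Acct,Account\nAccount,Account\nNum,Number\nNumber,Number\n"
    "Payee,Payee\nPymt,Payment\nType,Type\n"
)
mapping = load_known_words(known_words_file)
max_len = max(map(len, mapping))

for variable in ["ACCOUNTNUM", "ACCT_NUM", "ACCTNUM",
                 "PAYEETYPE", "PAYEE_TYP", "PYMT_DT"]:
    print(friendly_name(variable, mapping, max_len))
```

For the six sample variables this prints the same names as the regex version. The two approaches can differ on names where an underscore falls inside an unrecognized stretch, because the regex version deletes underscores while this sketch treats them as token boundaries.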