Tags: python, machine-learning, nlp, classification, fuzzywuzzy

Unable to detect gibberish names using Python


I am trying to build a Python model that can classify account names as either legitimate or gibberish. Capitalization is not important in this case, as some legitimate account names consist of all upper-case or all lower-case letters.

Disclaimer: this is just an internal research experiment, and no real action will be taken based on the classifier's outcome.

In my particular case, there are 2 characteristics that can reveal an account name as suspicious, gibberish, or both:

  1. Weird/random spelling in the name, or a name that consists purely or mostly of numbers. Examples of account names that fit these criteria are: 128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds.

  2. The name has 2 components (let's assume that no name will ever have more than 2 components) and the spelling and pronunciation of the 2 components are very similar. Examples of account names that fit these criteria are: Jala Haja, Hata Yaha, Faja Kaja.

If an account name meets both of the above criteria (e.g. 'asdfs lsdfs', '332 333'), it should also be considered suspicious.

On the other hand, a legitimate account name doesn't need to have both a first name and a last name. Legitimate names usually come from popular languages such as Romance/Latin languages (e.g. Spanish, Portuguese, French) as well as German, English, Chinese, and Japanese.

Examples of legitimate account names include (these names are made up but do reflect similar styles of legitimate account names in real world): Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng.

I've seen some slightly similar questions on Stack Overflow that ask for ways to detect gibberish text, but those don't fit my situation because legitimate texts and words actually have meanings, whereas human names usually don't. I also want to be able to do this based on account names alone and nothing else.

Right now my script handles the 2nd characteristic of suspicious account names (similar name components) using Python's FuzzyWuzzy package with 50% as the similarity threshold. The script is listed below:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

import pandas as pd
import numpy as np

accounts = pd.read_csv('dataset_with_names.csv', encoding = 'ISO-8859-1', sep=None, engine='python').replace(np.nan, 'blank', regex=True)

pd.options.mode.chained_assignment = None

accounts.columns = ['name', 'email', 'akon_id', 'acct_creation_date', 'first_time_city', 'first_time_ip', 'label']

accounts['name_simplified'] = accounts['name'].str.replace(r'[^\w\s]', '', regex=True)  # strip punctuation
accounts['name_simplified'] = accounts['name_simplified'].str.lower()

# Flag names whose two components are at least 50% similar to each other
sim_name = []

for index, row in accounts.iterrows():
    if ' ' in row['name_simplified']:
        parts = row['name_simplified'].split()
        if len(parts) > 1:
            if fuzz.ratio(parts[0], parts[1]) >= 50:
                sim_name.append('True')
            else:
                sim_name.append('False')
        else:
            sim_name.append('False')
    else:
        sim_name.append('False')

accounts['are_name_components_similar'] = sim_name
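
(As a quick sanity check, the example pairs from the 2nd characteristic clear this threshold. The exact scores may vary slightly with the fuzzywuzzy backend, but each of these illustrative pairs lands at or above 50.)

from fuzzywuzzy import fuzz

# Illustrative check of the 2nd-characteristic examples (lowercased, as in the script above)
for a, b in [('jala', 'haja'), ('hata', 'yaha'), ('faja', 'kaja')]:
    print(a, b, fuzz.ratio(a, b))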

The results have been reliable for what the script was designed to do, but I also want to be able to surface gibberish account names with the 1st characteristic (weird/random spelling, or a name consisting purely or mostly of numbers). So far I have not found a solution for that.

Can anyone help? Any feedback or suggestions would be greatly appreciated!


Solution

  • For the 1st characteristic, you can train a character-based n-gram language model, and treat all names with low average per-character probability as suspicious.

    A quick-and-dirty example of such a language model is below. It is a mixture of character-level 1-gram, 2-gram and 3-gram models, trained on the Brown corpus. I am sure you can find more relevant training data (e.g. a list of actors' names).

    from nltk.corpus import brown
    from collections import Counter
    import numpy as np

    # Training text: Brown corpus sentences joined by '\n  ', so newlines and
    # leading spaces act as word/sentence boundary markers
    text = '\n  '.join([' '.join([w for w in s]) for s in brown.sents()])

    # Character-level 1-, 2- and 3-gram counts
    unigrams = Counter(text)
    bigrams = Counter(text[i:(i+2)] for i in range(len(text)-2))
    trigrams = Counter(text[i:(i+3)] for i in range(len(text)-3))

    # Interpolation weights for the 1-, 2- and 3-gram estimates
    weights = [0.001, 0.01, 0.989]

    def strangeness(text):
        """ Average negative log-probability per character (higher = stranger). """
        r = 0
        text = '  ' + text + '\n'  # pad with the same boundary characters used in training
        for i in range(2, len(text)):
            char = text[i]
            context1 = text[(i-1):i]
            context2 = text[(i-2):i]
            num = unigrams[char] * weights[0] + bigrams[context1+char] * weights[1] + trigrams[context2+char] * weights[2]
            den = sum(unigrams.values()) * weights[0] + unigrams[context1] * weights[1] + bigrams[context2] * weights[2]
            r -= np.log(num / den)
        return r / (len(text) - 2)
    

    Now you can apply this strangeness measure to your examples.

    t1 = '128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds'.split(', ')
    t2 = 'Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng'.split(', ')
    for t in t1 + t2:
        print('{:20} -> {:9.5}'.format(t, strangeness(t)))
    

    You can see that gibberish names are in most cases more "strange" than normal ones. You could, for example, use a threshold of 3.9 here; a sketch of applying that cutoff to the accounts DataFrame follows the output below.

    128                  ->    5.5528
    127                  ->    5.6572
    h4rugz4sx383a6n64hpo ->    5.9016
    tt                   ->    4.9392
    t66                  ->    6.9673
    t65                  ->    6.8501
    asdfds               ->    3.9776
    Michael              ->    3.3598
    sara                 ->    3.8171
    jose colmenares      ->    2.9539
    Dimitar              ->    3.4602
    Jose Rafael          ->    3.4604
    Morgan               ->    3.3628
    Eduardo Medina       ->    3.2586
    Luis R. Mendez       ->     3.566
    Hikaru               ->    3.8936
    SELENIA              ->    6.1829
    Zhang Ming           ->    3.4809
    Xuting Liu           ->    3.7161
    Chen Zheng           ->    3.6212
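
    As a rough sketch, you could then combine this cutoff with the similarity flag you already compute. This assumes the accounts DataFrame and name_simplified column from your question; the column names is_strange and is_suspicious are just illustrative, and 3.9 is a starting point to tune against labelled examples.

    accounts['strangeness'] = accounts['name_simplified'].apply(strangeness)
    accounts['is_strange'] = accounts['strangeness'] >= 3.9
    # suspicious if the name is "strange" or its two components are very similar
    accounts['is_suspicious'] = accounts['is_strange'] | (accounts['are_name_components_similar'] == 'True')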
    

    Of course, a simpler solution is to collect a list of popular names in all your target languages and use no machine learning at all - just lookups.
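
    A minimal sketch of that lookup approach, assuming you have collected such a list into a hypothetical popular_names.txt file (one lowercase name per line):

    # Hypothetical lookup table of known first/last names
    with open('popular_names.txt', encoding='utf-8') as f:
        known_names = {line.strip().lower() for line in f if line.strip()}

    def looks_legitimate(account_name):
        # every space-separated component must appear in the list
        parts = account_name.lower().split()
        return bool(parts) and all(p in known_names for p in parts)

    Note that middle initials like the "R." in "Luis R. Mendez" would need extra handling in such a lookup.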