Tags: python, machine-learning, nlp, classification, fuzzywuzzy

Unable to detect gibberish names using Python


I am trying to build a Python model that can classify account names as either legitimate or gibberish. Capitalization is not important in this case, as some legitimate account names consist of all upper-case or all lower-case letters.

Disclaimer: this is just an internal research experiment, and no real action will be taken based on the classifier's outcome.

In my particular case, there are 2 characteristics that can reveal an account name as suspicious, gibberish, or both:

  1. Weird/random spelling in the name, or a name that consists purely or mostly of numbers. Examples of account names that fit these criteria are: 128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds.

  2. The name has 2 components (let's assume that no name will ever have more than 2 components) and the spelling and pronunciation of the 2 components are very similar. Examples of account names that fit these criteria are: Jala Haja, Hata Yaha, Faja Kaja.

If an account name meets both of the above criteria (e.g. 'asdfs lsdfs', '332 333'), it should also be considered suspicious.

On the other hand, a legitimate account name doesn't need to have both a first name and a last name. Legitimate names usually come from popular languages such as Romance/Latin languages (e.g. Spanish, Portuguese, French) as well as German, English, Chinese, and Japanese.

Examples of legitimate account names include (these names are made up but do reflect similar styles of legitimate account names in real world): Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng.

I've seen some slightly similar questions on Stack Overflow that ask for ways to detect gibberish text, but those don't fit my situation because legitimate texts and words actually have meanings, whereas human names usually don't. I also want to be able to do this based on account names alone and nothing else.

Right now my script handles the 2nd characteristic of suspicious account names (similar name components) using Python's FuzzyWuzzy package with 50% as the similarity threshold. The script is listed below:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

import pandas as pd
import numpy as np

accounts = pd.read_csv('dataset_with_names.csv', encoding = 'ISO-8859-1', sep=None, engine='python').replace(np.nan, 'blank', regex=True)

pd.options.mode.chained_assignment = None

accounts.columns = ['name', 'email', 'akon_id', 'acct_creation_date', 'first_time_city', 'first_time_ip', 'label']

accounts['name_simplified'] = accounts['name'].str.replace(r'[^\w\s]', '', regex=True)  # strip punctuation
accounts['name_simplified'] = accounts['name_simplified'].str.lower()

# Flag names whose two components are at least 50% similar to each other
sim_name = []

for index, row in accounts.iterrows():
    if ' ' in row['name_simplified']:
        parts = row['name_simplified'].split()
        if len(parts) > 1:
            if fuzz.ratio(parts[0], parts[1]) >= 50:
                sim_name.append('True')
            else:
                sim_name.append('False')
        else:
            sim_name.append('False')
    else:
        sim_name.append('False')

accounts['are_name_components_similar'] = sim_name
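
(As a quick sanity check, the example pairs from the 2nd characteristic clear this threshold. The exact scores may vary slightly with the fuzzywuzzy backend, but each of these illustrative pairs lands at or above 50.)

from fuzzywuzzy import fuzz

# Illustrative check of the 2nd-characteristic examples (lowercased, as in the script above)
for a, b in [('jala', 'haja'), ('hata', 'yaha'), ('faja', 'kaja')]:
    print(a, b, fuzz.ratio(a, b))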

The results have been reliable for what the script was designed to do, but I also want to be able to surface gibberish account names with the 1st characteristic (weird/random spelling, or a name consisting purely or mostly of numbers). So far I have not found a solution for that.

Can anyone help? Any feedback or suggestions would be greatly appreciated!


Solution

  • For the 1st characteristic, you can train a character-based n-gram language model, and treat all names with low average per-character probability as suspicious.

    A quick-and-dirty example of such a language model is below. It is a mixture of character-level 1-gram, 2-gram and 3-gram models, trained on the Brown corpus. I am sure you can find more relevant training data (e.g. a list of actors' names).

    from nltk.corpus import brown
    from collections import Counter
    import numpy as np

    # Training text: Brown corpus sentences joined by '\n  ', so newlines and
    # leading spaces act as word/sentence boundary markers
    text = '\n  '.join([' '.join([w for w in s]) for s in brown.sents()])

    # Character-level 1-, 2- and 3-gram counts
    unigrams = Counter(text)
    bigrams = Counter(text[i:(i+2)] for i in range(len(text)-2))
    trigrams = Counter(text[i:(i+3)] for i in range(len(text)-3))

    # Interpolation weights for the 1-, 2- and 3-gram estimates
    weights = [0.001, 0.01, 0.989]

    def strangeness(text):
        """ Average negative log-probability per character (higher = stranger). """
        r = 0
        text = '  ' + text + '\n'  # pad with the same boundary characters used in training
        for i in range(2, len(text)):
            char = text[i]
            context1 = text[(i-1):i]
            context2 = text[(i-2):i]
            num = unigrams[char] * weights[0] + bigrams[context1+char] * weights[1] + trigrams[context2+char] * weights[2]
            den = sum(unigrams.values()) * weights[0] + unigrams[context1] * weights[1] + bigrams[context2] * weights[2]
            r -= np.log(num / den)
        return r / (len(text) - 2)
    

    Now you can apply this strangeness measure to your examples.

    t1 = '128, 127, h4rugz4sx383a6n64hpo, tt, t66, t65, asdfds'.split(', ')
    t2 = 'Michael, sara, jose colmenares, Dimitar, Jose Rafael, Morgan, Eduardo Medina, Luis R. Mendez, Hikaru, SELENIA, Zhang Ming, Xuting Liu, Chen Zheng'.split(', ')
    for t in t1 + t2:
        print('{:20} -> {:9.5}'.format(t, strangeness(t)))
    

    You can see that gibberish names are in most cases more "strange" than normal ones. You could, for example, use a threshold of 3.9 here; a sketch of applying that cutoff to the accounts DataFrame follows the output below.

    128                  ->    5.5528
    127                  ->    5.6572
    h4rugz4sx383a6n64hpo ->    5.9016
    tt                   ->    4.9392
    t66                  ->    6.9673
    t65                  ->    6.8501
    asdfds               ->    3.9776
    Michael              ->    3.3598
    sara                 ->    3.8171
    jose colmenares      ->    2.9539
    Dimitar              ->    3.4602
    Jose Rafael          ->    3.4604
    Morgan               ->    3.3628
    Eduardo Medina       ->    3.2586
    Luis R. Mendez       ->     3.566
    Hikaru               ->    3.8936
    SELENIA              ->    6.1829
    Zhang Ming           ->    3.4809
    Xuting Liu           ->    3.7161
    Chen Zheng           ->    3.6212
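
    As a rough sketch, you could then combine this cutoff with the similarity flag you already compute. This assumes the accounts DataFrame and name_simplified column from your question; the column names is_strange and is_suspicious are just illustrative, and 3.9 is a starting point to tune against labelled examples.

    accounts['strangeness'] = accounts['name_simplified'].apply(strangeness)
    accounts['is_strange'] = accounts['strangeness'] >= 3.9
    # suspicious if the name is "strange" or its two components are very similar
    accounts['is_suspicious'] = accounts['is_strange'] | (accounts['are_name_components_similar'] == 'True')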
    

    Of course, a simpler solution is to collect a list of popular names in all your target languages and use no machine learning at all - just lookups.
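
    A minimal sketch of that lookup approach, assuming you have collected such a list into a hypothetical popular_names.txt file (one lowercase name per line):

    # Hypothetical lookup table of known first/last names
    with open('popular_names.txt', encoding='utf-8') as f:
        known_names = {line.strip().lower() for line in f if line.strip()}

    def looks_legitimate(account_name):
        # every space-separated component must appear in the list
        parts = account_name.lower().split()
        return bool(parts) and all(p in known_names for p in parts)

    Note that middle initials like the "R." in "Luis R. Mendez" would need extra handling in such a lookup.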