python string spam-prevention pattern-recognition

Calculating probability that a string has been randomized? - Python

This is related to a question I asked earlier (question).

I have a list of manually created strings such as:

lucy87

gordan_king

fancy_unicorn77

joplucky_kanga90

base_belong_to_narwhals

and a list of randomized strings:

johnkdf

pancake90kgjd

fancy_jagookfk

manhattanljg

What gives away that the last set of strings is randomized is the presence of sequences such as 'kjg', 'jgf', 'lkd', ...

Is there a clever way to separate strings that contain these apparently randomized sequences from the rest?

I guess this relies on the fact that certain characters are more likely to appear next to each other (e.g. 'co', 'ka', 'ja', ...).

Any ideas on this one? Kylotan mentioned Reverend, but I am not sure whether it can be used for such a purpose.

Help would be much appreciated!

Solution

This is just a thought. I've never tried it myself...

Build a Bloom filter by hashing every (overlapping) 4-letter sequence found in a dictionary. Test a string by counting how many of its 4-letter sequences don't hit the filter. The more misses, the more likely it is that the string contains random junk.
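Here is a minimal, untested-in-production sketch of that idea in Python. Everything in it is my own assumption rather than an existing library: the `BloomFilter` class, the salted SHA-256 hashing, the default sizes, and the helper names `ngrams`, `build_filter` and `miss_ratio`. Digits and underscores are treated as separators, so only runs of letters are checked.

```python
import hashlib
import re


class BloomFilter:
    """Tiny Bloom filter; num_bits and num_hashes are the knobs to tune."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, n=4):
    """Yield overlapping n-letter sequences from each run of letters."""
    for run in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(run) - n + 1):
            yield run[i:i + n]


def build_filter(words, n=4, **filter_kwargs):
    """Add every n-letter sequence of every dictionary word to a filter."""
    bloom = BloomFilter(**filter_kwargs)
    for word in words:
        for gram in ngrams(word, n):
            bloom.add(gram)
    return bloom


def miss_ratio(text, bloom, n=4):
    """Fraction of n-letter sequences in text that miss the filter."""
    grams = list(ngrams(text, n))
    if not grams:
        return 0.0  # too short to judge
    misses = sum(1 for gram in grams if gram not in bloom)
    return misses / len(grams)


# Toy word list for illustration; a real one would be a full dictionary.
bloom = build_filter(["fancy", "unicorn", "pancake", "manhattan"])
for name in ["fancy_unicorn77", "fancy_jagookfk", "manhattanljg"]:
    print(name, miss_ratio(name, bloom))
```

I used the ratio of misses rather than the raw count so that long and short strings are comparable. Where to put the cut-off between "looks manual" and "looks random" is something you would have to find by trying it on your own data.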

Try tuning the size of the bloom filter and the number of letters per sequence.
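As a rough guide for the tuning, the standard approximation for a Bloom filter's false-positive rate is (1 - e^(-k*n/m))^k, where m is the number of bits, k the number of hash functions and n the number of stored sequences. Here a filter false positive means a junk sequence that appears to hit, which makes random strings look more legitimate. The function names below are made up for illustration, and the figure of 100,000 sequences is only a placeholder, not a measurement.

```python
import math


def bloom_false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_bits, num_items):
    """Hash count that minimises the rate above: (m/n) * ln 2, at least 1."""
    return max(1, round(num_bits / num_items * math.log(2)))


# Placeholder: assume the dictionary yields about 100,000 distinct sequences.
num_items = 100_000
for num_bits in (1 << 18, 1 << 20, 1 << 22):
    k = optimal_num_hashes(num_bits, num_items)
    rate = bloom_false_positive_rate(num_bits, k, num_items)
    print(num_bits, k, rate)
```

Shorter sequences mean fewer distinct entries but also less discriminating power, so the sequence length still has to be tuned by experiment.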

Also note (thanks @MihaiD) that you should include a dictionary of names, preferably from multiple languages, in the bloom filter to minimise false positives.