string algorithm machine-learning similarity feature-engineering

Learning names of spammers

Currently, waves of spam are flooding the internet, especially around sporting events.

Since I strongly suspect that the spammers' usernames are computer generated, I thought it might be interesting to try learning spammer names programmatically.

A username must be between 2 and 15 characters long, begin with a letter, and contain only letters, digits, _ or -.
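As a rough sketch, that rule could be checked with a regular expression like the one below. The names `USERNAME_RE` and `is_valid_username` are made up for illustration, and I am assuming "letters" means ASCII letters only, which the rule itself doesn't say.

```python
import re

# One leading letter followed by 1-14 letters, digits, "_" or "-",
# which gives 2-15 characters in total.
# Assumption: only ASCII letters count as letters.
USERNAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]{1,14}")


def is_valid_username(name: str) -> bool:
    """Return True if `name` satisfies the username rule described above."""
    return USERNAME_RE.fullmatch(name) is not None
```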

A sample list of names:

riazsports0171
maya34444
thelmaeatons
tigran777
newlive100
darbeshbaba
litondina10
nithuhasan
newlive100
bankuali
lldztwydni554
monomala505
nasiruddin1500
lldztwydni554
ariful3032
nazmulhasan

I only have a fairly basic knowledge of algorithms (from university). My question is: which machine learning algorithms and/or string metrics could I use to predict whether an arbitrary username probably belongs to a spammer? I thought about using cosine string similarity, because it's fairly simple.

Solution

Interesting. But I don't think string similarity algorithms are the best solution.

I'd try to extract features from the names and use a classification algorithm. An SVM often gives very good results compared to other classification algorithms, but there are other options as well (for example, Naive Bayes, decision trees, k-NN), each with its own advantages and disadvantages.

The tricky part will be extracting the features. You should be creative. Some options are: the number of digits, the number of consecutive letters, the number of consecutive consonants, whether capitalization is used, whether it is used correctly, whether the name matches a certain regex, ... (You could also use features that don't come from the string itself, such as the number of messages this user has sent you, ...)
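A minimal sketch of what such a feature extractor might look like, using only the standard library. The function names and the exact choice of features are just one possible reading of the list above, not a recommended set, and treating "y" as a consonant is an arbitrary assumption.

```python
import re

VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant


def longest_run(name, predicate):
    """Length of the longest run of consecutive characters satisfying `predicate`."""
    best = current = 0
    for ch in name:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def is_consonant(ch):
    return ch.isalpha() and ch.lower() not in VOWELS


def extract_features(name):
    """Map a username to a dict of simple numeric features."""
    return {
        "length": len(name),
        "num_digits": sum(ch.isdigit() for ch in name),
        "longest_letter_run": longest_run(name, str.isalpha),
        "longest_consonant_run": longest_run(name, is_consonant),
        "num_uppercase": sum(ch.isupper() for ch in name),
        # Example of a "matches a certain regex" feature: trailing digits.
        "ends_with_digits": int(re.search(r"\d+$", name) is not None),
    }


print(extract_features("lldztwydni554"))
```

Most classifiers expect a fixed-order numeric vector, so in practice you would turn each dict into a list or tuple with a consistent key order.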

Next, you need to create a training set. It should contain usernames of both spammers and non-spammers, each manually labeled as one or the other.

Feed the training set to your algorithm of choice, and it will create a classifier, which you will be able to use to predict if new users are spammers or not.
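To make the train-then-predict flow concrete, here is a small sketch using k-nearest neighbours, chosen only because it fits in a few lines of standard-library Python (it is not the SVM mentioned above). The spam names are taken from the question; the "ham" names are invented placeholders, not real data, and the three features are a toy selection.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def features(name):
    """Toy feature vector: (length, digit count, longest consonant run)."""
    longest = run = 0
    for ch in name.lower():
        run = run + 1 if ch.isalpha() and ch not in VOWELS else 0
        longest = max(longest, run)
    return (len(name), sum(ch.isdigit() for ch in name), longest)


def train(labeled_names):
    """k-NN has no real fitting step: 'training' just stores vectors with labels."""
    return [(features(name), label) for name, label in labeled_names]


def predict(model, name, k=3):
    """Majority vote among the k stored examples closest to `name`."""
    target = features(name)
    nearest = sorted(model, key=lambda item: math.dist(item[0], target))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Spam examples come from the question; the "ham" names are hypothetical
# placeholders. Replace both with your own hand-labeled data.
training_set = [
    ("riazsports0171", "spam"),
    ("lldztwydni554", "spam"),
    ("nasiruddin1500", "spam"),
    ("ariful3032", "spam"),
    ("alice", "ham"),
    ("bob_smith", "ham"),
    ("carol-j", "ham"),
    ("dave", "ham"),
]

model = train(training_set)
print(predict(model, "monomala505"))
```

The features here are left unscaled, so the length dominates the distance. With a real data set you would probably want to normalize them, or use a library implementation of whichever algorithm you pick.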

You can evaluate the effectiveness of each algorithm using cross-validation on your data.
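A sketch of k-fold cross-validation, written so that any pair of train/predict functions can be plugged in. Everything here is illustrative: `k_fold_accuracy`, `train_stump` and `predict_stump` are made-up names, the stump is a deliberately crude stand-in classifier (a single threshold on the digit count), and the "ham" names are invented placeholders rather than real data.

```python
import random


def k_fold_accuracy(examples, train_fn, predict_fn, k=5, seed=0):
    """Average accuracy over k folds; each example is held out exactly once."""
    if not 2 <= k <= len(examples):
        raise ValueError("k must be between 2 and the number of examples")
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(training)
        correct = sum(predict_fn(model, x) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def digit_count(name):
    return sum(ch.isdigit() for ch in name)


def train_stump(training):
    """Pick the digit-count threshold that classifies the training fold best."""
    def hits(threshold):
        return sum((digit_count(n) >= threshold) == (label == "spam")
                   for n, label in training)
    return max(range(16), key=hits)  # usernames have at most 15 characters


def predict_stump(threshold, name):
    return "spam" if digit_count(name) >= threshold else "ham"


# Spam names are from the question; ham names are hypothetical placeholders.
labeled = [
    ("riazsports0171", "spam"), ("maya34444", "spam"), ("thelmaeatons", "spam"),
    ("tigran777", "spam"), ("darbeshbaba", "spam"), ("newlive100", "spam"),
    ("alice", "ham"), ("bob_smith", "ham"), ("carol-j", "ham"),
    ("dave", "ham"), ("erin42", "ham"), ("frank", "ham"),
]

print(k_fold_accuracy(labeled, train_stump, predict_stump, k=4))
```

The number printed for such a tiny, partly invented data set means nothing by itself; the point is that running the same function over each candidate algorithm gives you comparable scores.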