string algorithm machine-learning similarity feature-engineering

Learning names of spammers


Currently, waves of spam are flooding the internet, especially around sporting events.

As I strongly suspect that the spammers' usernames are computer generated, I thought it might be interesting to try to learn spammer names programmatically.

A user name should be between 2 and 15 characters, begin with a letter and contain only letters, numbers, _ or -.
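
The rule above can be written down as a regular expression. This is only a sketch: it assumes "letters" means ASCII letters, and the names `USERNAME_RE` and `is_valid_username` are made up for illustration.

```python
import re

# 2 to 15 characters: one leading letter, then 1 to 14 letters, digits, "_" or "-".
# Assumption: only ASCII letters count as letters.
USERNAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]{1,14}")


def is_valid_username(name: str) -> bool:
    """Return True if `name` satisfies the length and character rules."""
    return USERNAME_RE.fullmatch(name) is not None


print(is_valid_username("tigran777"))  # True
print(is_valid_username("7tigran"))    # False, starts with a digit
```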

A sample list of names would be:

riazsports0171
maya34444
thelmaeatons
tigran777
newlive100
darbeshbaba
litondina10
nithuhasan
newlive100
bankuali
lldztwydni554
monomala505
nasiruddin1500
lldztwydni554
ariful3032
nazmulhasan

I only have a fairly basic knowledge of algorithms (from university). My question is: which machine learning algorithms and/or string metrics could I use to predict whether an arbitrary username probably belongs to a spammer? I thought about using cosine string similarity, because it's fairly simple.


Solution

  • Interesting. But I don't think string similarity algorithms are the best solution.

    I'd try to extract features from the names and use a classification algorithm. An SVM usually gives very good results compared to other classification algorithms, but there are others as well (for example Naive Bayes, decision trees, KNN), each with its advantages and disadvantages.

    The tricky part will be extracting the features. You should be creative. Some options are: the number of digits, the number of consecutive letters, the number of consecutive consonants, the use of capitalization, whether capitalization is used correctly, whether the name matches a certain regex, ... (You could also use features that do not come from the string, such as the number of messages this user has sent to you, ...)
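
A few of these features could be computed along the following lines. This is a rough sketch, not a recommended feature set: the feature names, the choice to treat "y" as a consonant, and the `ends_with_digits` regex are all assumptions made for illustration.

```python
import re

VOWELS = set("aeiou")


def longest_run(name, predicate):
    """Length of the longest run of consecutive characters satisfying `predicate`."""
    best = current = 0
    for ch in name:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def is_consonant(ch):
    # Assumption: every letter that is not a, e, i, o, u counts as a consonant.
    return ch.isalpha() and ch.lower() not in VOWELS


def extract_features(name):
    """Map a username to a dict of simple numeric features."""
    return {
        "length": len(name),
        "num_digits": sum(ch.isdigit() for ch in name),
        "longest_letter_run": longest_run(name, str.isalpha),
        "longest_consonant_run": longest_run(name, is_consonant),
        "num_uppercase": sum(ch.isupper() for ch in name),
        # Example of a regex feature: does the name end in one or more digits?
        "ends_with_digits": int(bool(re.search(r"\d+$", name))),
    }


print(extract_features("lldztwydni554"))
```

A name such as `lldztwydni554` from the question stands out mainly through its long consonant run, which is the kind of signal a classifier can pick up on.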

    Next, you need to create a training set. It should contain user names of both spammers and non-spammers, each manually labeled as spammer or non-spammer.

    Feed the training set to your algorithm of choice, and it will create a classifier, which you will be able to use to predict if new users are spammers or not.
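
To make the train-then-predict step concrete, here is a minimal sketch using k-nearest neighbours (the KNN option mentioned above), written in plain Python so that it runs without extra libraries. In practice a library implementation of an SVM or another classifier would likely be a better choice. The three features and the non-spammer names are invented for illustration; only the spammer names come from the question.

```python
import math
from collections import Counter


def features(name):
    """Hypothetical feature vector: (length, digit count, longest consonant run)."""
    digits = sum(ch.isdigit() for ch in name)
    best = run = 0
    for ch in name.lower():
        run = run + 1 if ch.isalpha() and ch not in "aeiou" else 0
        best = max(best, run)
    return (len(name), digits, best)


def train(labeled_names):
    """'Training' a k-NN model just stores the feature vectors with their labels."""
    return [(features(name), label) for name, label in labeled_names]


def predict(model, name, k=3):
    """Majority vote among the k stored examples closest to `name`."""
    x = features(name)
    nearest = sorted(model, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Spammer names are taken from the question; the "ham" names are made up.
training_set = [
    ("riazsports0171", "spam"),
    ("maya34444", "spam"),
    ("lldztwydni554", "spam"),
    ("nasiruddin1500", "spam"),
    ("alice", "ham"),
    ("bob_smith", "ham"),
    ("carol-j", "ham"),
    ("dave", "ham"),
]

model = train(training_set)
print(predict(model, "ariful3032"))  # spam
```

The features are not scaled here, so the length dominates the distance. With real data, scaling the features and using a much larger training set would matter a lot.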

    You can evaluate the effectiveness of each algorithm by using cross-validation on your data.
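
Cross-validation can be sketched as follows: split the labeled data into k folds, train on k-1 of them, test on the remaining one, and average the accuracy. The classifier here is a deliberately trivial, hypothetical "digit count threshold" rule that only gives the loop something to evaluate; the non-spammer names are made up, and the function names are illustrative.

```python
def digit_count(name):
    return sum(ch.isdigit() for ch in name)


def train_stump(examples):
    """Pick the digit-count threshold that classifies the most training examples correctly.

    Names with at least `threshold` digits are predicted to be spam.
    """
    best_threshold, best_correct = 0, -1
    for threshold in range(16):  # usernames have at most 15 characters
        correct = sum((digit_count(n) >= threshold) == is_spam for n, is_spam in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def predict_stump(threshold, name):
    return digit_count(name) >= threshold


def cross_validate(examples, k, train, predict):
    """Mean accuracy over k folds; fold i holds every k-th example starting at index i."""
    accuracies = []
    for i in range(k):
        test = examples[i::k]
        training = [ex for j, ex in enumerate(examples) if j % k != i]
        model = train(training)
        correct = sum(predict(model, name) == label for name, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# True = spammer (names from the question), False = non-spammer (made-up names).
data = [
    ("riazsports0171", True), ("alice", False),
    ("maya34444", True), ("bob_smith", False),
    ("tigran777", True), ("carol-j", False),
    ("newlive100", True), ("dave", False),
]

print(cross_validate(data, k=4, train=train_stump, predict=predict_stump))
```

This toy data set is trivially separable, so the score says nothing about real-world performance. With real data, shuffle the examples before splitting and run the same loop once per candidate algorithm to compare them.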