Waves of spam are currently flooding the internet, especially around sports events.
Since I strongly suspect that the spammers' usernames are computer generated, I thought it might be interesting to try to learn spammer names programmatically.
A user name should be between 2 and 15 characters long, begin with a letter, and contain only letters, numbers, "_" or "-".
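For reference, the naming rule above could be checked with a regular expression roughly like the one below. This is only a sketch: the function name is_valid_username is made up for illustration, and it assumes "letters" means ASCII letters, which the rule itself does not say.

```python
import re

# Assumption: "letters" means ASCII a-z / A-Z.
# One leading letter plus 1 to 14 further allowed characters = 2 to 15 in total.
USERNAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]{1,14}")

def is_valid_username(name):
    """Return True if name satisfies the naming rule described above."""
    return USERNAME_RE.fullmatch(name) is not None

print(is_valid_username("tigran777"))  # True
print(is_valid_username("7tigran"))    # False, starts with a digit
```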
A sample list of names would be
riazsports0171
maya34444
thelmaeatons
tigran777
newlive100
darbeshbaba
litondina10
nithuhasan
newlive100
bankuali
lldztwydni554
monomala505
nasiruddin1500
lldztwydni554
ariful3032
nazmulhasan
I only have fairly basic knowledge of algorithms (from university). My question is: which machine learning algorithms and/or string metrics could I use to predict whether an arbitrary username probably belongs to a spammer? I thought about using cosine string similarity, because it's fairly simple.
Interesting. But I don't think string similarity algorithms are the best solution.
I'd try to extract features from the names and use a classification algorithm. SVMs usually give very good results compared to other classification algorithms, but there are other options as well (for example Naive Bayes, decision trees, KNN), each with its own advantages and disadvantages.
The tricky part will be extracting the features. You should be creative. Some options are: the number of digits, the number of consecutive letters, the number of consecutive consonants, use of capitalization, correct use of capitalization, whether the name matches a certain regex, ... (You could also use features that don't come from the string, such as the number of messages this user has sent you, ...)
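To make the feature idea a bit more concrete, here is a rough sketch of what such an extraction could look like. The function names and the exact set of features are my own choices for illustration, and it assumes that "y" counts as a consonant; you will probably want to experiment with others.

```python
import re

VOWELS = set("aeiou")

def longest_run(name, predicate):
    """Length of the longest run of consecutive characters satisfying predicate."""
    best = current = 0
    for ch in name:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best

def is_consonant(ch):
    # Assumption: every letter that is not a, e, i, o, u counts as a consonant.
    return ch.isalpha() and ch.lower() not in VOWELS

def extract_features(name):
    """Turn a user name into a dict of simple numeric features."""
    return {
        "length": len(name),
        "digits": sum(ch.isdigit() for ch in name),
        "longest_letter_run": longest_run(name, str.isalpha),
        "longest_consonant_run": longest_run(name, is_consonant),
        "uppercase": sum(ch.isupper() for ch in name),
        "ends_with_digits": int(re.search(r"\d+$", name) is not None),
    }

print(extract_features("lldztwydni554"))
```

A random-looking name such as lldztwydni554 gets a long consonant run, while a name such as tigran777 does not, so different features pick up different kinds of generated names.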
Next, you need to create a training set. It will contain the user names of both spammers and non-spammers, manually labeled as spammer or non-spammer.
Feed the training set to your algorithm of choice, and it will create a classifier that you can use to predict whether new users are spammers.
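A minimal sketch of that training step, assuming you use scikit-learn (my choice, any ML library would do). The spammer names are the ones from the question with duplicates dropped; the non-spammer names are placeholders I made up and would have to be replaced with real, manually labeled names. The features function is a deliberately tiny, hypothetical feature set.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def features(name):
    """Tiny illustrative feature vector: length, digit count, digit share."""
    digits = sum(ch.isdigit() for ch in name)
    return [len(name), digits, digits / max(len(name), 1)]

# Spammer names taken from the question (duplicates removed).
spammers = ["riazsports0171", "maya34444", "thelmaeatons", "tigran777",
            "newlive100", "darbeshbaba", "litondina10", "nithuhasan",
            "bankuali", "lldztwydni554", "monomala505", "nasiruddin1500",
            "ariful3032", "nazmulhasan"]
# Made-up placeholder names standing in for manually labeled non-spammers.
non_spammers = ["alice", "bob_smith", "JohnDoe", "mary-jane", "kevin92",
                "cs_student", "PeterPan", "anna_k", "mike2001", "LauraB"]

X = [features(n) for n in spammers + non_spammers]
y = [1] * len(spammers) + [0] * len(non_spammers)  # 1 = spammer, 0 = not

# Scale the features, then fit an SVM on the labeled training set.
classifier = make_pipeline(StandardScaler(), SVC())
classifier.fit(X, y)

new_names = ["tigran777", "mary-jane"]
predictions = classifier.predict([features(n) for n in new_names])
for name, label in zip(new_names, predictions):
    print(name, "spammer" if label == 1 else "not a spammer")
```

With this little data the predictions mean very little; the point is only the shape of the workflow (names to feature vectors, fit, predict).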
You can evaluate the effectiveness of each algorithm by using cross-validation on your data.
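As a sketch of such a comparison, again assuming scikit-learn: cross_val_score trains and tests each candidate on different splits of the labeled data and returns one accuracy score per split. The non-spammer names are once more made-up placeholders, and the features function is a hypothetical minimal feature set, so the resulting numbers say nothing about real-world performance.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def features(name):
    """Tiny illustrative feature vector: length, digit count, digit share."""
    digits = sum(ch.isdigit() for ch in name)
    return [len(name), digits, digits / max(len(name), 1)]

# Spammer names from the question, duplicates removed so the same name
# cannot end up in both the training and the test split.
spammers = ["riazsports0171", "maya34444", "thelmaeatons", "tigran777",
            "newlive100", "darbeshbaba", "litondina10", "nithuhasan",
            "bankuali", "lldztwydni554", "monomala505", "nasiruddin1500",
            "ariful3032", "nazmulhasan"]
# Made-up placeholder names standing in for manually labeled non-spammers.
non_spammers = ["alice", "bob_smith", "JohnDoe", "mary-jane", "kevin92",
                "cs_student", "PeterPan", "anna_k", "mike2001", "LauraB"]

X = [features(n) for n in spammers + non_spammers]
y = [1] * len(spammers) + [0] * len(non_spammers)

candidates = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
}

results = {}
for label, model in candidates.items():
    # 3-fold cross-validation; the default score for classifiers is accuracy.
    scores = cross_val_score(model, X, y, cv=3)
    results[label] = scores.mean()
    print(f"{label}: mean accuracy {results[label]:.2f}")
```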