machine-learning artificial-intelligence classification cross-validation text-classification

Simple machine learning for website classification

I am trying to generate a Python program that determines if a website is harmful (porn etc.).

First, I made a Python web scraping program that counts the number of occurrences for each word.

It's a key value dictionary like

{ word : [ # occurrences in harmful websites, # of websites that contain these words] }.

Now I want my program to analyze the words from any websites to check if the website is safe or not. But I don't know which methods will suit to my data.

Solution

The key thing here is your training data. You need some sort of supervised learning technique where your training data consists of website's data itself (text document) and its label (harmful or safe).

You can certainly use the RNN but there also other natural language processing techniques and much faster ones.

Typically, you should use a proper vectorizer on your training data (think of each site page as a text document), for example tf-idf (but also other possibilities; if you use Python I would strongly suggest scikit that provides lots of useful machine learning techniques and mentioned sklearn.TfidfVectorizer is already within). The point is to vectorize your text document in enhanced way. Imagine for example the English word the how many times it typically exists in text? You need to think of biases such as these.

Once your training data is vectorized you can use for example stochastic gradient descent classifier and see how it performs on your test data (in machine learning terminology the test data means to simply take some new data example and test what your ML program outputs).

In either case you will need to experiment with above options. There are many nuances and you need to test your data and see where you achieve the best results (depending on ML algorithm settings, type of vectorizer, used ML technique itself and so on). For example Support Vector Machines are great choice when it comes to binary classifiers too. You may wanna play with that too and see if it performs better than SGD.

In any case, remember that you will need to obtain quality training data with labels (harmful vs. safe) and find the best fitting classifier. On your journey to find the best one you may also wanna use cross validation to determine how well your classifier behaves. Again, already contained in scikit-learn.

N.B. Don't forget about valid cases. For example there may be a completely safe online magazine where it only mentions the harmful topic in some article; it doesn't mean the website itself is harmful though.

Edit: As I think of it, if you don't have any experience with ML at all it could be useful to take any online course because despite the knowledge of API and libraries you will still need to know what it does and the math behind the curtain (at least roughly).