Classification of user browsing activity using machine learning

If you record all IP traffic (using Wireshark or a similar program) while browsing the internet, you'll find that many packets are sent that are not part of your browsing activity.

My question is:

Suppose you wish to classify the packets (sent from your PC) into two groups:

1) packets sent as part of your browsing activity

2) all other packets

How would you use machine learning to solve this problem?

You can assume the packet payload can't be used for this purpose because it's either encapsulated or encrypted, so only packet headers can be used, e.g. TCP window size, TCP flag bits, packet length and packet direction.

Solution

Sounds like a binary classification problem.

There are three basic approaches you might use:

1) Collect packets you can manually label as "browsing activity" or "other" and train a binary classifier on them (an SVM, for example).
2) Collect only packets that are "browsing activity" and train a one-class classifier on them (a one-class SVM, for example).
3) Collect all the data you can and try to cluster it into two clusters. There is a chance (unfortunately a very small one) that the division found will be the one you are looking for.
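To make the three approaches concrete, here is a minimal sketch using scikit-learn (my choice of library, not something the question requires). The feature values are synthetic numbers invented purely for illustration, and the column meanings (packet length, TCP window size, direction) and hyperparameters are assumptions, so treat this as a starting point rather than a tuned solution.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for header features, one row per packet:
# [packet length, TCP window size, direction (1 = outbound)].
# These distributions are made up; real traffic will look different.
browsing = np.column_stack([
    rng.normal(900, 200, 200),
    rng.normal(60000, 5000, 200),
    rng.integers(0, 2, 200),
])
other = np.column_stack([
    rng.normal(120, 40, 200),
    rng.normal(8000, 2000, 200),
    rng.integers(0, 2, 200),
])
X = np.vstack([browsing, other])
y = np.array([1] * len(browsing) + [0] * len(other))  # 1 = browsing, 0 = other

# Scale features so that no single header field dominates the kernel.
X_scaled = StandardScaler().fit_transform(X)

# Approach 1: binary classifier trained on manually labelled packets.
binary_clf = SVC(kernel="rbf").fit(X_scaled, y)
binary_pred = binary_clf.predict(X_scaled)

# Approach 2: one-class classifier trained on browsing packets only.
# The scaler is also fitted on browsing packets only, since that is all
# the data this approach assumes you have at training time.
browsing_scaler = StandardScaler().fit(browsing)
one_class_clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
one_class_clf.fit(browsing_scaler.transform(browsing))
# predict() returns +1 for "looks like browsing" and -1 for "outlier".
one_class_pred = one_class_clf.predict(browsing_scaler.transform(X))

# Approach 3: unsupervised clustering into two groups. The cluster ids
# are arbitrary, so you still have to inspect which one (if any)
# corresponds to browsing.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
cluster_ids = kmeans.labels_
```

In practice you would evaluate approaches 1 and 2 on held-out packets rather than on the training data, and for approach 3 you would compare the clusters against a small labelled sample to see whether the split means anything.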

In each of the above cases you will need to prepare a set of features to represent your data. You can either use a fixed set of hand-picked features, or you might try to treat the packet header as raw text and train a text-based model, such as a convolutional neural network.
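As a rough illustration of the fixed-feature option, the sketch below turns the header fields mentioned in the question into a numeric vector. The `PacketHeader` class and `to_features` function are hypothetical names introduced here; it assumes you have already pulled these fields out of your capture with whatever tool you use. The TCP flag bit masks are the standard ones.

```python
from dataclasses import dataclass

# Standard TCP flag bit masks (low six bits of the flags field).
TCP_FLAGS = (
    ("FIN", 0x01),
    ("SYN", 0x02),
    ("RST", 0x04),
    ("PSH", 0x08),
    ("ACK", 0x10),
    ("URG", 0x20),
)


@dataclass
class PacketHeader:
    """Hypothetical container for the header fields named in the question."""
    length: int      # packet length in bytes
    window: int      # TCP window size
    flags: int       # TCP flag bits as an integer
    outbound: bool   # True if the packet was sent from the PC


def to_features(pkt: PacketHeader) -> list[float]:
    """Map one packet header to a fixed-length feature vector."""
    # One 0/1 feature per TCP flag, in the order given by TCP_FLAGS.
    flag_bits = [1.0 if pkt.flags & mask else 0.0 for _, mask in TCP_FLAGS]
    direction = 1.0 if pkt.outbound else 0.0
    return [float(pkt.length), float(pkt.window), *flag_bits, direction]


# Example: an outbound SYN packet.
syn = PacketHeader(length=60, window=64240, flags=0x02, outbound=True)
features = to_features(syn)
# features == [60.0, 64240.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

Single packets carry little information, so you may get better results by aggregating such vectors per flow or per time window (counts, means, inter-arrival times) before training.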