python machine-learning data-science anomaly-detection

Anomaly detection with Python


I need to build the following mechanism. I have a dataset containing statistics for a Git repository (for example, the number of commits per day, the number of lines of code edited per day, and so on; no more than 4 or 5 fields). I need an anomaly detection algorithm that analyzes this dataset and raises an alert when it detects values that differ from the norm.

For example: I run the algorithm at the end of each day, and if there were many more commits than usual that day, it must trigger an alert.

I have to implement this system in Python.

From what I've read online, this kind of system calls for unsupervised machine learning. Over the past few months I've been taking a machine learning course, and I know (a little) how to use the Python library scikit-learn. But I'm not a machine learning expert and I don't know where to start. Unfortunately, I can only find very theoretical tutorials online (written by data scientists), and I don't understand what I have to do in practice.

Could someone advise me what to do and what to use?

Is there a more or less simple solution to my problem? Thanks.


Solution

  • Fit a Gaussian Mixture Model or an Isolation Forest on the data, then choose a threshold on the model's score beyond which a day counts as an anomaly.

    As with all such problems, there is a tradeoff between recall and precision: a stricter threshold raises fewer false alerts but misses more real anomalies, and a looser one does the opposite. To evaluate your solution, identify some anomalies by inspection and label them as such. These can then be part of your validation and test sets. The training set should contain no anomalies (or only a small number of them).
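
The steps above could be sketched roughly as follows with scikit-learn's IsolationForest. This is only an illustration under stated assumptions: the data is synthetic and made up for the example (two hypothetical columns, commits per day and lines edited per day), the hand-labeled anomalies are invented, and the 1% quantile used as a threshold is an arbitrary starting point that you would tune on your own validation set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the real repository statistics (assumption):
# column 0 = commits per day, column 1 = lines edited per day.
normal_days = np.column_stack([
    rng.poisson(lam=10, size=300),
    rng.normal(loc=500, scale=100, size=300),
]).astype(float)

# Training set: normal days only, as suggested above.
train_X = normal_days[:250]

# Validation set: held-out normal days plus a few days that we pretend
# were marked as anomalies by inspection (1 = anomaly, 0 = normal).
labeled_anomalies = np.array([[60.0, 4000.0], [45.0, 3500.0], [0.0, 9000.0]])
validation_X = np.vstack([normal_days[250:], labeled_anomalies])
validation_y = np.array([0] * 50 + [1] * len(labeled_anomalies))

model = IsolationForest(n_estimators=200, random_state=0).fit(train_X)

# score_samples returns lower values for more abnormal points.
train_scores = model.score_samples(train_X)

# Threshold choice (assumption): flag anything scoring below the lowest 1%
# of training days. Raising the quantile increases recall but lowers precision.
threshold = np.quantile(train_scores, 0.01)


def is_anomalous(day_stats):
    """Return True if one day's statistics score below the threshold."""
    row = np.asarray(day_stats, dtype=float).reshape(1, -1)
    return bool(model.score_samples(row)[0] < threshold)


# Evaluate the chosen threshold on the labeled validation days.
val_pred = (model.score_samples(validation_X) < threshold).astype(int)
precision = precision_score(validation_y, val_pred, zero_division=0)
recall = recall_score(validation_y, val_pred, zero_division=0)
print("precision:", precision)
print("recall:", recall)

# End-of-day check for a new (hypothetical) day of statistics.
print("alert today?", is_anomalous([70, 5000]))
```

Swapping in sklearn.mixture.GaussianMixture would follow the same pattern, since it also exposes a score_samples method (a log-likelihood, where lower again means less typical). Whichever model you use, rerun the precision and recall check whenever you move the threshold.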