
K-Means on time series data with Apache Spark


I have a data pipeline system where all events are stored in Apache Kafka. There is an event processing layer, which consumes and transforms that data (time series) and then stores the resulting data set into Apache Cassandra.

Now I want to use Apache Spark to train some machine learning models for anomaly detection. The idea is to run the k-means algorithm on the past data, for example separately for every single hour of the day.

For example, I can select all events from 4pm-5pm and build a model for that interval. If I apply this approach, I will get exactly 24 models (centroids for every single hour).

If the algorithm performs well, I can reduce the interval size to, for example, 5 minutes.
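To make it concrete, here is a rough sketch of what I have in mind (Spark ML in Scala; the parquet path, the timestamp column `ts` and the feature columns `value` and `rate` are just placeholders for my real schema, which actually lives in Cassandra):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.hour

val spark = SparkSession.builder.appName("hourly-kmeans").getOrCreate()
import spark.implicits._

// Placeholder load: a DataFrame with a timestamp column "ts" and numeric feature columns.
val events = spark.read.parquet("/data/events")

val assembler = new VectorAssembler()
  .setInputCols(Array("value", "rate"))
  .setOutputCol("features")

// One model per hour of the day: 24 sets of centroids.
val modelsByHour = (0 until 24).map { h =>
  val slice = assembler.transform(events.filter(hour($"ts") === h))
  h -> new KMeans().setK(5).setSeed(1L).fit(slice)
}.toMap
```

At prediction time I would pick the model of the current hour and compare new events against its centroids.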

Is this a good approach for anomaly detection on time series data?


Solution

  • I have to say that this strategy is a good way to find outliers, but you need to take care of a few things. First, using all the events of every 5-minute window to create a new centroid for each interval is probably not a good idea.

    With too many centroids it becomes really hard to find the outliers, and that is exactly what you don't want.

    So here is a good strategy:

    1. Find a good number of clusters (K) for your k-means.

      This is really important: with too many or too few clusters you get a bad representation of reality. So select K carefully (see the first sketch after this list).
    2. Take a good training set.

      You don't need to use all the data to rebuild the model every time, every day. Take a sample of what is normal for you, and leave out what is not normal, because that is exactly what you want to detect later. Use this sample to create your model and then find the clusters.
    3. Test it!

      You need to test whether it is working or not. Do you have examples of events you consider strange, and a set you know is not strange? Run both against the model and check whether they are separated correctly. Cross-validation can help with this.
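    As a minimal sketch of the first step, you could train k-means for several values of K on your "normal" training slice and compare silhouette scores with Spark ML's ClusteringEvaluator. Here `trainingData` (a DataFrame that already has a "features" vector column) is an assumption about your pipeline:

    ```scala
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator

    // Silhouette with squared Euclidean distance (the evaluator's default).
    val evaluator = new ClusteringEvaluator()

    val scores = Seq(2, 4, 6, 8, 12, 16).map { k =>
      val model = new KMeans().setK(k).setSeed(1L).fit(trainingData)
      (k, evaluator.evaluate(model.transform(trainingData)))
    }

    scores.foreach { case (k, silhouette) => println(s"k=$k silhouette=$silhouette") }
    ```

    A K with a clearly better silhouette (or the "elbow" if you plot the scores) is usually a reasonable choice.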

    So, is your idea good? Yes, it works, but make sure you don't do more work in the cluster than you need. Of course you can keep using each day's data to train your model further, but recompute the centroids only once a day, and let the Euclidean distance decide what does or does not belong to your groups (see the sketch below).
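    A rough sketch of that last part, assuming `model` is the KMeansModel trained on the normal data for the interval and `newEvents` already has a "features" column; the threshold is a placeholder you would tune, e.g. from a high percentile of the distances seen in training:

    ```scala
    import org.apache.spark.ml.clustering.KMeansModel
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.functions.{col, udf}

    val centers = model.clusterCenters

    // Euclidean distance from an event to its nearest centroid.
    val nearestDistance = udf { (features: Vector) =>
      centers.map(c => math.sqrt(Vectors.sqdist(c, features))).min
    }

    val threshold = 3.0   // placeholder value; tune it on your training distances

    val scored = newEvents
      .withColumn("distance", nearestDistance(col("features")))
      .withColumn("isOutlier", col("distance") > threshold)
    ```

    Everything flagged with `isOutlier = true` is an event that does not fit any of the groups learned from your normal data.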

    I hope that I helped you!