Tags: machine-learning, anomaly-detection

Detecting a change in behavior in one instance of a group (but not the group as a whole)


I have been reading about anomaly detection in time-series data and understand how to use it to track one metric over time.

For example, say we want to track the number of times a person (e.g. John) uses a website per day. We can use anomaly detection to detect when John's figure has spiked or dropped significantly. The inputs would be "John's website hits per day" and the date.

However, say I want to run this same check for lots of users who are all independent. The algorithm is not trying to find correlations between the users' activity, but just to alert us when one user's activity changes significantly. So if John's activity is abnormally high on a certain day, we would be alerted to the anomaly.

Another example is monitoring lots of devices and detecting when one device is sending an abnormally high number of requests per minute. Again, the aim isn't to detect a correlation between all the devices sending more requests; it's to alert us that one device is behaving differently from its normal pattern.

I'm not sure if this is normal anomaly detection, as it appears I would have to build a model for each of the users in the first example to detect a change. This might be feasible for a small number of users, but it seems hard to scale to a lot of users.

So I'm wondering: is anomaly detection the right approach for this, or are there other AI monitoring solutions / tools out there that I'm not aware of?


Solution

  • If your users have simple time-series patterns that a fast time-series model can learn, then yes, you could run time-series anomaly detection for each user. If your users have an intermittent usage pattern, though, it likely won't perform well.

    I would not say it's a standard approach though. Intuitively, with this technique, no model can learn the "global picture" because every model can only look at one user. Also users tend to come and go, making historical data availability and cold-start for new users a concern.
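
    A minimal sketch of the per-user approach, assuming daily counts in a pandas DataFrame with hypothetical columns "user", "date" and "hits". It swaps the "fast timeseries model" for a plain rolling z-score (an assumption, not something this answer prescribes), computed for every user in one grouped pass so no per-user model object has to be managed. The window and threshold values are arbitrary.

```python
import numpy as np
import pandas as pd


def flag_per_user_anomalies(df, window=14, threshold=3.0):
    """Flag days where a user's hits deviate strongly from that user's own recent history.

    Expects columns "user", "date", "hits" (hypothetical names).
    """
    df = df.sort_values(["user", "date"]).copy()
    grouped = df.groupby("user")["hits"]
    # shift(1) keeps today's value out of its own baseline.
    baseline = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=window).mean()
    )
    spread = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=window).std()
    )
    # A zero spread would divide by zero, so treat it as "no score".
    df["zscore"] = (df["hits"] - baseline) / spread.replace(0, np.nan)
    # Rows without a full window have a NaN z-score and are never flagged.
    df["is_anomaly"] = df["zscore"].abs() > threshold
    return df


# Made-up data: two independent users with very different volumes.
dates = pd.date_range("2024-01-01", periods=20, freq="D")
john = [10 + (i % 3) for i in range(19)] + [60]  # spike on the last day
mary = [100 + (i % 4) for i in range(20)]        # steady, higher volume
usage = pd.DataFrame({
    "user": ["john"] * 20 + ["mary"] * 20,
    "date": list(dates) * 2,
    "hits": john + mary,
})

flagged = flag_per_user_anomalies(usage)
print(flagged.loc[flagged["is_anomaly"], ["user", "date", "hits"]])
```

    Each user is compared only against their own history, so Mary's higher volume does not mask John's spike. New users produce no flags until they have a full window of history, which is the cold-start concern mentioned above.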

    A standard approach you can try is using outlier detection models. Represent your user sessions with features, then run a tabular outlier detection technique. Example features:

    • gender (an example of information that is not contained in the usage time series but can help the model a lot; worth considering)
    • new user?
    • number of days since last visit
    • number of visits in the last month
    • number of visits in the last week
    • average number of clicks per session
    • number of clicks in this session
    • ...

    Building good features is key here, but it's usually something you can iterate on after going to production. I'd suggest finding one or two features you know work well and making the model usable in real life. You can then iterate toward better features. For models, you can start by looking at the sklearn model documentation to build an intuition.
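
    As a sketch of the tabular route, the following fits sklearn's IsolationForest on synthetic session features. The column names mirror the feature list above, but the data, the injected outlier and the parameter values are made up for illustration, and any other sklearn outlier detector could be swapped in.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic "normal" sessions; every distribution here is an assumption.
rng = np.random.default_rng(0)
n = 200
sessions = pd.DataFrame({
    "is_new_user": rng.integers(0, 2, n),
    "days_since_last_visit": rng.poisson(3, n),
    "visits_last_month": rng.poisson(12, n),
    "visits_last_week": rng.poisson(3, n),
    "avg_clicks_per_session": rng.normal(20, 4, n),
    "clicks_this_session": rng.normal(20, 5, n),
})

# Inject one obviously unusual session at row 0.
sessions.loc[0, "visits_last_week"] = 60
sessions.loc[0, "clicks_this_session"] = 900.0

model = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
model.fit(sessions)

scores = model.score_samples(sessions)  # lower score = more abnormal
labels = model.predict(sessions)        # -1 = outlier, 1 = inlier
most_abnormal = int(np.argmin(scores))

print("most abnormal row:", most_abnormal)
print("rows flagged as outliers:", int((labels == -1).sum()))
```

    One model covers all users here, which sidesteps the one-model-per-user scaling problem from the question.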

    One thing to know is that most real-life anomaly detection systems use a mix of statistical techniques and expert systems (if/else rules). Before building any model, I'd suggest building a solid understanding of the dataset through a lot of visual exploration and distribution statistics, then trying to build simple if/else rules that can find anomalies. This is never a waste of time, because the features you build for your if/else system will most likely be really useful to an ML model. It will also serve as the baseline for benchmarking more complex methods.
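
    A rule-based baseline can be as small as the sketch below. The rule names, thresholds and field names are hypothetical and would come from your own exploration of the data.

```python
def rule_based_flags(session, spike_ratio=5.0, min_clicks=50, max_gap_days=90):
    """Return the names of the hand-written rules that fire for one session dict."""
    reasons = []
    avg = session["avg_clicks_per_session"]
    clicks = session["clicks_this_session"]

    # Rule 1: this session has far more clicks than the user's own average.
    if clicks >= min_clicks and avg > 0 and clicks / avg >= spike_ratio:
        reasons.append("click_spike")

    # Rule 2: a long-dormant existing account suddenly becomes active.
    if not session["is_new_user"] and session["days_since_last_visit"] > max_gap_days:
        reasons.append("dormant_account_active")

    return reasons


normal = {"is_new_user": False, "days_since_last_visit": 2,
          "avg_clicks_per_session": 20, "clicks_this_session": 22}
spike = dict(normal, clicks_this_session=400)
dormant = dict(normal, days_since_last_visit=200)

for name, s in [("normal", normal), ("spike", spike), ("dormant", dormant)]:
    print(name, rule_based_flags(s))
```

    The quantities the rules compare (clicks relative to the user's average, time since last visit) are the same ones you would later hand to a model as features.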

    You may want to look at the fraud detection or churn detection literature. In some contexts, representing the problem as a graph problem helps. It's hard to suggest resources without more context. With machine learning/data mining tasks, understanding the exact context and constraints really helps to narrow down solution candidates. Try to think about:

    • Can I afford anomaly labelling, and so unlock a breadth of supervised ML techniques? (Supervised techniques are not discussed above.)
    • What volume do I need to process, and at what frequency?
    • Is historical data easy to get fast?
    • Should the model detect in real time, or can it be done in batch/asynchronously?
    • What is the accuracy goal? The precision goal?

    You may want to edit your question with answers to these questions.
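
    On the accuracy versus precision question: because anomalies are rare, accuracy alone can look excellent for a detector that never alerts. Here is a small worked example with made-up labels (10 anomalies in 1000 sessions), assuming labelled data is available.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall); precision/recall are None when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


y_true = [1] * 10 + [0] * 990                       # 10 real anomalies, 990 normal
never_alert = [0] * 1000                            # a detector that never fires
detector = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978  # 8 hits, 2 misses, 12 false alarms

print(evaluate(y_true, never_alert))  # (0.99, None, 0.0)
print(evaluate(y_true, detector))     # (0.986, 0.4, 0.8)
```

    The useless detector scores higher on accuracy than the one that catches 8 of the 10 anomalies, which is why it helps to state the goal in terms of precision and recall.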