machine-learning time-series random-forest anomaly-detection

Random cut forest for detecting anomalies in periodic time series patterns


I have time series data that measures the volume of an activity at half-hour intervals. The activity has weekly periodic patterns: for example, the volume is highest on Monday mornings and low on weekends. I couldn't work out whether RRCF (robust random cut forest) detects periodic patterns, i.e. whether it would give a different score to a volume that is normal on a Monday morning but abnormal on a Thursday morning.

Of course any suggestion on any algorithm would be appreciated.


Solution

  • Technically yes, the algorithm is able to see this distinction. RCF works by randomly cutting on features and checking which points are most "isolated" (roughly speaking; the score it actually computes is a bit more complex). If volumes on Monday are always high, then a given Monday point will not be easily isolated, because there will be many other points with similar values. If, however, there is a point on Wednesday that is unusually high, then as the algorithm randomly splits on the volume and on the weekday it will most likely find that this point stands alone.

    However, it is important to give the algorithm the means to split well. In particular, it is not trivial to split on the weekday, which is a categorical variable.
    The best encoding for this kind of variable would be a sine-cosine encoding, which transforms the weekday into two variables (its sine and its cosine), so that the algorithm can easily split on the days while preserving the notion of distance between them (i.e. Monday is as close to Sunday as it is to Tuesday). You would lose that with a label encoder, whose integer codes do not wrap around from Sunday back to Monday.

    If the encoding is not clear, this post should explain the concept better:
    https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/
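
As a rough illustration of the sine-cosine encoding described above, here is a minimal sketch using only the Python standard library. The helper names (`cyclical_encode`, `encoded_distance`, `make_features`) and the `[volume, sin, cos]` feature layout are assumptions made for this example, not part of any RCF library. Weekdays are numbered Monday=0 to Sunday=6, the same convention as `datetime.weekday()`.

```python
import math


def cyclical_encode(value, period):
    """Map a position on a cycle of length `period` to a point on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def encoded_distance(a, b, period):
    """Euclidean distance between two cycle positions after encoding."""
    return math.dist(cyclical_encode(a, period), cyclical_encode(b, period))


# Weekdays as integers, Monday=0 ... Sunday=6.
MONDAY, TUESDAY, THURSDAY, SUNDAY = 0, 1, 3, 6


def make_features(volume, weekday):
    """Hypothetical feature vector for one observation: [volume, sin, cos]."""
    s, c = cyclical_encode(weekday, 7)
    return [float(volume), s, c]


if __name__ == "__main__":
    # After encoding, Monday is as far from Sunday as it is from Tuesday.
    print(encoded_distance(MONDAY, SUNDAY, 7))
    print(encoded_distance(MONDAY, TUESDAY, 7))
    # Thursday is further away from Monday than either neighbour.
    print(encoded_distance(MONDAY, THURSDAY, 7))
    print(make_features(120, MONDAY))
```

The same function could probably be applied to the half-hour slot within the week (period 336) instead of, or in addition to, the weekday, so that "Monday morning" and "Monday evening" are also distinguishable. Since the robust random cut forest picks the cut dimension with probability proportional to that dimension's range, it is likely worth scaling the volume to a range comparable to the encoded features, which lie in [-1, 1].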