amazon-web-services amazon-kinesis amazon-kinesis-firehose amazon-sagemaker

AWS SageMaker Random Cut Forest or Kinesis Data Analytics Random Cut Forest?


I need to put together an architecture that can detect anomalies in logs created by a web application.

The Random Cut Forest algorithm keeps coming up in my research, and it is offered by two services: SageMaker and Kinesis Data Analytics.

Which of these two services should I use in my architecture?


Solution

  • At their core, the two implementations use nearly the same mathematical method, but they differ in how the algorithm is packaged within Kinesis and SageMaker, and those differences should drive your decision.

    Kinesis RandomCutForest:

    • Streaming version of the algorithm, well suited to near-real-time updates to the model.
    • Supports time decay of older records, shingling of the input data, and, if you are using multiple dimensions, anomaly attribution that helps you understand the effect of each dimension.
    • If your logs are stored in CloudWatch, subscription filters (plus Lambda, if needed) let you preprocess them and send them to Kinesis with little effort.
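
    As a rough illustration of the CloudWatch-to-Kinesis step, here is a minimal sketch of a Lambda handler that unpacks a subscription-filter event and forwards the log lines to a Kinesis stream. The stream name `web-app-logs`, the partition key `web-app`, and the extra `kinesis_client` parameter (there only so the function can be exercised without AWS) are assumptions for the example, not part of the original answer.

```python
import base64
import gzip
import json


def decode_cloudwatch_event(event):
    """Return the log events carried by a CloudWatch Logs subscription event."""
    # CloudWatch Logs hands Lambda a base64-encoded, gzip-compressed JSON payload.
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    # Control messages only check the subscription; they carry no application logs.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return payload["logEvents"]


def to_kinesis_records(log_events, partition_key):
    """Shape log events as entries for the Kinesis PutRecords API."""
    return [
        {
            "Data": json.dumps(
                {"timestamp": e["timestamp"], "message": e["message"]}
            ).encode("utf-8"),
            "PartitionKey": partition_key,
        }
        for e in log_events
    ]


def handler(event, context, kinesis_client=None, stream_name="web-app-logs"):
    """Lambda entry point; stream_name is a hypothetical stream name."""
    records = to_kinesis_records(decode_cloudwatch_event(event), "web-app")
    if kinesis_client is None:
        import boto3  # provided by the Lambda Python runtime

        kinesis_client = boto3.client("kinesis")
    # PutRecords accepts at most 500 records per call, so send in chunks.
    for start in range(0, len(records), 500):
        kinesis_client.put_records(
            StreamName=stream_name, Records=records[start:start + 500]
        )
    return len(records)
```

    This sketch does not retry records that PutRecords reports as failed, and any real preprocessing (parsing status codes, latencies, and so on into numeric columns) would go in `to_kinesis_records`.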

    SageMaker RandomCutForest:

    • Batch version of the algorithm, great for large datasets (typically stored in S3) or where there's no need to update the model frequently.
    • Like Kinesis, supports near-real-time scoring of incoming data points through an inference endpoint, but new data points do not change the underlying model.
    • Supports hyperparameter optimization, which searches for the best set of parameters for your model (such as the number of samples per tree, the number of trees, etc.).
    • Scaling up instances for both training and scoring is straightforward, and the available SageMaker Notebooks can help you preprocess and prepare your data for training.
    • So, if your dataset is large and you don't need dynamic updates to your model, SageMaker should be the preferred solution.
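
    Either service returns a numeric anomaly score per data point, and you still have to decide what counts as "anomalous". A common rule of thumb is to flag scores more than three standard deviations above the mean score of known-normal data. The sketch below shows that rule in plain Python; the score values are made up for illustration and stand in for what a scoring endpoint would return.

```python
import statistics


def anomaly_cutoff(baseline_scores, num_std=3.0):
    """Cutoff = mean + num_std * standard deviation of the baseline scores."""
    mean = statistics.fmean(baseline_scores)
    std = statistics.pstdev(baseline_scores)
    return mean + num_std * std


def flag_anomalies(scores, cutoff):
    """Return the indices of scores that exceed the cutoff."""
    return [i for i, score in enumerate(scores) if score > cutoff]


# Hypothetical scores for a period of normal traffic (not real output).
baseline = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 0.98, 1.02, 1.0]
cutoff = anomaly_cutoff(baseline)

# Hypothetical scores for newly arrived log records.
new_scores = [1.01, 0.97, 3.4, 1.03]
print(flag_anomalies(new_scores, cutoff))  # prints [2]
```

    The multiplier `num_std` is a tuning knob: lowering it catches more anomalies at the cost of more false alarms.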

    Hope this answers your question.