machine-learning cassandra bigdata mlops feast

Is an intermediary persistent store needed before storing features in Feast + Cassandra?


I am currently building a big data pipeline for an MLOps project. The pipeline is intended for batch processing.

This is the current setup:

  • I am storing my raw structured data in Hive.
  • Spark jobs ingest raw data and process it.
  • I intend to use Feast with Apache Cassandra as an offline store for the computed and curated features produced by my Spark jobs.

I want to pass data efficiently from the Spark jobs to Feast and Cassandra. I am not sure whether an intermediary persistence layer is needed to hold the processed data before it is passed to Feast and stored in the offline store. Is it necessary in my case?


Solution

  • It isn't necessary to store the output of the Spark jobs separately unless your use case explicitly requires it. Otherwise the extra hop is just a waste of processing and storage.

    As a side note, in case you weren't already aware, the CassIO library provides a simple integration for using Cassandra as a Feast feature store with minimal boilerplate code. Feel free to try it out on your own Cassandra cluster. If you don't already have a cluster running, you can launch one in less than 2 minutes on the free tier (no credit card required) of Astra DB, a cloud-based Cassandra-as-a-service. Cheers!
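
To make "no intermediary store" concrete, here is a minimal sketch of handing a Spark job's result straight to Feast. It rests on several assumptions that are not in the question: the feature repository already points at Cassandra in its `feature_store.yaml`, Feast's `FeatureStore.write_to_online_store` is the right entry point for your setup, and each batch is small enough to collect on the driver with `toPandas()`. The helper name `push_features` and the feature view name `driver_stats` are hypothetical.

```python
from typing import Any


def push_features(spark_df: Any, store: Any, feature_view_name: str) -> int:
    """Hand a Spark job's output to Feast with no staging table or file.

    spark_df: a pyspark.sql.DataFrame holding the curated features.
    store:    a feast.FeatureStore whose repo config is assumed to point
              at Cassandra already.
    Returns the number of rows handed to Feast.
    """
    # toPandas() collects the whole result on the driver, so this only
    # suits batches that fit in driver memory.
    pandas_df = spark_df.toPandas()

    # Feast forwards the rows to whichever store the repo configures.
    store.write_to_online_store(feature_view_name=feature_view_name, df=pandas_df)
    return len(pandas_df)
```

In a real job you would likely call it as `push_features(features_df, FeatureStore(repo_path="."), "driver_stats")` at the end of the Spark transformation. If a batch is too large for the driver, that is one of the cases where writing the Spark output to a table first and letting Feast load from it may be the better trade-off.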