python, apache-spark, google-cloud-platform, google-cloud-dataflow, google-cloud-dataproc

Google Cloud - What products for time series data cleaning?


I have around 20 TB of time series data stored in BigQuery.

The current pipeline I have is:

raw data in BigQuery => joins in BigQuery to create more BigQuery datasets => store them in buckets

Then I download a subset of the files in the bucket and work on interpolation/resampling of the data using Python/SFrame, because some of the time series have missing timestamps and are not evenly sampled.

However, it takes a long time on a local PC, and I'm guessing it will take days to go through that 20TB of data.


Since the data are already in buckets, I'm wondering what would be the best Google tools for interpolation and resampling?

After resampling and interpolation I might use Facebook's Prophet or Auto ARIMA to create some forecasts. But that would be done locally.


There are a few Google services that seem like good options.

  1. Cloud Dataflow: I have no experience with Apache Beam, but it looks like the Python API for Apache Beam is missing functions compared to the Java version? I know how to write Java, but I'd like to use one programming language for this task.

  2. Cloud Dataproc: I know how to write PySpark, but I don't really need any real-time or stream processing. However, Spark has time series interpolation, so this might be the only option?

  3. Cloud Dataprep: Looks like a GUI for cleaning data, but it's in beta. Not sure if it can do time series resampling/interpolation.

Does anyone have any idea which might best fit my use case?

Thanks


Solution

  • I would use PySpark on Dataproc, since Spark is not just for real-time/streaming workloads but also for batch processing.

    You can choose the size of your cluster (and use some preemptible workers to save costs) and run the cluster only for as long as you actually need to process the data. Afterwards, delete the cluster.

    Spark also works very nicely with Python (not quite as nicely as with Scala), but for all intents and purposes the main difference is performance, not reduced API functionality.

    Even with batch processing you can use a WindowSpec for effective time series interpolation.

    To be fair: I don't have a lot of experience with Dataflow or Dataprep, but that's because our use case is somewhat similar to yours and Dataproc works well for it.
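
To make the WindowSpec suggestion more concrete, here is a minimal sketch of window-based linear interpolation. It is not taken from the answer itself and rests on several assumptions: the column names `series_id`, `ts` and `value` are hypothetical, `ts` is numeric (for example epoch seconds) and unique within a series, and missing samples already exist as rows with a null `value` (for example after joining against a regular time grid). The plain-Python function shows the same fill logic on a single series, which can be handy for checking results on a small sample.

```python
from typing import List, Optional


def interpolate_gaps(ts: List[float], values: List[Optional[float]]) -> List[Optional[float]]:
    """Plain-Python reference: linearly fill None values between known neighbours.

    `ts` must be sorted ascending. Leading and trailing gaps stay None.
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (ts[i] - ts[left]) / (ts[right] - ts[left])
            out[i] = values[left] + (values[right] - values[left]) * frac
    return out


def interpolate_gaps_spark(df, key="series_id", ts="ts", value="value"):
    """The same fill expressed with Spark window functions (sketch, hypothetical column names)."""
    # Imported here so the reference function above can be used without Spark installed.
    from pyspark.sql import Window, functions as F

    ordered = Window.partitionBy(key).orderBy(ts)
    up_to_here = ordered.rowsBetween(Window.unboundedPreceding, Window.currentRow)
    from_here = ordered.rowsBetween(Window.currentRow, Window.unboundedFollowing)

    # Timestamp only where a value is present, so last()/first() can skip the gaps.
    known_ts = F.when(F.col(value).isNotNull(), F.col(ts))

    with_neighbours = (
        df.withColumn("prev_val", F.last(F.col(value), ignorenulls=True).over(up_to_here))
        .withColumn("prev_ts", F.last(known_ts, ignorenulls=True).over(up_to_here))
        .withColumn("next_val", F.first(F.col(value), ignorenulls=True).over(from_here))
        .withColumn("next_ts", F.first(known_ts, ignorenulls=True).over(from_here))
    )

    # Position of the current row between its two known neighbours (0..1).
    frac = (F.col(ts) - F.col("prev_ts")) / (F.col("next_ts") - F.col("prev_ts"))
    filled = F.when(F.col(value).isNotNull(), F.col(value)).otherwise(
        F.col("prev_val") + (F.col("next_val") - F.col("prev_val")) * frac
    )
    return with_neighbours.withColumn(value, filled).drop(
        "prev_val", "prev_ts", "next_val", "next_ts"
    )
```

Gaps at the very start or end of a series have only one known neighbour, so both versions leave them empty rather than extrapolating. Because the windows are partitioned by series, Spark can process different series in parallel across the cluster.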