Tags: google-bigquery, azure-data-factory, cryptocurrency, data-ingestion

What managed data services are recommended for importing data from a REST API to cloud storage on a regular schedule (crypto)?


Objective: I want an easy-to-manage way of ingesting data from a REST API into a cloud data store such as BigQuery or similar.

Specifically: there are a number of crypto-focused APIs, like Glassnode, from which I want to extract data in the following ways:

  1. Full backfill (historically, for the past X months)
  2. At the required time resolution (e.g. hourly or daily)
  3. Incremental addition of new data on a regular schedule (same resolution as #2 above; see the sketch after this list)
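For context, the core pattern I'm describing is a date-windowed GET loop with a persisted watermark. Here is a minimal Python sketch against a Glassnode-style endpoint; the metric path, parameter names (a, i, s, u, api_key), and daily chunking are illustrative assumptions to verify against the vendor's docs:

```python
import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; assumes key-based auth via query param
BASE = "https://api.glassnode.com/v1/metrics/market/price_usd_close"  # illustrative metric

def fetch_window(since_ts: int, until_ts: int, interval: str = "1h") -> list[dict]:
    """Pull one [since, until) window of points at the given resolution (#2)."""
    resp = requests.get(
        BASE,
        params={"a": "BTC", "i": interval, "s": since_ts, "u": until_ts, "api_key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # typically a list of {"t": unix_ts, "v": value} points

def backfill(months: int = 6, interval: str = "1h") -> list[dict]:
    """#1: one-off historical load for the past X months, chunked by day."""
    now = int(time.time())
    rows: list[dict] = []
    for s in range(now - months * 30 * 86_400, now, 86_400):
        rows.extend(fetch_window(s, min(s + 86_400, now), interval))
        time.sleep(1)  # naive rate limiting between chunks
    return rows

def incremental(last_seen_ts: int, interval: str = "1h") -> list[dict]:
    """#3: scheduled run that only fetches points newer than the last load."""
    return fetch_window(last_seen_ts + 1, int(time.time()), interval)
```

The last_seen_ts watermark is the only state a scheduler needs to persist between runs, which is exactly the kind of bookkeeping I'd want a managed service to handle for me.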

I have come across a few services that look encouraging.

  1. Precog (has pre-built connectors for the APIs I'm interested in; I have not trialed the product yet)
  2. Azure Data Factory

But I'd like to ask folks, what are the most common/recommended data ingestion services for a use case like the above?

I'm happy to pay for a service, and I'd prioritize minimizing the overhead of managing data ingestion pipelines over cost.

Thanks in advance for any feedback / advice.


Solution

  • Azure Data Factory will work for this. I would say part of your decision should be based on what you want to do with the data afterwards: if you knew you wanted to land the data in GCP, I would probably lean toward an ETL tool that works in Google Cloud, since Azure Data Factory runs in Azure. If you are landing data from an API into blob storage using the public endpoint, ADF is a managed PaaS service that requires no extra VMs.

    In ADF, you can schedule things hourly/daily/whatever and parameterize your API calls to filter on a date in the API call.

    A couple of things to note if you go with ADF: check out the differences between the HTTP and REST connectors, and if your API call returns a JSON file, think about what you want to use to parse that. ADF has dataflows that spin up a managed Spark cluster to do your transformation; that will work, but it could be expensive if not done efficiently.
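    If the API responses are modest in size, a Spark dataflow may be overkill for the parsing step: a small function that flattens the response into newline-delimited JSON (the layout BigQuery load jobs and most blob-based loaders accept directly) is often enough. A sketch, assuming the {"t": timestamp, "v": value} point shape from the question:

    ```python
    import json

    def to_ndjson(points: list[dict]) -> str:
        """Flatten [{"t": ts, "v": val}, ...] into newline-delimited JSON,
        one record per line, ready to land in blob storage or load into BigQuery."""
        return "\n".join(
            json.dumps({"timestamp": p["t"], "value": p["v"]}) for p in points
        )
    ```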

    I'm not familiar with Precog. Another route, if I were doing this in Azure, would be to use Azure Functions to make the API calls (see the sketch below). You could also use Databricks for this, or use Databricks to call your Azure Functions and then write Python or Spark SQL for your transformation steps.
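    A rough sketch of the Azure Functions route, using the classic Python programming model (the CRON schedule, e.g. "0 0 * * * *" for hourly, lives in the accompanying function.json). The container name, blob path, and API details are placeholders:

    ```python
    import os
    from datetime import datetime, timezone

    import azure.functions as func
    import requests
    from azure.storage.blob import BlobServiceClient

    def main(mytimer: func.TimerRequest) -> None:
        """Runs on the timer defined in function.json; pulls the latest window
        and lands the raw payload in blob storage for downstream transformation."""
        resp = requests.get(
            "https://api.glassnode.com/v1/metrics/market/price_usd_close",  # illustrative
            params={"a": "BTC", "i": "1h", "api_key": os.environ["GLASSNODE_API_KEY"]},
            timeout=30,
        )
        resp.raise_for_status()

        now = datetime.now(timezone.utc)
        service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
        blob = service.get_blob_client("crypto-raw", f"glassnode/{now:%Y/%m/%d/%H}.json")
        blob.upload_blob(resp.text, overwrite=True)
    ```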

    In AWS or GCP, you could also look at Matillion. Other common options in AWS include Datameer and Stitch (Talend). I'm not familiar enough with them to know whether what you are asking for is easy in those tools.