amazon-web-services amazon-s3 aws-lambda amazon-rds aws-glue

Best strategy to consume large amounts of third-party API data using AWS?

I'm relatively new to AWS and am working through strategies that best support a specific business requirement of the service we're developing.

Among our challenges:

We need to pull in a very large data set (hundreds of thousands of records) from a third-party API, which delivers paginated records in max groups of 50;
We need to assign two unique, internal keys to each of the imported records;
We need to update the imported records by making regularly scheduled calls for updated and new records; and
In time, we will be adding records from additional sources, and will need to reconcile (match, de-duplicate) data from multiple sources.

Our goal is to keep this data separate from our main database, so that we can independently manage and query it as needed. As such, the strategy we've been considering is:

Making the API calls in a recursive lambda function (due to pagination);
Storing the results of the calls in an S3 bucket as one or more JSON files;
Pulling the S3 data into a non-relational DB.

We're having some problems here though:

Given that the initial import will take several hours, our lambda is timing out at 15 min (hard limit);
What is the best way to assign our own unique keys to the incoming data (one key should ideally be generated by taking incoming data and reformatting it to our needs); and
What is the best strategy to update these records with updated information from the source or a third party?

Not sure if this should be broken out into three questions - but this seems to be the best way to provide the full context.

Solution

A few comments.

We need to pull in a very large data set (hundreds of thousands of records) from a third-party API, which delivers paginated records in max groups of 50;

That works out to thousands of calls to the third-party API (for example, 500,000 records at 50 per page is 10,000 calls), and elsewhere in the question you mention "several hours". Is the API provider OK with that load? Just something to consider, in case you haven't.
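To make that arithmetic concrete, here is a rough back-of-the-envelope helper. The record count and the one-call-per-second rate are assumptions for illustration only; the real numbers depend on your data set and on whatever rate limit the provider enforces.

```python
import math


def estimate_calls(total_records: int, page_size: int = 50) -> int:
    """Number of paginated requests needed to fetch every record."""
    return math.ceil(total_records / page_size)


def estimate_hours(total_records: int, page_size: int = 50,
                   calls_per_second: float = 1.0) -> float:
    """Wall-clock hours if calls are made sequentially at a fixed rate.

    calls_per_second is an assumed throttle, not a documented limit.
    """
    return estimate_calls(total_records, page_size) / calls_per_second / 3600


if __name__ == "__main__":
    print(estimate_calls(500_000))            # 10000
    print(round(estimate_hours(500_000), 2))  # 2.78
```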

Making the API calls in a recursive lambda function (due to pagination);

Be extremely careful with recursive Lambda function calls, i.e., a Lambda function that asynchronously invokes itself. If a bug keeps the function from ever reaching its stop condition, you end up in an endless loop of invocations and mounting charges. It can be stopped, but it's a PITA.

Storing the results of the call in an S3 bucket as a single - or multiple - json files;

If you want to use S3, I'd probably suggest storing the data aggregated into fewer files. You didn't mention the size of each piece of data, but tons of tiny files isn't ideal for S3. On the other hand, a single gigantic file (e.g., high tens or hundreds of GB or more) isn't ideal for later processing either, even though S3 itself would deal with it without any issues.
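As a rough sketch of that middle ground, the snippet below groups records into newline-delimited JSON files of a fixed record count. The function names, the `raw/import` prefix and the 10,000-records-per-file default are all made up for illustration; `put_object` is the standard boto3 S3 client call, and the client is passed in so the batching logic can be tried without AWS credentials.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, Tuple


def batch_to_ndjson(records: Iterable[dict],
                    records_per_file: int = 10_000,
                    prefix: str = "raw/import") -> Iterator[Tuple[str, bytes]]:
    """Yield (s3_key, body) pairs, one newline-delimited JSON file per batch."""
    it = iter(records)
    part = 0
    while True:
        chunk = list(islice(it, records_per_file))
        if not chunk:
            return
        lines = (json.dumps(r, separators=(",", ":")) for r in chunk)
        body = "\n".join(lines) + "\n"
        yield f"{prefix}/part-{part:05d}.jsonl", body.encode("utf-8")
        part += 1


def upload_batches(s3_client, bucket: str, records: Iterable[dict],
                   records_per_file: int = 10_000,
                   prefix: str = "raw/import") -> int:
    """Upload each batch with put_object; returns the number of files written.

    s3_client is expected to be boto3.client("s3") or anything with a
    compatible put_object method.
    """
    count = 0
    for key, body in batch_to_ndjson(records, records_per_file, prefix):
        s3_client.put_object(Bucket=bucket, Key=key, Body=body)
        count += 1
    return count
```

One JSON object per line tends to be convenient for later processing, since many tools can read such files line by line without loading the whole file.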

Two things I'd suggest you investigate:

Step Functions.

Since you'll need to deal with pagination of the 3rd party API, you could define a state machine in Step Functions that will invoke your Lambda for you. The Lambda will do its thing (download a bunch of records, store them somewhere) and return either the number of downloaded records, or the number of pending records, something like that. Then the State Machine of Step Functions will be responsible for the logic of deciding whether to call the Download Lambda again (maybe even with parameters based on the value returned by the prior call) or if it's done.

This way you get a good separation of concerns: a very specific Lambda function that just ingests data, with the pagination logic kept in the state machine (and maybe even "parallelism" or "timing" logic, if for some reason you're asked to slow down your calls to the 3rd party API).
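A minimal sketch of how that split could look, with several assumptions: `fetch_page_stub` stands in for the real third-party call (its page-number pagination is invented), the function ARN is a placeholder, and the state machine is written as a Python dict mirroring an Amazon States Language definition. The Lambda handles one page per invocation and returns the state needed for the next one; the Choice state decides whether to loop.

```python
from typing import Callable, List, Tuple

PAGE_SIZE = 50


def fetch_page_stub(page: int) -> Tuple[List[dict], bool]:
    """Hypothetical stand-in for the third-party API: 120 fake records."""
    total = 120
    start = (page - 1) * PAGE_SIZE
    records = [{"id": i} for i in range(start, min(start + PAGE_SIZE, total))]
    return records, start + PAGE_SIZE < total


def handler(event, context=None,
            fetch_page: Callable = fetch_page_stub,
            store: Callable = lambda records: None):
    """Download one page, store it, and report progress to the state machine."""
    page = event.get("page", 1)
    records, has_more = fetch_page(page)
    store(records)  # e.g. write to S3 or push to a stream
    return {
        "page": page + 1,
        "downloaded": event.get("downloaded", 0) + len(records),
        "done": not has_more,
    }


# Assumed state machine: call the Lambda, check "done", wait, repeat.
STATE_MACHINE = {
    "StartAt": "DownloadPage",
    "States": {
        "DownloadPage": {
            "Type": "Task",
            # Placeholder ARN, replace with your own function.
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:download-page",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "IntervalSeconds": 5, "MaxAttempts": 3,
                       "BackoffRate": 2.0}],
            "Next": "MorePages",
        },
        "MorePages": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.done", "BooleanEquals": True,
                         "Next": "Finished"}],
            "Default": "Throttle",
        },
        # The Wait state is where a "slow down" request would be honoured.
        "Throttle": {"Type": "Wait", "Seconds": 1, "Next": "DownloadPage"},
        "Finished": {"Type": "Succeed"},
    },
}


def run_locally(max_iterations: int = 1000) -> dict:
    """Simulate the state machine loop locally, with a hard iteration cap."""
    state: dict = {}
    for _ in range(max_iterations):
        state = handler(state)
        if state["done"]:
            break
    return state
```

One thing to watch: Standard workflows have an execution history quota of 25,000 events, and each loop iteration consumes several of them. With thousands of pages it is probably worth having each invocation fetch many pages (as many as fit comfortably inside the Lambda timeout) rather than a single one.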

Kinesis Firehose

Kinesis Firehose (since renamed Amazon Data Firehose) is a streaming data pipeline. Basically, you configure a Firehose stream to aggregate records for you and dump them "somewhere" (S3 is a valid target, for example). You choose how you want to aggregate (by time or by volume of data, for example). You can even configure Firehose to invoke a Lambda function to transform each record prior to storage (this is where you could, for example, add your two unique identifiers).
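For illustration, a transformation Lambda along those lines might look like the sketch below. The event shape (`records`, `recordId`, base64 `data`, and a `result` of `Ok` or `ProcessingFailed`) follows the Firehose data-transformation contract; everything else is assumed, including the `source` and `id` field names, the key format, and the `KEY_NAMESPACE` value.

```python
import base64
import json
import uuid

# Hypothetical namespace for deterministic internal IDs.
KEY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.internal")


def derive_key(record: dict) -> str:
    """Build a key by reformatting incoming fields (field names are assumed)."""
    source = str(record.get("source", "unknown")).strip().lower()
    ext_id = str(record.get("id", "")).strip().lower()
    return f"{source}:{ext_id}"


def handler(event, context=None):
    """Add two internal keys to every record passing through Firehose."""
    output = []
    for rec in event["records"]:
        try:
            payload = json.loads(base64.b64decode(rec["data"]))
            payload["derived_key"] = derive_key(payload)
            # uuid5 is deterministic: the same derived key yields the same ID.
            payload["internal_id"] = str(
                uuid.uuid5(KEY_NAMESPACE, payload["derived_key"]))
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": rec["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("ascii"),
            })
        except (ValueError, TypeError):
            # Undecodable or non-object payloads are flagged, not dropped.
            output.append({
                "recordId": rec["recordId"],
                "result": "ProcessingFailed",
                "data": rec["data"],
            })
    return {"records": output}
```

Using a deterministic ID here is a design choice rather than a requirement: it means a scheduled re-import of the same source record maps to the same internal key, which may make the later update and de-duplication steps simpler. A random `uuid.uuid4()` would work too if the mapping is stored elsewhere.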