I'm relatively new to AWS and am working through strategies that best support a specific business requirement of the service we're developing.
Among our challenges:
Our goal is to keep this data separate from our main database, so that we can independently manage and query it as needed. As such, the strategy we've been considering is:
We're having some problems here though:
Not sure if this should be broken out into three questions - but this seems to be the best way to provide the full context.
A few comments.
We need to pull in a very large data set (hundreds of thousands of records) from a third-party API, which delivers paginated records in max groups of 50;
That means thousands of calls to the third-party API. Elsewhere in the question you mention "several hours". Is this load OK with the provider of that API? Just one thing to consider, in case you haven't.
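To put a rough number on that, here is a back-of-the-envelope sketch. The page size of 50 comes from your question; the 300,000 record count and the 3-hour window are assumed figures standing in for "hundreds of thousands" and "several hours".

```python
import math

PAGE_SIZE = 50            # max records per page, from the question
TOTAL_RECORDS = 300_000   # assumption: stands in for "hundreds of thousands"
WINDOW_HOURS = 3          # assumption: stands in for "several hours"


def calls_needed(total_records: int, page_size: int) -> int:
    """Number of paginated requests needed to fetch every record."""
    return math.ceil(total_records / page_size)


def calls_per_second(total_calls: int, window_hours: float) -> float:
    """Average request rate if the calls are spread evenly over the window."""
    return total_calls / (window_hours * 3600)


total_calls = calls_needed(TOTAL_RECORDS, PAGE_SIZE)
rate = calls_per_second(total_calls, WINDOW_HOURS)
print(f"{total_calls} calls, about {rate:.2f} calls/second")
# prints "6000 calls, about 0.56 calls/second"
```

Comparing that average rate against whatever rate limit the provider publishes is probably the quickest sanity check.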
Making the API calls in a recursive lambda function (due to pagination);
Be extremely careful with recursive Lambda Function calls, i.e., a Lambda Function that asynchronously invokes itself. A bug can leave the Lambda calling itself forever, and then you end up in an endless loop of Lambda calls and increasing charges... It can be stopped, but it's a PITA.
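If you do go the self-invoking route, one possible safeguard is to carry a counter in the event and refuse to continue past a hard limit. This is only a minimal sketch: `MAX_INVOCATIONS`, `process_page` and `invoke_self` are hypothetical names, and in a real function `invoke_self` would wrap an asynchronous invocation of the function itself.

```python
MAX_INVOCATIONS = 10_000  # assumption: hard ceiling, well above the expected page count


def make_handler(process_page, invoke_self, max_invocations=MAX_INVOCATIONS):
    """Build a Lambda-style handler that chains to itself at most max_invocations times.

    process_page(page) -> bool : hypothetical; fetches and stores one page,
                                 returns True if more pages remain
    invoke_self(event) -> None : hypothetical; asynchronously invokes this
                                 same function with `event`
    """
    def handler(event, context=None):
        page = event.get("page", 1)
        invocation = event.get("invocation", 1)

        if invocation > max_invocations:
            # Fail loudly instead of looping (and billing) forever.
            raise RuntimeError(
                f"Aborting: exceeded {max_invocations} chained invocations"
            )

        has_more = process_page(page)
        if has_more:
            # Pass the incremented counter along so the limit survives each hop.
            invoke_self({"page": page + 1, "invocation": invocation + 1})
        return {"page": page, "has_more": has_more}

    return handler
```

The counter does not fix the underlying bug, but it turns an open-ended loop into a bounded one that fails visibly.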
Storing the results of the call in an S3 bucket as a single - or multiple - json files;
If you want to use S3, I'd probably suggest storing the data aggregated into fewer files. You didn't mention the size of each piece of data, but tons of tiny files aren't ideal for S3. On the other hand, a single gigantic file (e.g., high tens or hundreds of GB or more) isn't ideal for later processing either (even though S3 would deal with it without any issues).
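As an illustration of the "fewer, medium-sized files" idea, here is a minimal sketch that groups records into batches and serializes each batch as newline-delimited JSON. The batch size, the key layout and the helper names are assumptions for illustration; the actual upload (for example with the S3 client's `put_object`) is left out so the snippet runs without AWS access.

```python
import json
from typing import Iterable, Iterator, List


def batch_records(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size records, preserving order."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final, possibly smaller, batch.
        yield batch


def to_ndjson(batch: List[dict]) -> str:
    """Serialize a batch as one JSON document per line."""
    return "".join(json.dumps(record) + "\n" for record in batch)


def object_key(prefix: str, index: int) -> str:
    """Hypothetical key layout: <prefix>/part-00000.ndjson, part-00001.ndjson, ..."""
    return f"{prefix}/part-{index:05d}.ndjson"


# Example: 25 fake records in batches of 10 gives 3 objects.
records = [{"id": i} for i in range(25)]
for index, batch in enumerate(batch_records(records, 10)):
    body = to_ndjson(batch)
    key = object_key("imports/run-1", index)
    # Here you would upload `body` under `key` to your bucket.
    print(key, len(batch))
```

Newline-delimited JSON is chosen here only because many downstream tools can read it line by line; a single JSON array per file would work too.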
Two things I'd suggest you investigate: AWS Step Functions and Kinesis Firehose.
Since you'll need to deal with pagination of the 3rd party API, you could define a state machine in Step Functions that will invoke your Lambda for you. The Lambda will do its thing (download a bunch of records, store them somewhere) and return either the number of downloaded records, or the number of pending records, something like that. Then the State Machine of Step Functions will be responsible for the logic of deciding whether to call the Download Lambda again (maybe even with parameters based on the value returned by the prior call) or if it's done.
This way, you have good separation of concerns: a super specific Lambda Function that just ingests stuff, with the pagination logic kept separate in the state machine (and maybe even "parallelism" or "timing" logic, if for some reason you're asked to "slow down" your calls to the 3rd party API).
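A rough sketch of that split is below. The state machine definition is written as a Python dict so it can be dumped to JSON; the ARN, the state names, the 1-second wait and the `page` / `has_more` fields are all assumptions made up for illustration. The sketch also assumes that a page shorter than 50 records means the last page, which may not match how your provider signals the end. `fetch_page` and `store_records` are hypothetical callables you would supply.

```python
import json

PAGE_SIZE = 50  # max page size, from the question

# Placeholder ARN: region, account id and function name are made up.
DOWNLOAD_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:download-page"

# Task -> Choice -> (Wait -> Task) loop, ending in Succeed.
STATE_MACHINE = {
    "Comment": "Download one page per Lambda call until no pages remain",
    "StartAt": "DownloadPage",
    "States": {
        "DownloadPage": {
            "Type": "Task",
            "Resource": DOWNLOAD_LAMBDA_ARN,
            "Next": "MorePages",
        },
        "MorePages": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.has_more", "BooleanEquals": True, "Next": "Throttle"}
            ],
            "Default": "Done",
        },
        # Optional pause between calls, in case the provider asks you to slow down.
        "Throttle": {"Type": "Wait", "Seconds": 1, "Next": "DownloadPage"},
        "Done": {"Type": "Succeed"},
    },
}

definition_json = json.dumps(STATE_MACHINE, indent=2)


def make_download_handler(fetch_page, store_records):
    """Build the ingest-only Lambda handler.

    fetch_page(page, page_size) -> list : hypothetical call to the third-party API
    store_records(page, records) -> None: hypothetical writer (S3, Firehose, ...)
    """
    def handler(event, context=None):
        page = event.get("page", 1)
        records = fetch_page(page, PAGE_SIZE)
        if records:
            store_records(page, records)
        # The return value becomes the input of the next state.
        return {
            "page": page + 1,
            "downloaded": len(records),
            "has_more": len(records) == PAGE_SIZE,  # assumption: short page = last page
        }

    return handler


def run_locally(handler, first_event=None):
    """Mimic the DownloadPage -> MorePages loop without AWS, for local testing."""
    state = handler(first_event or {"page": 1})
    while state["has_more"]:
        state = handler(state)
    return state
```

Note that the Lambda itself contains no loop and no self-invocation; whether to call it again is decided entirely by the `MorePages` choice state.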
Kinesis Firehose is a streaming data pipeline. Basically, you configure a firehose stream to aggregate records for you and to dump them "somewhere" (S3 is a valid target, for example). You choose how you want to aggregate (time, volume of data, for example). And you can even configure Firehose to invoke a Lambda Function to transform each record for you prior to storage (this is where you could, for example, add your 2 unique identifiers).
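For the transformation step, Firehose hands the Lambda a batch of base64-encoded records and expects each one back with the same `recordId`, a `result` and the (re-encoded) `data`. Below is a minimal sketch of such a function, assuming each record is a JSON object. The field names `source_id` and `import_run_id`, and the environment variables they are read from, are hypothetical placeholders for your two identifiers.

```python
import base64
import json
import os

# Hypothetical identifier names and sources; replace with your own two identifiers.
EXTRA_FIELDS = {
    "source_id": os.environ.get("SOURCE_ID", "unknown"),
    "import_run_id": os.environ.get("IMPORT_RUN_ID", "unknown"),
}


def transform(payload: dict) -> dict:
    """Return a copy of the record with the extra identifiers added."""
    return {**payload, **EXTRA_FIELDS}


def handler(event, context=None):
    """Firehose data-transformation handler: decode, enrich, re-encode each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Trailing newline so the aggregated S3 object is newline-delimited JSON.
            data = (json.dumps(transform(payload)) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, TypeError):
            # Not valid base64/JSON, or not a JSON object: hand it back as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records marked `ProcessingFailed` are returned unchanged rather than dropped, so nothing is silently lost if the third-party API sends something unexpected.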