docker · aws-lambda · pipeline · amazon-ecs

Scale out jobs with high memory consumption but low computing power within AWS and using Docker: finding the best solution


Mouthful of a title, but the point is this: I have some data science pipelines (Python-based) with these requirements (see the sketch after the list for the rough shape of one job):

  1. are orchestrated by an "internal" orchestrator running on a server
  2. are run across a number of users/products/etc., where N could be relatively high
  3. the "load" of these jobs is what I want to distribute, without being tethered to the orchestrator server
  4. these jobs are backed by a docker image
  5. these jobs are relatively fast to run (from 1 second to 20 seconds, post data load)
  6. these jobs most often require considerable I/O, both in and out
  7. no spark required
  8. I want minimal hassle with scaling/provisioning/etc
  9. data (in/out) would be stored in either an HDFS space in a cluster or AWS S3
  10. the docker image would be relatively large (it encompasses a data science stack)

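To make that concrete, here is a rough sketch of what one job looks like; the bucket names, keys, and the pandas stand-in for the model are placeholders for illustration, not my actual pipeline:

```python
# Rough sketch of one job's shape. Bucket/key names and the "model"
# (df.describe) are placeholders for illustration only.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def run_job(user_id: str, in_bucket: str, out_bucket: str) -> None:
    # Heavy I/O in: pull this user's input data from S3.
    obj = s3.get_object(Bucket=in_bucket, Key=f"inputs/{user_id}.parquet")
    df = pd.read_parquet(io.BytesIO(obj["Body"].read()))

    # 1-20 seconds of actual compute (stand-in for the real model).
    result = df.describe()

    # Heavy I/O out: push the results back to S3.
    buf = io.BytesIO()
    result.to_parquet(buf)
    s3.put_object(Bucket=out_bucket,
                  Key=f"outputs/{user_id}.parquet",
                  Body=buf.getvalue())
```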
I was trying to find the solution that is both (a) cost-efficient and (b) fast for parallelizing this. Candidates so far:

  1. AWS ECS
  2. AWS lambda with Container Image Support

Please note that, for all intents and purposes, scaling/computing within the cluster is not feasible.

My issue is that I worry about the tradeoffs: huge data transfers (in aggregate terms), huge costs from pulling the Docker image many times, time spent setting up containers on servers relative to the very little time spent doing actual work, and, in the Lambda case, serverless management and debugging when things go wrong.

How are these kinds of cases generally handled?


Solution

  • This is a very good question. First and foremost, I would assume you are comparing Lambda to ECS/Fargate (more here for background on Fargate). While many considerations hold true for ECS/EC2, ECS/Fargate is a closer model to Lambda.

    Having said that, Fargate and Lambda are different enough that it's hard to make an apples-to-apples comparison between the two without taking into account their different programming and execution models (event-driven vs. service-based). This isn't to say that you can't invoke batch jobs to run on Fargate the way you'd invoke a Lambda, but 1) with this relatively short execution time (1-20 seconds) and 2) at the scale you are alluding to, invoking a Fargate task on demand per execution unit may be too penalizing (e.g. because of the limited granularity of task sizes and because task start times are in the 30-60 second range, compared to Lambda's milliseconds). A better comparison in this case would be a Lambda invocation model per job vs. a number of running (and horizontally scalable) ECS/Fargate tasks that can each support multiple jobs.
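    To make the per-job Lambda model concrete, here is a minimal sketch (the function name and payload shape are assumptions for illustration): the orchestrator fires one asynchronous invocation per job and lets Lambda absorb the concurrency.

    ```python
    import json

    import boto3

    lam = boto3.client("lambda")

    def fan_out(job_ids: list[str]) -> None:
        for job_id in job_ids:
            # InvocationType="Event" is fire-and-forget: Lambda queues the
            # call and scales concurrency itself, which is the hands-off
            # appeal of this model.
            lam.invoke(
                FunctionName="ds-pipeline-job",  # hypothetical function name
                InvocationType="Event",
                Payload=json.dumps({"job_id": job_id}),
            )
    ```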

    Something that you are not mentioning in your analysis is whether these jobs already exist or whether they would need to be adapted for one or more of these different models (Lambda 1:1, Fargate 1:1, Fargate 1:many). Some customers may decide to stick to a specific model because they can't afford to tweak the existing code base.
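    For contrast, the Fargate 1:many model usually means a long-running task draining a work queue, so each container amortizes its 30-60 second start-up across many 1-20 second jobs. A minimal sketch, assuming the jobs arrive via an SQS queue (the queue URL and run_job are placeholders):

    ```python
    import json

    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder
    sqs = boto3.client("sqs")

    def run_job(job_id: str) -> None:
        ...  # stand-in for the actual data science job

    def worker_loop() -> None:
        # Runs for the lifetime of the Fargate task, processing many jobs.
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,  # long polling to cut idle API calls
            )
            for msg in resp.get("Messages", []):
                run_job(json.loads(msg["Body"])["job_id"])
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])
    ```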

    In general I would say that, if the software needs to be created from scratch, the Lambda model with its hands-off approach seems to be a slightly better fit for this use case.

    But in terms of which will be cheaper, it's a hard call to make in theory.
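    You can at least frame the comparison with a back-of-envelope calculation. Everything below is illustrative only: the prices are us-east-1 list prices as I remember them at the time of writing (verify against current pricing), and the job profile is an assumption, so plug in your own numbers.

    ```python
    # Illustrative back-of-envelope only; all prices and the job profile
    # (1M jobs, 10 s each, 2 GB memory) are assumptions to be replaced.
    LAMBDA_GB_SECOND = 0.0000166667    # USD per GB-second (x86)
    LAMBDA_REQUEST = 0.20 / 1_000_000  # USD per invocation
    FARGATE_VCPU_HOUR = 0.04048        # USD per vCPU-hour
    FARGATE_GB_HOUR = 0.004445         # USD per GB-hour

    jobs = 1_000_000
    seconds_per_job = 10
    mem_gb = 2  # "high memory, low compute" profile from the title

    lambda_cost = jobs * (seconds_per_job * mem_gb * LAMBDA_GB_SECOND
                          + LAMBDA_REQUEST)

    # Fargate 1:many: assume 0.5 vCPU / 4 GB tasks kept fully busy, so the
    # fleet burns (jobs * seconds) of task time in aggregate. Real packing
    # is never perfect, which pushes this number up.
    task_hours = jobs * seconds_per_job / 3600
    fargate_cost = task_hours * (0.5 * FARGATE_VCPU_HOUR
                                 + 4 * FARGATE_GB_HOUR)

    print(f"Lambda  ~${lambda_cost:,.0f}")
    print(f"Fargate ~${fargate_cost:,.0f}")
    ```

    Under these (optimistic) packing assumptions the always-busy Fargate fleet can come out cheaper per unit of work, while Lambda charges you nothing for idle; which wins depends entirely on your real utilization and memory sizing, which is why it's hard to call in theory.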