I am trying to run multiple ECS tasks on the same EC2 instance. It is a g4dn.xlarge, which has 1 GPU, 4 vCPUs, and 16 GB of memory.
I am using this workaround to allow the GPU to be shared between tasks: https://github.com/aws/containers-roadmap/issues/327
However, when I launch multiple tasks, the second one stays stuck in the PROVISIONING state until the first one finishes.
CloudWatch shows that the CPUUtilization is below 50% for the entire duration of each task.
This is my current CDK code:
const taskDefinition = new TaskDefinition(this, 'TaskDefinition', {
  compatibility: Compatibility.EC2
})
const container = taskDefinition.addContainer('Container', {
  image: ContainerImage.fromEcrRepository(<image>),
  entryPoint: ["python", "src/script.py"],
  workingDirectory: "/root/repo",
  startTimeout: Duration.minutes(5),
  stopTimeout: Duration.minutes(60),
  memoryReservationMiB: 8192,
  logging: LogDriver.awsLogs({
    logGroup: logGroup,
    streamPrefix: 'prefix',
  }),
})
const startUpScript = UserData.forLinux()
// Hack for allowing tasks to share the same GPU
// https://github.com/aws/containers-roadmap/issues/327
startUpScript.addCommands(
  `(grep -q ^OPTIONS=\\"--default-runtime /etc/sysconfig/docker && echo '/etc/sysconfig/docker needs no changes') || (sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker && echo '/etc/sysconfig/docker updated to have nvidia runtime as default' && systemctl restart docker && echo 'Restarted docker')`
)
const launchTemplate = new LaunchTemplate(this, 'LaunchTemplate', {
  machineImage: EcsOptimizedImage.amazonLinux2(
    AmiHardwareType.GPU
  ),
  detailedMonitoring: false,
  instanceType: InstanceType.of(InstanceClass.G4DN, InstanceSize.XLARGE),
  userData: startUpScript,
  role: <launchTemplateRole>,
})
const autoScalingGroup = new AutoScalingGroup(this, 'AutoScalingGroup', {
  vpc: vpc,
  minCapacity: 0,
  maxCapacity: 1,
  desiredCapacity: 0,
  launchTemplate: launchTemplate,
})
const capacityProvider = new AsgCapacityProvider(this, 'AsgCapacityProvider', {
  autoScalingGroup: autoScalingGroup,
})
cluster.addAsgCapacityProvider(capacityProvider)
Edit:
The issue still persists after assigning CPU and memory amounts to the task definition.
I got it working by setting both the task size and the container size so that the total across all tasks is less than what the instance has. Although the instance has 16 GB of RAM and 4 vCPUs, some RAM and CPU must be left over for the instance itself before ECS will place another task on it. So two tasks with 2 vCPUs and 8 GB of RAM each won't work, but two tasks with 1 vCPU and 4 GB of RAM each will.
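To make the arithmetic concrete, here is a rough sketch of the fit check as I understand it. It is not ECS's actual placement code. The names `InstanceCapacity`, `TaskSize`, and `tasksThatFit` are made up for illustration, and the 15500 MiB figure is an assumed placeholder for the memory the instance registers with the cluster, not a measured value. The real numbers for your instance appear under `registeredResources` and `remainingResources` in the output of `aws ecs describe-container-instances`.

```typescript
// Resources ECS can place tasks against on one container instance.
interface InstanceCapacity {
  cpuUnits: number;  // 1024 CPU units per vCPU
  memoryMiB: number; // memory registered with the cluster, below the physical total
}

// Size requested by one task (task-level CPU units and memory).
interface TaskSize {
  cpuUnits: number;
  memoryMiB: number;
}

// Number of identical tasks that fit, limited by whichever resource runs out first.
function tasksThatFit(instance: InstanceCapacity, task: TaskSize): number {
  const byCpu = Math.floor(instance.cpuUnits / task.cpuUnits);
  const byMemory = Math.floor(instance.memoryMiB / task.memoryMiB);
  return Math.min(byCpu, byMemory);
}

// ASSUMPTION: 15500 MiB is an illustrative registered-memory value for a
// 16 GB instance; check describe-container-instances for the real one.
const g4dnXlarge: InstanceCapacity = { cpuUnits: 4 * 1024, memoryMiB: 15500 };

const largeTask: TaskSize = { cpuUnits: 2048, memoryMiB: 8192 }; // 2 vCPU, 8 GB
const smallTask: TaskSize = { cpuUnits: 1024, memoryMiB: 4096 }; // 1 vCPU, 4 GB

console.log(tasksThatFit(g4dnXlarge, largeTask)); // second 8 GB task does not fit
console.log(tasksThatFit(g4dnXlarge, smallTask)); // several 4 GB tasks fit
```

Under these assumptions, memory is the limiting resource: two 8192 MiB tasks ask for the full 16384 MiB, which is more than the instance registers, so the second task waits. In CDK the corresponding settings would be `cpu` and `memoryMiB` on the `TaskDefinition`, plus `cpu` and `memoryLimitMiB` or `memoryReservationMiB` on the container.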