Terraform AWS & EMR run : VPC destruction issue

I have a Terraform project that deploys a VPC, buckets, Lambdas and databases. One of my Lambdas creates an EMR cluster using the boto3 Python library.

import os

import boto3

s3 = boto3.client('s3')
session = boto3.session.Session()
client = session.client('emr', region_name=os.environ['region'])
job_id = client.run_job_flow(
            Name=os.environ['emr_name'],
            LogUri=os.environ['log_dir'],
            Instances={
                'MasterInstanceType': os.environ['master_instance'],
                'SlaveInstanceType': os.environ['slave_instance'],
                'InstanceCount': int(os.environ['slave_instances_count']),
                'Ec2SubnetId': os.environ['ec2_subnet_id']
            },
            ReleaseLabel='emr-5.13.0',
            Applications = [
                {'Name': 'Hadoop'},
                {'Name': 'Hive'},
                {'Name': 'Spark'}
            ],
            Steps=[
              # my steps...
            ],
            BootstrapActions=[
                 ...
            ],
            ServiceRole='EMR_DefaultRole',
            JobFlowRole='EMR_EC2_DefaultRole',
            VisibleToAllUsers=True
        )

Everything works well, I'm satisfied.

But when I run terraform destroy, it fails to delete the VPC that Terraform created (it fails on a timeout).

I guess this is due to the 2 Security Groups created for the EMR run. Terraform is not aware of them, and hence cannot delete the VPC because the 2 SGs are still in it.

Looking at the AWS web console, I can see the 2 SGs:

sg-xxxxxxx
ElasticMapReduce-slave
eouti-vpc
Slave group for Elastic MapReduce created on <date>

sg-xxxxxxx
ElasticMapReduce-master
eouti-vpc
Master group for Elastic MapReduce created on <date>

How can I solve this? Should Terraform create these SGs? What would you advise?

Cheers

Solution

When building Terraform that interfaces with other components (such as direct API calls, or another Terraform run sharing info through terraform_remote_state), you have to be aware of the dependencies between the different components. Determining where the separation belongs is an art, much like writing other types of software.

In the documentation for run_job_flow(**kwargs), it looks like you can pass in several types of security groups inside the Instances argument:

'EmrManagedMasterSecurityGroup': 'string',
'EmrManagedSlaveSecurityGroup': 'string',
'ServiceAccessSecurityGroup': 'string',
'AdditionalMasterSecurityGroups': [
    'string',
],
'AdditionalSlaveSecurityGroups': [
    'string',
]

I'm not sure how these security groups are being created for you now (the docs don't mention creating temporary security groups, and if they were temporary it would be odd for them to be left hanging around afterwards), but it seems logical to do all of your networking in Terraform and then output those IDs for consumption by your Lambda. This is the method Krishna mentioned.

I like this because it encapsulates the networking infrastructure in one place (Terraform), and gives clear inputs and outputs to the code that uses that resource (your run_job_flow Lambda).
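As a rough sketch of the Lambda side of that approach: assuming Terraform creates the two security groups and exposes their IDs to the Lambda as environment variables, the Instances argument could be built like this. The variable names emr_master_sg_id and emr_slave_sg_id and the helper build_instances_config are made up for illustration; the other variables are the ones from the question.

```python
import os


def build_instances_config(env=os.environ):
    """Build the Instances argument for run_job_flow from environment variables."""
    return {
        'MasterInstanceType': env['master_instance'],
        'SlaveInstanceType': env['slave_instance'],
        'InstanceCount': int(env['slave_instances_count']),
        'Ec2SubnetId': env['ec2_subnet_id'],
        # Hypothetical variable names: assumed to hold the IDs of security
        # groups that Terraform created and passed to the Lambda.
        'EmrManagedMasterSecurityGroup': env['emr_master_sg_id'],
        'EmrManagedSlaveSecurityGroup': env['emr_slave_sg_id'],
    }
```

The existing call would then use Instances=build_instances_config() in place of the inline dict. Since Terraform owns the groups in this setup, it knows about them when terraform destroy runs.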

There doesn't seem to be a strong need to create temporary security groups, but if you did, I would create and destroy them in boto around your existing code. It depends on whether you feel the security groups are part of the networking infrastructure or part of the job. That should guide where you create and destroy them.
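If you went the temporary route, a minimal sketch could wrap the group's lifetime in a context manager. The helper name temporary_security_group is hypothetical, and ec2_client is assumed to be a boto3 EC2 client (boto3.client('ec2')) passed in by the caller; only its create_security_group and delete_security_group calls are used.

```python
from contextlib import contextmanager


@contextmanager
def temporary_security_group(ec2_client, vpc_id, name, description):
    """Create a security group in the VPC and delete it on exit."""
    response = ec2_client.create_security_group(
        GroupName=name,
        Description=description,
        VpcId=vpc_id,
    )
    group_id = response['GroupId']
    try:
        # Hand the new group's ID to the caller for the duration of the job
        yield group_id
    finally:
        # Runs even if the body raises, so the group is not left in the VPC
        ec2_client.delete_security_group(GroupId=group_id)
```

Two caveats with this sketch: run_job_flow returns before the cluster has finished, so the delete can only succeed once the cluster has terminated and nothing uses the group any more; and if rules in other groups still reference this one, they would probably have to be revoked first, which the sketch does not handle.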