
How to connect AWS Glue to a VPC, and access private resources?


I am trying to connect to services and databases running inside a VPC (in private subnets) from an AWS Glue job. The private resources should not be exposed publicly (e.g., by moving them to a public subnet or setting up public load balancers).

Unfortunately, AWS Glue doesn't seem to support running inside user-defined VPCs. AWS does provide something called Glue Database Connections which, when used with the Glue SDK, magically set up elastic network interfaces inside the specified VPC for the Glue/Spark worker nodes. These network interfaces then tunnel traffic from Glue to a specific database inside the VPC. However, this requires the location and credentials of a specific database, and it is not clear whether or when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.

Is there a reliable way to set up a Glue -> VPC connection that will tunnel all traffic through a VPC?


Solution

  • You can create a Glue connection with the NETWORK connection type and attach that connection to your Glue job. The job can then call a REST API or any other resource within your VPC.


    https://docs.aws.amazon.com/glue/latest/dg/connection-using.html

    Network (designates a connection to a data source within an Amazon Virtual Private Cloud environment (Amazon VPC))


    https://docs.aws.amazon.com/glue/latest/dg/connection-JDBC-VPC.html

    To allow AWS Glue to communicate with its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC and not open it to all networks.
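
The self-referencing rule quoted above can also be added programmatically. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names `self_referencing_tcp_rule` and `apply_self_referencing_rule` are hypothetical, and the security group ID is a placeholder you would replace with your own.

```python
def self_referencing_tcp_rule(security_group_id: str) -> dict:
    """Build keyword arguments for EC2 authorize_security_group_ingress.

    The rule allows inbound TCP on all ports, but only from members of
    the same security group (the source is the group itself).
    """
    return {
        "GroupId": security_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 0,
                "ToPort": 65535,
                # Source is the group itself, not a CIDR range.
                "UserIdGroupPairs": [{"GroupId": security_group_id}],
            }
        ],
    }


def apply_self_referencing_rule(security_group_id: str) -> None:
    """Add the rule to an existing security group (hypothetical helper)."""
    import boto3  # imported lazily so the builder above works offline

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        **self_referencing_tcp_rule(security_group_id)
    )
```

Because the source is the group's own ID rather than an address range, the rule does not open the group to other networks.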

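
For readers who prefer the SDK to the console, here is a rough sketch of creating the NETWORK connection with boto3. It assumes boto3 is installed and credentials are configured; the function names and every ID passed in are hypothetical placeholders, and the security group is assumed to already carry the self-referencing rule described in this answer.

```python
def network_connection_input(
    name: str,
    subnet_id: str,
    security_group_ids: list,
    availability_zone: str,
) -> dict:
    """Build the ConnectionInput payload for a NETWORK-type Glue connection."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # A NETWORK connection carries no JDBC URL or credentials.
        "ConnectionProperties": {},
        # Where Glue should place its elastic network interfaces.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(
    name: str,
    subnet_id: str,
    security_group_ids: list,
    availability_zone: str,
) -> None:
    """Create the connection in the Glue Data Catalog (hypothetical helper)."""
    import boto3  # imported lazily so the builder above works offline

    glue = boto3.client("glue")
    glue.create_connection(
        ConnectionInput=network_connection_input(
            name, subnet_id, security_group_ids, availability_zone
        )
    )
```

The job then references the connection by name, for example through the `Connections={"Connections": [name]}` parameter of `create_job`, or by selecting it in the job's connection settings in the console.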