apache-spark, apache-spark-sql, amazon-emr, databricks

Connection timed out exception with spark-redshift on EMR


I am using the spark-redshift library provided by Databricks to read data from a Redshift table in Spark. Link: https://github.com/databricks/spark-redshift.

Note: In my case, the Redshift cluster and the EMR cluster are in different AWS accounts.

I am able to connect to Redshift using spark-redshift in Spark local mode, but the same code fails on EMR with the following exception: java.sql.SQLException: Error setting/closing connection: Connection timed out.

I tried adding a Redshift inbound rule to the EC2 security group of my EMR cluster, with Source set to My IP, but it didn't help.
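
Before changing any network settings, it can help to confirm that the timeout is a plain network reachability problem rather than something specific to spark-redshift or JDBC. The sketch below is only an illustration: `can_connect` is a hypothetical helper name, and it assumes Python is available on the EMR master node. It simply tries to open a TCP connection to the Redshift endpoint (5439 is the default Redshift port; use whatever port your cluster is configured with).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

Running `can_connect("<your-redshift-endpoint>", 5439)` on the local machine and again on an EMR node should show whether the endpoint is reachable from each side. If it returns False only on EMR, the problem is likely in the network path between the two VPCs.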


Solution

  • I found a solution using VPC peering: http://docs.aws.amazon.com/AmazonVPC/latest/PeeringGuide/Welcome.html

    We connected the Redshift and EMR VPCs with a VPC peering connection and updated the route tables of each VPC to send traffic destined for the other VPC's IPv4 CIDR block through the peering connection. VPC peering also works across AWS accounts. Refer to the link above for more details.

    Once this is done, open the VPC peering connection in both accounts and enable DNS resolution from the peer VPC: select the VPC peering connection, open the Actions menu at the top, choose Edit DNS settings, and select Allow DNS resolution from peer VPC.
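
Two quick sanity checks for the setup above, as a rough sketch using only the Python standard library. The helper names `cidrs_overlap` and `resolves_to_private` are made up for illustration, and the CIDR values in the usage note are placeholders. The first check reflects the fact that VPC peering requires the two VPCs to have non-overlapping CIDR blocks. The second assumes that, once DNS resolution from the peer VPC is enabled, the Redshift endpoint should resolve to a private IP address when looked up from an EMR node.

```python
import ipaddress
import socket


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two IPv4 CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


def resolves_to_private(host: str) -> bool:
    """Return True if every IPv4 address `host` resolves to is a private address."""
    infos = socket.getaddrinfo(host, None, family=socket.AF_INET)
    # Each entry's sockaddr is an (address, port) pair for AF_INET.
    addresses = {info[4][0] for info in infos}
    return all(ipaddress.ip_address(addr).is_private for addr in addresses)
```

For example, `cidrs_overlap("10.0.0.0/16", "10.1.0.0/16")` is False, so two VPCs with those blocks could be peered. Calling `resolves_to_private` with the Redshift endpoint host name from an EMR node gives a hint about whether the DNS setting has taken effect.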