Tags: mysql, hadoop, sqoop

Hadoop/Sqoop Communications Link Error - unable to import table from mysql


I'm trying to import a table from MySQL into the Hadoop filesystem via Sqoop. However, sqoop import launches a MapReduce job that gets stuck at map 0%.

After some time, the job fails with the following error:

2020-08-12 21:46:08,341 INFO mapreduce.Job: Running job: job_1597282205193_0005
2020-08-12 21:46:20,755 INFO mapreduce.Job: Job job_1597282205193_0005 running in uber mode : false
2020-08-12 21:46:20,758 INFO mapreduce.Job:  map 0% reduce 0%
2020-08-12 21:48:43,458 INFO mapreduce.Job: Task Id : attempt_1597282205193_0005_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.RuntimeException: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

For completeness, the full command is:

sqoop import --connect jdbc:mysql://mysqlserver:3306/DSA_ED --table test --username hadoop -P
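
For debugging, the same import can be run with a single mapper and verbose logging (-m 1 and --verbose are standard Sqoop options; this variant is not from my original attempt), which helps isolate whether the failure happens on every node:

# Same import, forced onto one map task with verbose output
sqoop import --connect jdbc:mysql://mysqlserver:3306/DSA_ED --table test --username hadoop -P -m 1 --verbose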

Sqoop even creates the directory, but nothing gets imported:

[hadoop@hdpnms ~]$ hdfs dfs -ls
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2020-08-12 21:59 test
[hadoop@hdpnms ~]$ hdfs dfs -ls test/
[hadoop@hdpnms ~]$

My setup is:

  • Hadoop's namenode runs on a virtual machine (bridged networking), on the same machine as Sqoop, with YARN as the resource manager
  • Each of the two Hadoop datanodes runs on its own virtual machine (bridged as well)
  • MySQL Server runs on the host machine of the VMs

Though the table is really small (7 rows), I've tried increasing the RAM in the VMs (and reconfiguring yarn-site.xml and mapred-site.xml accordingly) as suggested here, and checking the connection to the MySQL server and the user's privileges as suggested here and here.

I'm positive the MySQL server's connection configuration is OK, since I can both list the database's tables with sqoop list-tables (which also tells me the JDBC driver is working) and connect and work on the database via the command-line client mysql -u hadoop -h mysqlserver -p. This makes me think it has to be some misconfiguration in Sqoop, yet I'm stuck and can't make sense of what the problem might be.
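
For reference, these are the checks spelled out, with the same connect string as the import command (the SHOW TABLES query is just an illustrative way to exercise the connection):

# Check the JDBC driver and credentials from the Sqoop host
sqoop list-tables --connect jdbc:mysql://mysqlserver:3306/DSA_ED --username hadoop -P

# Check network reachability and privileges with the plain client
mysql -u hadoop -h mysqlserver -p -e 'SHOW TABLES IN DSA_ED;'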

EDIT 1

So, I tried using a Postgres server on the host machine and also experimented with running a MySQL server on the instance running Hadoop. The same issue occurred both times. I think this definitely rules out connection issues from outside the cluster; it must be something in how Sqoop manages connections during the job.
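
For reference, the Postgres test was the same command with the JDBC URL swapped; hostmachine is a placeholder for the VM host's address, and this assumes the PostgreSQL JDBC driver jar is on Sqoop's classpath:

sqoop import --connect jdbc:postgresql://hostmachine:5432/DSA_ED --table test --username hadoop -P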


Solution

  • In case someone stumbles into the same problem: as it turns out, the issue was with IP permissions, both in the databases' configuration and in the firewall.

    I had assumed all traffic would go through the namenode, so only the namenode's IP was allowed through the firewall and into the database server, which is why I could still reach the databases from other clients on that machine. But it turns out every host in the cluster opens its own connection to the database during the map phase.

    After allowing all of the cluster's IPs (see the sketch after this list), the job ran smoothly. I suppose there may be some configuration that routes the traffic through one machine if necessary, but this settles the question.
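
A minimal sketch of the kind of fix involved, assuming MySQL, a 192.168.1.0/24 cluster subnet, and ufw as the host's firewall (the subnet, password, and firewall tool are placeholders; adapt them to your own setup):

# Let the 'hadoop' user connect from every host in the cluster subnet,
# not just from the namenode (192.168.1.% is a placeholder range).
mysql -u root -p -e "CREATE USER IF NOT EXISTS 'hadoop'@'192.168.1.%' IDENTIFIED BY 'secret'; GRANT ALL PRIVILEGES ON DSA_ED.* TO 'hadoop'@'192.168.1.%'; FLUSH PRIVILEGES;"

# Open MySQL's port to the whole subnet instead of a single IP
# (ufw syntax; use the firewalld/iptables equivalent if applicable).
sudo ufw allow from 192.168.1.0/24 to any port 3306 proto tcp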