I'm trying to connect from PySpark to a MySQL database via JDBC. I was able to do it outside EMR, but when I try it with EMR, PySpark doesn't start correctly.
The command I used on my machine:
pyspark --conf spark.executor.extraClassPath=/home/hadoop/mysql-connector-java-5.1.38-bin.jar --driver-class-path /home/hadoop/mysql-connector-java-5.1.38-bin.jar --jars /home/hadoop/mysql-connector-java-5.1.38-bin.jar
and got the following output:
16/05/18 14:29:21 INFO Client: Application report for application_1463578502297_0011 (state: FAILED)
16/05/18 14:29:21 INFO Client:
client token: N/A
diagnostics: Application application_1463578502297_0011 failed 2 times due to AM Container for appattempt_1463578502297_0011_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://ip-10-24-0-75.ec2.internal:8088/cluster/app/application_1463578502297_0011Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1463578502297_0011_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1463581754050
final status: FAILED
tracking URL: http://ip-10-24-0-75.ec2.internal:8088/cluster/app/application_1463578502297_0011
user: hadoop
16/05/18 14:29:21 INFO Client: Deleting staging directory .sparkStaging/application_1463578502297_0011
16/05/18 14:29:21 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
I also tried using no extra jar and connecting with mariadb.jdbc, which I've read is the default driver:
from pyspark.sql import SQLContext
sqlctx = SQLContext(sc)
df = sqlctx.read.format("jdbc") \
    .option("url", "jdbc:mysql://ip:port/db") \
    .option("driver", "com.mariadb.jdbc.Driver") \
    .option("dbtable", "...") \
    .option("user", "....") \
    .option("password", "...") \
    .load()
but I get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 139, in load
return self._df(self._jreader.load())
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.load.
: java.lang.ClassNotFoundException: com.mariadb.jdbc.Driver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:45)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:120)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:57)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
How should this be done?
Thank you, Pedro Rosanes.
To run any Spark job on Amazon EMR 3.x or EMR 4.x, you need to do the following:
1) You can set spark-defaults.conf properties while bootstrapping, i.e. you can change the driver classpath and executor classpath configuration, and also maximizeResourceAllocation (ask for more info in the comments if you need it). docs
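As a sketch of step 1, the two classpath properties in spark-defaults.conf would look like this (using the jar path from the question; adjust to wherever the connector lives on your nodes):

```
spark.driver.extraClassPath    /home/hadoop/mysql-connector-java-5.1.38-bin.jar
spark.executor.extraClassPath  /home/hadoop/mysql-connector-java-5.1.38-bin.jar
```

If these paths are set cluster-wide, the jar must actually exist at that path on every node, which is why step 2 matters.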
2) You need to download all the required jars (in your case the MySQL and MariaDB JDBC connector jars, mysql-connector.jar and mariadb-connector.jar) to all the classpath locations (Spark, YARN and Hadoop) on every node, whether it is a MASTER, CORE or TASK node (the Spark-on-YARN scenario covers most cases). bootstrap scripts docs
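A bootstrap script for step 2 might look roughly like the sketch below. The download URL, connector version, and target directories are assumptions; check them against your EMR release's actual layout:

```shell
#!/bin/bash
# Hypothetical bootstrap action: fetch the MySQL JDBC connector and copy it
# into the Spark and Hadoop classpath directories on every node.
set -e

JAR_URL="https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar"
JAR_NAME="mysql-connector-java-5.1.38.jar"

cd /home/hadoop
wget -q "$JAR_URL"

# Copy into the directories Spark and Hadoop/YARN load jars from
# (paths vary by EMR release; these are illustrative).
sudo cp "$JAR_NAME" /usr/lib/spark/lib/
sudo cp "$JAR_NAME" /usr/lib/hadoop/lib/
```

Because bootstrap actions run on every node as it starts, this covers master, core and task nodes without manual copying.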
3) If your Spark job communicates with your database only from the driver node, then using --jars alone may be enough; it won't give you the exception and will work fine.
4) I also recommend trying the master as yarn-cluster instead of local or yarn-client.
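For step 4, a submission in yarn-cluster mode might look like the following (the script name is a placeholder; the jar path is taken from the question):

```shell
spark-submit \
  --master yarn-cluster \
  --jars /home/hadoop/mysql-connector-java-5.1.38-bin.jar \
  --driver-class-path /home/hadoop/mysql-connector-java-5.1.38-bin.jar \
  --conf spark.executor.extraClassPath=/home/hadoop/mysql-connector-java-5.1.38-bin.jar \
  your_job.py
```

In yarn-cluster mode the driver runs inside the cluster rather than on the machine you submit from, so the jar still has to be reachable on the nodes.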
In your case, whether you use MariaDB or MySQL, copy your jars to $SPARK_HOME/lib, $HADOOP_HOME/lib, etc. on each and every node of your cluster and then give it a try.
Later on, you can use bootstrap actions to copy your jars onto all the nodes at cluster-creation time.
Please comment below for more info.
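Attaching such a bootstrap action at cluster creation can be sketched with the AWS CLI; the bucket, script name, release label and instance sizing here are all hypothetical placeholders:

```shell
aws emr create-cluster \
  --name "spark-jdbc-cluster" \
  --release-label emr-4.6.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --bootstrap-actions Path=s3://your-bucket/copy-jdbc-jars.sh,Name=CopyJdbcJars \
  --use-default-roles
```

The script referenced by Path must be uploaded to S3 beforehand and is executed on every node before applications start.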