Tags: mysql, scala, apache-spark, amazon-emr

AWS EMR Spark exception on jdbc datasource load


I'm spinning up an emr-5.31.0 AWS EMR cluster with Spark 2.4.6 on board, then logging into spark-shell on the master node and following this tutorial https://bigdataprogrammers.com/load-data-from-mysql-in-spark-using-jdbc/ to load data from my RDS MySQL instance.

I've uploaded both the connector jar (mysql-connector-java-5.1.49-bin.jar) and the script to the /home/hadoop folder.
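
For reference, test01.scala follows the tutorial and defines a ReadDataFromJdbc object; a simplified sketch of what it does is below, with the connection details as placeholders rather than the tutorial's exact code:

    import org.apache.spark.sql.SparkSession

    object ReadDataFromJdbc {
      def main(args: Array[String]): Unit = {
        val table = args(0) // e.g. "batches"
        println("Started......." + new java.util.Date())
        // In spark-shell this returns the already-running session.
        val spark = SparkSession.builder().getOrCreate()
        try {
          // Read the table over JDBC; URL, user and password are placeholders.
          val df = spark.read.format("jdbc")
            .option("url", "jdbc:mysql://<RDS_ENDPOINT>:3306/<DATABASE>")
            .option("driver", "com.mysql.jdbc.Driver")
            .option("dbtable", table)
            .option("user", "<USER>")
            .option("password", "<PASSWORD>")
            .load()
          df.show()
        } catch {
          case e: Exception => println(("Connectivity Failed for Table ", e))
        }
      }
    }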

Then I follow the steps described in the tutorial and I'm getting two errors:

[hadoop@ip-172-31-* ~]$ spark-shell 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/10/09 16:41:31 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://ip-172-31-*.ec2.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1602254033216_0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6-amzn-0
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :require /home/hadoop/mysql-connector-java-5.1.49-bin.jar
Added '/home/hadoop/mysql-connector-java-5.1.49-bin.jar' to classpath.

scala> :load /home/hadoop/test01.scala
Loading /home/hadoop/test01.scala...
import java.sql.{Connection, DriverManager, ResultSet}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
error: error while loading package, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/execution/package.class)' has location not matching its contents: contains package object execution
error: error while loading QueryExecution, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/execution/QueryExecution.class)' has location not matching its contents: contains class QueryExecution
error: error while loading package, class file '/usr/lib/spark/jars/spark-catalyst_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/catalyst/plans/package.class)' has location not matching its contents: contains package object plans
error: error while loading LogicalPlan, class file '/usr/lib/spark/jars/spark-catalyst_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.class)' has location not matching its contents: contains class LogicalPlan
error: error while loading package, class file '/usr/lib/spark/jars/spark-catalyst_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/catalyst/encoders/package.class)' has location not matching its contents: contains package object encoders
error: error while loading ExpressionEncoder, class file '/usr/lib/spark/jars/spark-catalyst_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.class)' has location not matching its contents: contains class ExpressionEncoder
error: error while loading Expression, class file '/usr/lib/spark/jars/spark-catalyst_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/catalyst/expressions/Expression.class)' has location not matching its contents: contains class Expression
error: error while loading NamedExpression, class file '/usr/lib/spark/jars/spark-catalyst_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/catalyst/expressions/NamedExpression.class)' has location not matching its contents: contains class NamedExpression
error: error while loading DataFrameNaFunctions, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/DataFrameNaFunctions.class)' has location not matching its contents: contains class DataFrameNaFunctions
error: error while loading DataFrameStatFunctions, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/DataFrameStatFunctions.class)' has location not matching its contents: contains class DataFrameStatFunctions
error: error while loading TypedColumn, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/TypedColumn.class)' has location not matching its contents: contains class TypedColumn
error: error while loading package, class file '/usr/lib/spark/jars/spark-core_2.11-2.4.6-amzn-0.jar(org/apache/spark/api/java/function/package.class)' has location not matching its contents: contains package object function
error: error while loading ReduceFunction, class file '/usr/lib/spark/jars/spark-core_2.11-2.4.6-amzn-0.jar(org/apache/spark/api/java/function/ReduceFunction.class)' has location not matching its contents: contains class ReduceFunction
error: error while loading KeyValueGroupedDataset, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/KeyValueGroupedDataset.class)' has location not matching its contents: contains class KeyValueGroupedDataset
error: error while loading MapFunction, class file '/usr/lib/spark/jars/spark-core_2.11-2.4.6-amzn-0.jar(org/apache/spark/api/java/function/MapFunction.class)' has location not matching its contents: contains class MapFunction
error: error while loading Metadata, class file '/usr/lib/spark/jars/spark-catalyst_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/types/Metadata.class)' has location not matching its contents: contains class Metadata
error: error while loading FilterFunction, class file '/usr/lib/spark/jars/spark-core_2.11-2.4.6-amzn-0.jar(org/apache/spark/api/java/function/FilterFunction.class)' has location not matching its contents: contains class FilterFunction
error: error while loading MapPartitionsFunction, class file '/usr/lib/spark/jars/spark-core_2.11-2.4.6-amzn-0.jar(org/apache/spark/api/java/function/MapPartitionsFunction.class)' has location not matching its contents: contains class MapPartitionsFunction
error: error while loading FlatMapFunction, class file '/usr/lib/spark/jars/spark-core_2.11-2.4.6-amzn-0.jar(org/apache/spark/api/java/function/FlatMapFunction.class)' has location not matching its contents: contains class FlatMapFunction
error: error while loading ForeachFunction, class file '/usr/lib/spark/jars/spark-core_2.11-2.4.6-amzn-0.jar(org/apache/spark/api/java/function/ForeachFunction.class)' has location not matching its contents: contains class ForeachFunction
error: error while loading ForeachPartitionFunction, class file '/usr/lib/spark/jars/spark-core_2.11-2.4.6-amzn-0.jar(org/apache/spark/api/java/function/ForeachPartitionFunction.class)' has location not matching its contents: contains class ForeachPartitionFunction
error: error while loading StorageLevel, class file '/usr/lib/spark/jars/spark-core_2.11-2.4.6-amzn-0.jar(org/apache/spark/storage/StorageLevel.class)' has location not matching its contents: contains class StorageLevel
error: error while loading CreateViewCommand, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/execution/command/CreateViewCommand.class)' has location not matching its contents: contains class CreateViewCommand
error: error while loading DataFrameWriter, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/DataFrameWriter.class)' has location not matching its contents: contains class DataFrameWriter
error: error while loading DataStreamWriter, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/streaming/DataStreamWriter.class)' has location not matching its contents: contains class DataStreamWriter
error: error while loading SparkPlan, class file '/usr/lib/spark/jars/spark-sql_2.11-2.4.6-amzn-0.jar(org/apache/spark/sql/execution/SparkPlan.class)' has location not matching its contents: contains class SparkPlan

scala> :load /home/hadoop/test01.scala
Loading /home/hadoop/test01.scala...
import java.sql.{Connection, DriverManager, ResultSet}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
defined object ReadDataFromJdbc

scala> ReadDataFromJdbc.main(Array("batches"))
Started.......Fri Oct 09 16:42:02 UTC 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
[Stage 0:>                                                          (0 + 1) / 1]20/10/09 16:42:04 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-172-31-20-13.ec2.internal, executor 1): java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:111)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:55)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at java.lang.ClassLoader.findClass(ClassLoader.java:523)
    at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:106)
    ... 25 more

[Stage 0:>                                                          (0 + 0) / 1]20/10/09 16:42:05 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
(Connectivity Failed for Table ,org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-27-165.ec2.internal, executor 2): java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:111)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:55)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
    at java.lang.ClassLoader.findClass(ClassLoader.java:523)
    at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:106)
    ... 25 more

Driver stacktrace:)
  • The first error appears when I load the Scala script: it loads with a batch of errors, but repeating the same :load command seems to make them go away.
  • The second error appears once I request data to be loaded from MySQL: even though the MySQL JDBC connector was added to the classpath with :require earlier, the job fails with java.lang.ClassNotFoundException: com.mysql.jdbc.Driver.

While I believe I can find some directory accessible to Spark where the JDBC jar could be placed, I'm super-confused by the errors appearing when the script is loaded - why do they appear, and how can they be fixed?


Solution

  • I ended up creating a bootstrap action for the cluster that copies the mysql-connector-java jar to all cluster nodes before Spark and Hadoop are even installed.

    First, create the copymysqljar.sh script:

    #!/bin/bash
    # Make sure the target directories exist (the bootstrap action runs before
    # Spark and Hadoop are fully installed on the node).
    sudo mkdir -p /home/hadoop
    sudo mkdir -p /usr/lib/spark/jars
    sudo mkdir -p /usr/lib/hadoop/lib
    # Fetch the connector from S3 and make it world-readable.
    aws s3 cp s3://<YOUR_BUCKET>/mysql-connector-java-5.1.49-bin.jar /home/hadoop
    chmod 777 /home/hadoop/mysql-connector-java-5.1.49-bin.jar
    # Put the jar on both the Spark and the Hadoop classpath.
    sudo cp /home/hadoop/mysql-connector-java-5.1.49-bin.jar /usr/lib/spark/jars
    sudo cp /home/hadoop/mysql-connector-java-5.1.49-bin.jar /usr/lib/hadoop/lib
    
    1. Save copymysqljar.sh to the S3 bucket identified by s3://<YOUR_BUCKET>.
    2. Proceed to cluster creation in AWS via 'Create cluster' - 'Advanced configuration'.
    3. During the advanced configuration, on step 4, create a custom bootstrap action with s3://<YOUR_BUCKET>/copymysqljar.sh as the script.
    4. Start cluster creation.

    Alternatively, instead of steps 2-4 you can do the same with the AWS command-line tools.

    See the official docs on bootstrap actions: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html#CustomBootstrapCopyS3Object

    In general, this script takes care of everything for AWS EMR 5.31 with Hadoop, Spark and Zeppelin. The jar might need to be copied to additional directories if other tools should connect to MySQL too.
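
    Once the cluster is up, a quick smoke check from spark-shell can confirm that the driver class now resolves on both the driver and the executors (this check is only illustrative, it is not part of the bootstrap setup):

    // Illustrative check: with the connector jar in /usr/lib/spark/jars on every
    // node, the MySQL driver class should load on the driver and on the executors.
    Class.forName("com.mysql.jdbc.Driver")                       // driver side
    sc.parallelize(1 to 2)
      .map(_ => Class.forName("com.mysql.jdbc.Driver").getName)  // executor side
      .collect()
      .foreach(println)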