Tags: apache-spark, pyspark, jdbc

Issue reading a PySpark DataFrame using JDBC jars in Spark Standalone


I am using a PySpark standalone application to read data from multiple RDBMS sources such as Oracle and SQL Server. The third-party JDBC jars are loaded at runtime. The code below works if I add only one jar, but fails when I add multiple jars.

PySpark version: 3.0.2. Sample code:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# Multiple driver jars, comma-separated
conf.set("spark.jars", "path/to/mssql-jdbc-12.2.0.jre8.jar,path/to/snowflake-jdbc-3.13.9.jar")
spark = SparkSession.builder.appName("TEST").config(conf=conf).getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", driver) \
    .load()

Error message:

py4j.protocol.Py4JJavaError: An error occurred while calling o39.load. : java.lang.NoClassDefFoundError: scala/collection/IterableOnce

I have configured the jars in the system environment variables.

I have also tried all the options described in the link below: https://sparkbyexamples.com/spark/add-multiple-jars-to-spark-submit-classpath/
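For reference, a minimal sketch of the two common ways to supply multiple jars to a standalone PySpark session (the paths and Maven coordinates below are illustrative placeholders, not the asker's exact ones):

from pyspark.sql import SparkSession

# Option 1: comma-separated local jar paths
spark = (
    SparkSession.builder
    .appName("multi-jar-example")
    .config("spark.jars",
            "/path/to/mssql-jdbc-12.2.0.jre8.jar,/path/to/snowflake-jdbc-3.13.9.jar")
    .getOrCreate()
)

# Option 2: let Spark resolve Maven coordinates at startup instead of shipping local jars
# spark = (
#     SparkSession.builder
#     .appName("multi-jar-example")
#     .config("spark.jars.packages",
#             "com.microsoft.sqlserver:mssql-jdbc:12.2.0.jre8,net.snowflake:snowflake-jdbc:3.13.9")
#     .getOrCreate()
# )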


Solution

  • Found the solution. The issue occurred because one of the four imported jars was not compatible with the Spark version.
    Spark version used: 3.0.2
    Jar that caused the issue: spark-snowflake_2.12-2.11.2-spark_3.2
    Corrected jar: spark-snowflake_2.12-2.10.0-spark_3.2.jar

    One incorrect jar version had caused the issue, and the error message was ambiguous. After correcting it, the issue is now fixed.
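A NoClassDefFoundError like this generally means a jar on the classpath was built against a different Spark/Scala binary version than the one actually running. As a debugging aid (a sketch, not part of the original post), the running Spark and Scala versions can be checked from PySpark before choosing connector jars; note that the _jvm call below goes through Spark internals via Py4J:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()

# Spark version the session is actually running (e.g. "3.0.2")
print(spark.version)

# Scala version of the underlying JVM libraries (e.g. "version 2.12.10");
# connector jars such as spark-snowflake_2.12-... must match this Scala binary version
print(spark.sparkContext._jvm.scala.util.Properties.versionString())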