I'm new to spark, and my understanding is this: each library that I install that internally uses spark (or pyspark) has its own jar files.

Questions: I know of spark.driver.extraClassPath and spark.executor.extraClassPath, but I guess these are outdated parameters. What is the current way to specify the location of these jar files? I understand I might not be making sense here; what I have mentioned above is partly just my hunch that this is how it must be happening.

So, can you please help me understand this whole business with jars, and how to find and specify them?
Each library that I install that internally uses spark (or pyspark) has its own jar files
Can you tell which library you are trying to install?
Yes, external libraries can have jars even if you are writing code in python.
Why?
These libraries must be using some UDFs (User Defined Functions). Spark runs code in the Java runtime (the JVM). If these UDFs were written in python, there would be a lot of serialization and deserialization overhead, because the data has to be converted into something readable by python and then back again. Java and Scala UDFs are usually faster, which is why some libraries ship with jars.
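To make that overhead concrete, here is a minimal sketch (the session name and column are just for illustration) comparing a python UDF, where every row takes a round trip through a separate python worker process, with a built-in expression that runs entirely inside the JVM:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("UdfOverheadDemo").getOrCreate()
df = spark.range(1_000_000)

# Python UDF: rows are serialized out to a python worker, doubled there,
# then deserialized back into the JVM. That round trip is the overhead.
double_py = F.udf(lambda x: x * 2, LongType())
df.select(double_py("id").alias("doubled")).count()

# Built-in expression: optimized and executed entirely inside the JVM,
# no python round trip at all.
df.select((F.col("id") * 2).alias("doubled")).count()

Both produce the same result, but on large data the UDF version is typically much slower; that serialization cost is exactly what a jar-backed implementation avoids.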
Why could it not have sufficed to have all the code in python?
Same reason: scala/java UDFs are faster than python UDFs.
What is the current way to specify the location of these jar files?
You can use the spark.jars.packages property. You give it a comma-separated list of Maven coordinates; Spark downloads the jars (and their dependencies) and puts them on the classpath of both the driver and the executors.
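If you already have the jar files on disk instead, the spark.jars property takes a comma-separated list of local paths and adds them to both classpaths as well; on the command line, the equivalents are spark-submit --packages and spark-submit --jars. A minimal sketch with hypothetical paths:

from pyspark.sql import SparkSession

# spark.jars adds jar files that already exist locally (hypothetical
# paths below) to the driver and executor classpaths, while
# spark.jars.packages (used in the snippet further down) fetches
# Maven coordinates for you.
spark = (
    SparkSession.builder.appName("LocalJarsDemo")
    .config("spark.jars", "/path/to/first.jar,/path/to/second.jar")
    .getOrCreate()
)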
Where do I find these jars for each library that I install? For example, synapseml. What is the general idea about where the jar files for a package are located?
https://github.com/microsoft/SynapseML#python
They have mentioned there which package is required, i.e. com.microsoft.azure:synapseml_2.12:0.9.4. That is a Maven coordinate in the form groupId:artifactId:version (the _2.12 suffix is the Scala version the jar was built against), and it is exactly what goes into spark.jars.packages:
import pyspark

# The coordinate tells Spark which jar to fetch; the extra repository is
# where SynapseML publishes its artifacts.
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .getOrCreate()

# Once the jar is on the classpath, the python wrapper imports cleanly.
import synapse.ml
Can you try the above snippet?
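If the session starts without errors, a quick sanity check (assuming the spark session built above) is to confirm that the coordinate landed in the configuration and that the python package is importable:

# Sanity checks, assuming the `spark` session from the snippet above.
print(spark.conf.get("spark.jars.packages"))   # the coordinate we set

import synapse.ml                              # should import without error
print(synapse.ml.__name__)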