I'm trying to run GeoSpark on an AWS EMR cluster. The code is:
# coding=utf-8
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
from geospark.register import GeoSparkRegistrator
from geospark.utils import GeoSparkKryoRegistrator
from geospark.register import upload_jars
import config as cf
import yaml

if __name__ == "__main__":
    # Read files
    with open("/tmp/param.yml", 'r') as ymlfile:
        param = yaml.load(ymlfile, Loader=yaml.SafeLoader)

    # Register jars
    upload_jars()

    # Creation of spark session
    print("Creating Spark session")
    spark = SparkSession \
        .builder \
        .getOrCreate()

    GeoSparkRegistrator.registerAll(spark)
I get the following error from the upload_jars() function:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 143, in init
    py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "geo_processing.py", line 21, in <module>
    upload_jars()
  File "/usr/local/lib/python3.7/site-packages/geospark/register/uploading.py", line 39, in upload_jars
    findspark.init()
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 146, in init
    "Unable to find py4j, your SPARK_HOME may not be configured correctly"
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
How can I solve this error?
You should remove upload_jars() from your code and instead load the jars in an alternative way: either copy them to SPARK_HOME (/usr/lib/spark as of emr-4.0.0) as part of an EMR bootstrap action, or pass them to your spark-submit command with the --jars option.
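For reference, here is a minimal sketch of the revised script with upload_jars() removed; the jar names and paths in the comment are illustrative, not the exact GeoSpark artifacts:

# coding=utf-8
from pyspark.sql import SparkSession
from geospark.register import GeoSparkRegistrator
import yaml

if __name__ == "__main__":
    with open("/tmp/param.yml", 'r') as ymlfile:
        param = yaml.load(ymlfile, Loader=yaml.SafeLoader)

    # No upload_jars() call: the GeoSpark jars are expected to be on the
    # classpath already, e.g. copied into /usr/lib/spark/jars by a bootstrap
    # action, or passed to spark-submit (illustrative paths):
    #   spark-submit --jars /tmp/geospark.jar,/tmp/geospark-sql.jar geo_processing.py
    spark = SparkSession \
        .builder \
        .getOrCreate()
    GeoSparkRegistrator.registerAll(spark)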
I haven't been able to get the upload_jars() function to work on a multi-node EMR cluster. According to the geospark documentation, upload_jars():
uses findspark Python package to upload jar files to executor and nodes. To avoid copying all the time, jar files can be put in directory SPARK_HOME/jars or any other path specified in Spark config files.
Spark is installed in YARN mode on EMR, meaning it is only installed on the master node and not on the core/task nodes. So findspark won't find Spark on the core/task nodes, and you get the error "Unable to find py4j, your SPARK_HOME may not be configured correctly".
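For context, here is a rough sketch of the check findspark.init() performs, inferred from the traceback above (the fallback path is an assumption):

import os
from glob import glob

# findspark looks for the py4j zip under SPARK_HOME/python/lib (see line 143
# of findspark.py in the traceback). On a core/task node without a Spark
# installation the glob matches nothing, indexing [0] raises IndexError, and
# findspark re-raises it as the "Unable to find py4j" exception.
spark_home = os.environ.get("SPARK_HOME", "/usr/lib/spark")  # /usr/lib/spark is the EMR default
matches = glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))
print(matches)  # [] on a node where Spark is not installed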