Tags: amazon-web-services, pyspark, aws-lambda, amazon-emr, livy

Module error on AWS EMR when running PySpark code through Apache Livy via a Lambda function


I am running PySpark code on an AWS EMR cluster. I submit the application to Livy, along with its properties, from a Lambda function:

import requests
import json

def lambda_handler(event, context):
    # Public DNS name of the EMR master node, passed in through the event
    master_dns = event.get('clusterDetails', {}).get('Cluster', {}).get('MasterPublicDnsName')

    headers = {"content-type": "application/json"}

    # Livy's REST endpoint for batch sessions listens on port 8998
    url = "http://" + master_dns + ":8998/batches"
    print(url)

    # The PySpark script to run and its input/output S3 paths
    payload = {
        "file": "s3://dtrack-test/epay/usap/USAPPIDBAL/scripts/spark_wc.py",
        "args": ["s3://dtrack-test/epay/usap/USAPPIDBAL/raw_data/sample-test.txt",
                 "s3://dtrack-test/epay/usap/USAPPIDBAL/sample-op/"]
    }

    # Submit the batch to Livy and return its JSON response
    res = requests.post(url, data=json.dumps(payload), headers=headers, verify=False)
    json_data = json.loads(res.text)
    return json_data
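
For reference, the batch submitted above can be checked with Livy's GET /batches/{batchId} endpoint. Below is a minimal sketch, assuming master_dns and headers are the same values as in lambda_handler and batch_id is the "id" field of the JSON returned by the POST; the helper name wait_for_batch and the 10-second poll interval are just illustrative:

import time

def wait_for_batch(master_dns, batch_id, headers, poll_seconds=10):
    # Poll the batch until Livy reports a terminal state
    status_url = "http://" + master_dns + ":8998/batches/" + str(batch_id)
    while True:
        state = requests.get(status_url, headers=headers).json().get("state")
        print("batch", batch_id, "state:", state)
        if state in ("success", "dead", "killed"):
            return state
        time.sleep(poll_seconds)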

but it fails with the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 49, ip-172-31-16-64.ap-south-1.compute.internal, executor 1): org.apache.spark.SparkException:
Error from python worker:
  /usr/bin/python3: Error while finding module specification for 'pyspark.daemon' (ModuleNotFoundError: No module named 'pyspark')
PYTHONPATH was:
  /mnt/yarn/usercache/livy/filecache/10/__spark_libs__1402648699103959205.zip/spark-core_2.11-2.4.5-amzn-0.jar
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout

Solution

  • I had set the configuration livy.master to local; when I removed this configuration, everything worked properly. As the error above shows, the executors' PYTHONPATH did not include the PySpark libraries; removing the setting lets Livy use the cluster's default master, which presumably restores the PYTHONPATH that Spark normally sets up on YARN. An explicit alternative is sketched below.
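
If you prefer not to rely on the Livy defaults, the batch payload also accepts a "conf" map of Spark properties, so the master can be pinned explicitly in the request itself. A minimal sketch of the adjusted payload, using the same S3 paths as above (pinning to yarn is an assumption about the intended cluster manager):

payload = {
    "file": "s3://dtrack-test/epay/usap/USAPPIDBAL/scripts/spark_wc.py",
    "args": ["s3://dtrack-test/epay/usap/USAPPIDBAL/raw_data/sample-test.txt",
             "s3://dtrack-test/epay/usap/USAPPIDBAL/sample-op/"],
    # Spark configuration properties passed per batch instead of via livy.conf
    "conf": {"spark.master": "yarn"}
}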