python, apache-spark, emr

AWS EMR Spark Python Logging


I'm running a very simple Spark job on AWS EMR and can't seem to get any log output from my script.

I've tried printing to stderr:

from pyspark import SparkContext
import sys

if __name__ == '__main__':
    sc = SparkContext(appName="HelloWorld")
    print('Hello, world!', file=sys.stderr)
    sc.stop()

And I've tried using Spark's log4j logger:

from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="HelloWorld")

    log4jLogger = sc._jvm.org.apache.log4j
    logger = log4jLogger.LogManager.getLogger(__name__)
    logger.error('Hello, world!')

    sc.stop()

EMR gives me two log files after the job runs: controller and stderr. Neither log contains the "Hello, world!" string. It's my understanding that stdout is redirected to stderr in Spark. The stderr log shows that the job is accepted, run, and completed successfully.

So my question is, where can I view my script's log output? Or what should I change in my script to log correctly?

Edit: I used this command to submit the step:

aws emr add-steps --region us-west-2 --cluster-id x-XXXXXXXXXXXXX --steps Type=spark,Name=HelloWorld,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,s3a://path/to/simplejob.py],ActionOnFailure=CONTINUE

Solution

  • I've found that the log output from a particular step almost never winds up in the controller or stderr logs that get pulled alongside the step in the AWS console.

    Usually I find what I want in the job's container logs (and usually it's in stdout).

    These are typically at a path like s3://mybucket/logs/emr/spark/j-XXXXXX/containers/application_XXXXXXXXX/container_XXXXXXX/.... You might need to poke around within the various application_... and container_... directories under containers.

    That last container directory should have a stdout.log and stderr.log.
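
    For what it's worth, here's a minimal sketch of how you might browse and pull down those container logs with the AWS CLI. The bucket, log prefix, and application/container IDs below are placeholders; the actual prefix depends on the LogUri configured when the cluster was created, and it can take a few minutes after a step finishes before EMR pushes the logs to S3.

    # List everything EMR has pushed under this cluster's containers/ prefix (placeholder path)
    aws s3 ls --recursive s3://mybucket/logs/emr/spark/j-XXXXXX/containers/

    # Pull one container's directory down locally and look for stdout/stderr
    # (the files may be gzipped, e.g. stdout.gz, depending on how EMR pushed them)
    aws s3 cp --recursive \
        s3://mybucket/logs/emr/spark/j-XXXXXX/containers/application_XXXXXXXXX/container_XXXXXXX/ \
        ./container-logs/

    # Alternatively, while the cluster is still up, the aggregated YARN logs can be
    # fetched from the master node with: yarn logs -applicationId application_XXXXXXXXX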