
s3-dist-cp fails with OutOfMemoryError when I upgrade from EMR 5.7 to EMR 5.8


I have been using s3-dist-cp to move compressed JSON files from S3 to HDFS as part of a bigger job. I started with EMR 5.4 and have upgraded through most 5.x releases; I currently run a 32-machine cluster on EMR 5.7 with no problems.

When I attempted to upgrade to EMR 5.8, the s3-dist-cp job failed as shown below. Did anything change between 5.7 and 5.8 that would cause this?

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p
kill -9 %p"
#   Executing /bin/sh -c "kill -9 11042
kill -9 11042"...
/usr/share/aws/emr/s3-dist-cp/bin/s3-dist-cp: line 55: 11042 Killed                  hadoop jar "$S3_DIST_CP_JAR" -libjars "$LIBJARS" "$@"
Traceback (most recent call last):
  ...

Solution

  • It might be too late, but yes: there was a bug in s3-dist-cp that causes jobs to fail on emr-5.8.0 that would otherwise work on emr-5.7.0. The bug probably causes the OOM in the S3DistCp client, because the client consumes more memory while listing S3 objects, before the MapReduce job is actually submitted. It was fixed in emr-5.9.0.
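
As a rough illustration of the version range described above, here is a minimal Python sketch that checks whether a release label falls between the broken release (emr-5.8.0) and the fixed one (emr-5.9.0). The helper names parse_release_label and is_affected are hypothetical, and treating every 5.8.x label as affected is an assumption; the answer only names emr-5.8.0 explicitly.

```python
import re

# Range taken from the answer above: broken in emr-5.8.0, fixed in emr-5.9.0.
# Assumption: any label between the two is treated as affected.
FIRST_AFFECTED = (5, 8, 0)
FIRST_FIXED = (5, 9, 0)


def parse_release_label(label):
    """Turn a label such as 'emr-5.8.0' into a tuple like (5, 8, 0)."""
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label.strip())
    if match is None:
        raise ValueError(f"unrecognized EMR release label: {label!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(label):
    """Return True if the release falls in the range the answer describes."""
    return FIRST_AFFECTED <= parse_release_label(label) < FIRST_FIXED


if __name__ == "__main__":
    for label in ("emr-5.7.0", "emr-5.8.0", "emr-5.9.0"):
        status = "affected" if is_affected(label) else "not affected"
        print(f"{label}: {status}")
```

A check like this could run before a cluster is launched, so that a job depending on s3-dist-cp is not scheduled on a release in the affected range.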