hadoop amazon-web-services amazon-s3 mapreduce elastic-map-reduce

"Unable to verify integrity of data" while running MR job

I'm running a relatively large MapReduce job on Amazon Elastic MapReduce (EMR).

I ran the job plenty of times on small data sets with no problem.

But when I run it on a large dataset, I get the following exception:

Error: com.amazonaws.AmazonClientException: Unable to verify integrity of data download. Client calculated content length didn't match content length received from Amazon S3. The data may be corrupt.

I googled it, and the only recommendation I found was to set the following system property:

System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation","true");

That didn't help at all.

I'm using a replication factor of 3, with 11 m1.large data nodes and 1 m1.medium master node.

Is there a workaround or a known fix for this issue?

Solution

Apparently this is a known bug, or so I've been told by an Amazon employee here.

It occurs on large datasets when an individual S3 object is bigger than 2GB.

I managed to work around it by moving to Hadoop 2.4.0 on AMI version 3.1.0.
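
To check whether a job's input falls into the affected range, you can compare each object's size against the 2GB threshold before launching. The following is only a minimal sketch: the class LargeObjectCheck, its methods, and the sample keys are hypothetical, and it assumes "2GB" means 2 GiB (2^31 bytes), which the answer above does not specify. The sizes are hard-coded here; in practice you would fill the map from an S3 listing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Hadoop or the AWS SDK.
class LargeObjectCheck {
    // Assumption: "2GB" is read as 2 GiB (2^31 bytes).
    static final long LIMIT_BYTES = 2L * 1024 * 1024 * 1024;

    // True if a single object is larger than the assumed threshold.
    static boolean exceedsLimit(long sizeBytes) {
        return sizeBytes > LIMIT_BYTES;
    }

    // Returns the keys whose size is above the threshold, in input order.
    static List<String> oversizedKeys(Map<String, Long> sizesByKey) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : sizesByKey.entrySet()) {
            if (exceedsLimit(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample keys and sizes.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("input/part-00000", 512L * 1024 * 1024);      // 512 MiB
        sizes.put("input/part-00001", 3L * 1024 * 1024 * 1024); // 3 GiB
        System.out.println(oversizedKeys(sizes)); // prints [input/part-00001]
    }
}
```

If any keys are reported and you can't upgrade, splitting those inputs into smaller objects might avoid the error, though I haven't verified that.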