I'm running a relatively big MR job using Amazon Elastic Map Reduce.
I ran the job plenty of times on small data sets with no problem.
But when trying to run it on a large dataset I'm getting the following exception:
Error: com.amazonaws.AmazonClientException: Unable to verify integrity of data download. Client calculated content length didn't match content length received from Amazon S3. The data may be corrupt.
I googled it and the only recommendation I got was to set the following:
System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation","true");
That didn't help at all.
I'm using replication 3, 11 M1Large datanodes and 1 M1Medium master node.
Any workaround or known fix for this issue?
Apparently, this is a known bug. Or so I've been told by an Amazon employee here.
It occurs when running on large datasets where an S3 object is bigger than 2GB.
I managed to work around it by moving to Hadoop 2.4.0 and AMI 3.1.0.