Tags: google-cloud-storage, amazon-emr, google-cloud-dataproc

Hadoop gcs-connector throws Java heap space error


The issue is simple. I am using the Hadoop gcs-connector (https://github.com/GoogleCloudDataproc/hadoop-connectors) to write data to Google Cloud Storage from a MapReduce job running in an EMR cluster (AWS). My application was previously writing this data to S3 and working fine. Now I have added the gcs-connector and am writing the same data to Google Cloud Storage, but I am getting the following exception:

Error: java.lang.IllegalArgumentException: Self-suppression not permitted
    at java.lang.Throwable.addSuppressed(Throwable.java:1072)
    at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.close(GoogleHadoopOutputStream.java:119)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:73)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:102)
    at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
    at org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat$LazyRecordWriter.close(LazyOutputFormat.java:122) 
Caused by: java.lang.OutOfMemoryError: Java heap space
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader.buildContentChunk(MediaHttpUploader.java:579)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:380)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:308)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:528)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:85)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Now I have these questions:

Scaling up the cluster is not an option for me. Besides, I think my cluster is sufficient, because the same job was working fine with S3. What is consuming more memory here?

What can I do to reduce the memory consumption of GCS uploads? Are there any configurations for this? I could not find any.


Solution

  • It turns out that we can configure various properties to control how files get uploaded to GCS. It boils down to reducing the buffer size used during uploads and the number of concurrent upload requests. I used the following configuration for my application and it worked; I am not getting Java heap space exceptions anymore.

            conf.set("fs.gs.outputstream.upload.buffer.size", "262144");
            conf.set("fs.gs.outputstream.upload.chunk.size", "1048576");
            conf.set("fs.gs.outputstream.upload.max.active.requests", "4");
    

    Here conf is an instance of org.apache.hadoop.conf.Configuration (see the driver sketch below for where this fits in a job). You can try different values that suit your application. Cheers!
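
    To illustrate where these settings go, here is a minimal, hypothetical map-only driver. The class name, the pass-through mapper, and the input/output paths are placeholders and not the original application; the point is simply that the three properties are set on the Configuration before the Job is created, so they become part of the job configuration that every task sees.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class GcsWriteJob {

            // Pass-through mapper just to make the sketch complete; the real job logic does not matter here.
            public static class PassThroughMapper
                    extends Mapper<LongWritable, Text, LongWritable, Text> {
                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws java.io.IOException, InterruptedException {
                    context.write(key, value);
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                // Cap the gcs-connector's per-stream upload memory before creating the job,
                // so the settings are picked up by the output streams in the tasks.
                conf.set("fs.gs.outputstream.upload.buffer.size", "262144");    // 256 KiB
                conf.set("fs.gs.outputstream.upload.chunk.size", "1048576");    // 1 MiB
                conf.set("fs.gs.outputstream.upload.max.active.requests", "4");

                Job job = Job.getInstance(conf, "write-to-gcs");
                job.setJarByClass(GcsWriteJob.class);
                job.setMapperClass(PassThroughMapper.class);
                job.setNumReduceTasks(0);                 // map-only, for brevity
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an s3:// input path
                FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. a gs:// output path
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    Only the three conf.set(...) calls matter for the heap error; the rest is a generic MapReduce skeleton. Smaller chunks and fewer active upload requests should keep the memory each open output stream holds for its resumable upload (the MediaHttpUploader.buildContentChunk frame in the stack trace) much lower.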