The issue is simple. I am using the Hadoop gcs-connector (https://github.com/GoogleCloudDataproc/hadoop-connectors) to write data to Google Cloud Storage from a MapReduce job running on an EMR cluster (AWS). My application was previously writing data to S3 and working fine. Now I have added the gcs-connector and am writing the same data to Google Cloud Storage, but I am getting the following exception:
Error: java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1072)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.close(GoogleHadoopOutputStream.java:119)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:73)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:102)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
at org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat$LazyRecordWriter.close(LazyOutputFormat.java:122)
Caused by: java.lang.OutOfMemoryError: Java heap space
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader.buildContentChunk(MediaHttpUploader.java:579)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:380)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:308)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:528)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:85)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Now I have these questions:
1. Scaling up the cluster is not an option for me. Besides, I think my cluster is sufficient, because the same job was working with S3. What is consuming more memory here?
2. What can I do to reduce the memory consumption of GCS uploads? Are there any configurations for this? I could not find any.
It turns out that we can configure several properties to control how files get uploaded to GCS. It boils down to reducing the buffer size used during uploads and the number of concurrent upload requests. I used the following configuration in my application and it worked; I am no longer getting Java heap space errors:
conf.set("fs.gs.outputstream.upload.buffer.size", "262144");      // 256 KiB write buffer
conf.set("fs.gs.outputstream.upload.chunk.size", "1048576");      // 1 MiB per upload chunk
conf.set("fs.gs.outputstream.upload.max.active.requests", "4");   // at most 4 in-flight requests
Here conf is an instance of org.apache.hadoop.conf.Configuration. You can experiment with different values to find what suits your application. Cheers!
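To see why these settings help, here is a back-of-the-envelope sketch of per-stream upload memory. My (rough, unverified) reading of the stack trace is that each open GCS output stream holds its write buffer plus up to max.active.requests chunk-sized content blocks built by MediaHttpUploader.buildContentChunk; this model and the helper method below are my own illustration, not the connector's exact accounting:

```java
public class GcsUploadMemory {
    // Rough upper bound on bytes buffered in memory per open GCS output
    // stream: the stream's write buffer plus up to maxActiveRequests
    // in-flight chunks of chunkSize bytes each (an assumed model).
    static long perStreamBytes(long bufferSize, long chunkSize, long maxActiveRequests) {
        return bufferSize + chunkSize * maxActiveRequests;
    }

    public static void main(String[] args) {
        // Values from the answer above: 256 KiB buffer, 1 MiB chunks, 4 requests.
        long tuned = perStreamBytes(262_144L, 1_048_576L, 4L);
        // 262144 + 4 * 1048576 = 4456448 bytes, i.e. 4.25 MiB per stream
        System.out.println("per-stream upper bound: " + tuned + " bytes");
    }
}
```

Multiply that per-stream figure by the number of output files each task writes concurrently to estimate the heap pressure from uploads alone; with much larger chunk sizes the same arithmetic quickly exceeds a typical mapper's heap.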