I have code like
ParquetWriter<Record> writer = getParquetWriter("s3a://my_bucket/my_object_path.snappy.parquet");
for (Record r : someIterable) {
    validate(r);
    writer.write(r);
}
writer.close();
If validate throws an exception, I want to release all resources associated with the writer, but I don't want to create any objects in S3 in that case. Is this achievable?
If I close the writer, it will complete the S3 multipart upload and create an object in the cloud. If I don't close it, the parts written so far will remain in the disk buffer, clogging up the works.
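To make the dilemma concrete, here is the try-with-resources version I would normally write (getParquetWriter, Record, validate and someIterable are the same placeholders as above); even on the failure path the implicit close() still completes the multipart upload:

try (ParquetWriter<Record> writer =
         getParquetWriter("s3a://my_bucket/my_object_path.snappy.parquet")) {
    for (Record r : someIterable) {
        validate(r);      // may throw
        writer.write(r);
    }
}
// close() runs here even if validate() threw, completing the multipart upload
// and creating the S3 object; skipping close() instead leaves the buffered
// parts on local disk.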
Yes, it is a problem. It's been discussed in HADOOP-16906, "Add some Abortable.abort() interface for streams etc which can be terminated".
The problem here is that it's not enough to add it to the S3ABlockOutputStream class; we'd need to pass it through FSDataOutputStream etc., specify it in the FS APIs, define the semantics when the passthrough doesn't work, and commit to maintaining it. That's a lot of effort. If you do want to do that, though, patches are welcome...
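Purely to illustrate why, here is a sketch of the kind of interface HADOOP-16906 is proposing; the Abortable name and the abort() method below are hypothetical placeholders, not something in a current release:

import java.io.IOException;

// Hypothetical sketch only; this interface does not exist in current releases.
public interface Abortable {
    // Discard any buffered data and abort the in-progress multipart upload,
    // so that no object is ever created at the destination path.
    void abort() throws IOException;
}

// S3ABlockOutputStream could implement this directly, but applications hold an
// FSDataOutputStream (or a ParquetWriter wrapping one), so the call would have
// to be passed through each wrapper layer, specified in the FileSystem contract,
// and given defined semantics for streams that are not Abortable.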
Keep an eye on HDFS-13934, the multipart upload API. This will let you do the upload and then commit/abort it, though it doesn't quite fit your workflow.
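To show the shape of that model (the type and method names below are hypothetical placeholders for illustration; the real interface is still being defined under that JIRA):

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import org.apache.hadoop.fs.Path;

// Hypothetical placeholder API; the real one under HDFS-13934 may differ.
interface MultipartUploadSketch {
    interface UploadHandle {}  // opaque identifier for an in-progress upload
    interface PartHandle {}    // opaque identifier for one uploaded part

    UploadHandle startUpload(Path dest) throws IOException;
    PartHandle putPart(UploadHandle upload, int partNumber, InputStream data, long length)
        throws IOException;
    void complete(UploadHandle upload, Map<Integer, PartHandle> parts) throws IOException; // object appears
    void abort(UploadHandle upload) throws IOException;                                    // nothing appears
}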
I'm afraid you will have to go with the upload. Do remember to set a lifecycle rule on the bucket to delete old uploads, and look at the hadoop s3guard uploads command to list/abort them too.
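In the meantime, a workaround sketch under your setup: let the writer finish the upload, then delete the object if validation failed. getParquetWriter, Record, validate and someIterable are your placeholders; FileSystem.get() and delete() are the standard Hadoop calls:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

String dest = "s3a://my_bucket/my_object_path.snappy.parquet";
boolean valid = true;
ParquetWriter<Record> writer = getParquetWriter(dest);
try {
    for (Record r : someIterable) {
        validate(r);          // assumed to throw an unchecked exception on bad records
        writer.write(r);
    }
} catch (RuntimeException e) {
    valid = false;
    throw e;
} finally {
    writer.close();           // completes the multipart upload either way
    if (!valid) {
        // The object was created, but it can be deleted straight away.
        FileSystem fs = FileSystem.get(URI.create(dest), new Configuration());
        fs.delete(new Path(dest), false);
    }
}

This doesn't avoid the upload cost or the brief window in which the object exists, but it does leave the bucket clean, and the lifecycle rule catches any uploads left behind by a process that dies before reaching the finally block.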