apache-spark, emr, amazon-emr

Using Spark, is it possible to read from S3 and write to S3 without touching the disk?


Most of my scripts do something like the following.

spark.read().csv("s3://")
  .filter(..).map(...)
  .write().parquet("s3://");

Is there any way to tell Spark that I want all this work done in memory, since there is no aggregation or grouping within my processing? This should be a simple record-by-record stream processor that doesn't touch the disk at all.


Solution

  • I cannot speak for EMR and its S3 connector; I can speak for Apache Hadoop itself and the S3A connector.

    Generated data has to be buffered before it is uploaded to S3. You can't just stream() the bytes out and then close(), because for multi-GB files the upload has to be broken up into separate parts (a multipart upload), and even for smaller uploads you need to handle the common case of the app generating data faster than it can be uploaded to S3.

    Using transient local temp storage gives you the ability to generate data faster than your S3 upload bandwidth can handle, and cope with network errors by resending blocks.
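
    In the S3A connector that temp storage is just a configurable local directory. A minimal sketch of pointing it at fast local storage from Spark, assuming the standard Hadoop property fs.s3a.buffer.dir (the path below is purely illustrative):

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().getOrCreate();
    // fs.s3a.buffer.dir: where S3A buffers data awaiting upload
    // (defaults to a directory under hadoop.tmp.dir); pick the fastest,
    // largest local volume your workers actually have.
    spark.sparkContext().hadoopConfiguration()
        .set("fs.s3a.buffer.dir", "/mnt/fast-local/s3a");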

    Apache's original s3: and s3n: clients (and s3a prior to Hadoop 2.8) all wrote the entire file to local disk before starting the upload. The local storage needed equalled the number of bytes generated, and because the data was only uploaded in close(), the duration of that close() call was data/bandwidth.

    S3A in Hadoop 2.8+ supports fast upload (optional in 2.8+, automatic in 3.0), where data is buffered up to the size of a single block (5+ MB, default 64 MB) and the upload starts as soon as a block is full. This makes for faster writes; with enough bandwidth there is almost no close() delay (max: last-block-size/bandwidth). It still needs local storage to cope with a mismatch between generation and upload rates, though you can instead configure it to use on-heap byte arrays or off-heap byte buffers. Do that and you have to play very carefully with memory allocations and queue sizes: you need to configure the client to block the writers when the queue of pending uploads is big enough.
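
    All of the knobs above are ordinary Hadoop configuration options, so from Spark they can be passed through with the spark.hadoop. prefix. A hedged sketch, assuming the Hadoop 2.8+ S3A property names (the values are illustrative, not recommendations):

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("s3a-block-upload")
        // needed on Hadoop 2.8 only; the block-based output stream is always on in 3.x
        .config("spark.hadoop.fs.s3a.fast.upload", "true")
        // buffer pending blocks off-heap instead of on local disk
        // ("disk" is the default; "array" keeps them on the JVM heap)
        .config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
        // size of each buffered block / multipart part: 64 MB here
        .config("spark.hadoop.fs.s3a.multipart.size", "67108864")
        // how many blocks one stream may have queued or uploading before
        // the writer thread is blocked -- the back-pressure knob mentioned above
        .config("spark.hadoop.fs.s3a.fast.upload.active.blocks", "4")
        .getOrCreate();

    With bytebuffer (or array) buffering, a pipeline like the read().filter().write() in the question keeps the S3A write path off the local disk entirely, provided the writers really are blocked once the pending-upload queue fills up.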

    Update: Johnathan Kelly @ AWS has confirmed that EMR's connector does the same per-block buffer-and-upload as the ASF S3A connector. This means that if your data generation rate in bytes/sec is <= your upload bandwidth from the VM, the amount of local disk needed is minimal... if you generate data faster, you'll need more (and will eventually run out of disk or hit the queue limits that block the generator threads). I'm not going to quote any numbers on actual bandwidth, as it invariably improves year on year and any statement would soon be obsolete. For that reason, look at the age of any benchmark post before believing it, and do your own tests with your own workloads.