apache-spark, amazon-s3, rdd

Can data be distributed to different nodes when Spark reads a large file from S3


Suppose I have a large data file on S3 and want to load it into a Spark cluster to perform some data processing. When I use sc.textFile(filepath) to load the file into an RDD, will each node in my cluster store a portion of the RDD, so the data is distributed across the nodes? Or will the whole file be stored on one node and then replicated over the cluster? And what if the file is larger than that node's memory?
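For reference, this is roughly how I'm loading it (a minimal sketch; the bucket and path are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("s3-read-example")
val sc = new SparkContext(conf)

// Read the large file from S3 via the s3a connector (placeholder path)
val rdd = sc.textFile("s3a://my-bucket/path/to/large-file.txt")

// How many partitions did Spark create for this file?
println(s"Number of partitions: ${rdd.getNumPartitions}")
```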

Thanks!


Solution

  • There's no data locality with S3, so Spark can schedule the work on any node that has capacity.

    However, it will only break the file up for parallel processing if the format is splittable. Avro, ORC, Parquet, and CSV all split nicely unless they're compressed with gzip. Plain text files? Not as far as I know. A quick way to see the difference is sketched below.
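As a rough illustration (bucket and file names are hypothetical), you can compare the partition counts Spark produces for a splittable file versus a gzip-compressed one:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("split-check"))

// A splittable file can be divided into many partitions; you can also
// request a minimum number of splits explicitly.
val csv = sc.textFile("s3a://my-bucket/data/events.csv", minPartitions = 64)
println(s"CSV partitions: ${csv.getNumPartitions}")   // typically > 1 for a large file

// A gzip-compressed file is not splittable, so it comes back as a single
// partition regardless of the requested minimum.
val gz = sc.textFile("s3a://my-bucket/data/events.csv.gz", minPartitions = 64)
println(s"Gzip partitions: ${gz.getNumPartitions}")   // 1

// If the source couldn't be split, you can repartition after reading to
// spread the data across executors (at the cost of a shuffle).
val spread = gz.repartition(64)
```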