amazon-web-services apache-spark amazon-s3 parquet amazon-athena

Are Parquet files splittable when stored in AWS S3?


  • I know that Parquet files are splittable when they are stored in block storage, e.g. on HDFS.
  • Are they also splittable when stored in object storage such as AWS S3?
  • This confuses me because object storage is supposed to be atomic: you either access the entire file or none of it. You can't even change metadata on an S3 file without rewriting the entire file. On the other hand, AWS recommends using splittable file formats in S3 to improve the performance of Athena and other frameworks in the Hadoop ecosystem.

Solution

  • Yes, Parquet files are still splittable when stored in S3.

    S3 supports positioned reads (range requests), which can be used to read only selected portions of the input file (object).
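
    To illustrate how range requests make a split possible, here is a minimal sketch in Python. It does not talk to S3: `RangeReadableObject` is a hypothetical in-memory stand-in whose `get_range` behaves like an HTTP `Range: bytes=start-end` request, and the footer content is a placeholder (a real Parquet footer is Thrift-encoded metadata). The only format facts it relies on are that a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. Against a real bucket, the same reads could presumably be issued with boto3's `get_object(Bucket=..., Key=..., Range="bytes=start-end")`.

```python
import struct

MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes


class RangeReadableObject:
    """Hypothetical stand-in for an S3 object that is read via ranged GETs."""

    def __init__(self, data: bytes):
        self._data = data
        self.bytes_fetched = 0  # how many bytes ranged reads have returned

    def size(self) -> int:
        return len(self._data)

    def get_range(self, start: int, end: int) -> bytes:
        """Return bytes start..end inclusive, like 'Range: bytes=start-end'."""
        chunk = self._data[start:end + 1]
        self.bytes_fetched += len(chunk)
        return chunk


def read_footer(obj: RangeReadableObject) -> bytes:
    """Fetch only the footer using two small ranged reads."""
    size = obj.size()
    # Last 8 bytes: 4-byte little-endian footer length, then the magic.
    tail = obj.get_range(size - 8, size - 1)
    footer_len = struct.unpack("<I", tail[:4])[0]
    if tail[4:] != MAGIC:
        raise ValueError("object does not end with the Parquet magic bytes")
    footer_start = size - 8 - footer_len
    return obj.get_range(footer_start, footer_start + footer_len - 1)


# Toy object laid out like a Parquet file: magic, two "row groups", footer.
row_group_a = b"A" * 1000
row_group_b = b"B" * 1000
toy_footer = b"placeholder-footer"  # assumption: real footers are Thrift metadata
data = (MAGIC + row_group_a + row_group_b + toy_footer
        + struct.pack("<I", len(toy_footer)) + MAGIC)

obj = RangeReadableObject(data)
footer = read_footer(obj)

# A worker assigned the second row group reads only that byte range.
# In real Parquet the offset and length would come from the footer metadata.
second_group = obj.get_range(4 + len(row_group_a),
                             4 + len(row_group_a) + len(row_group_b) - 1)
```

    The point of the sketch is that each worker can fetch the footer plus its own row group without downloading the rest of the object, which is what lets engines such as Spark or Athena treat row groups as independent splits.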