Tags: r, hive, pyspark, amazon-emr, sparklyr

Filtering data while reading from S3 to Spark


We are moving to AWS EMR/S3 and using R (the sparklyr library) for analysis. We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze data for a couple of products and read only that subset of the file into EMR.

So far my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for the products we are interested in?


Solution

  • In short, your current format (CSV) sits at the inefficient end of the spectrum for this kind of selective read.

    Using data that is:

    • Partitioned by the column of interest (the partitionBy option of the DataFrameWriter, or an equivalent directory structure).
    • Clustered by the column of interest (the bucketBy option of the DataFrameWriter plus a persistent metastore).

    can help narrow the scan to particular partitions in some cases (see the sketch after this list), but if filter(product == p1) is highly selective, then you're likely looking at the wrong tool.
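
    For illustration, a minimal PySpark sketch of both layouts. The bucket name, paths, and the product column are assumptions, not taken from the original question:

        # Assumed paths and column names; adjust to your layout.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("partition-example").getOrCreate()
        sales = spark.read.csv("s3://my-bucket/sales/", header=True, inferSchema=True)

        # Partitioned layout: one directory per product value. A later filter
        # on the partition column prunes directories, so only the matching
        # partitions are scanned.
        sales.write.partitionBy("product").parquet("s3://my-bucket/sales_by_product/")
        subset = (spark.read
                       .parquet("s3://my-bucket/sales_by_product/")
                       .filter("product = 'p1'"))

        # Bucketed layout: bucketBy requires a persistent metastore,
        # hence saveAsTable rather than a plain path write.
        (sales.write
              .bucketBy(16, "product")
              .sortBy("product")
              .saveAsTable("sales_bucketed", format="parquet"))

    The same idea carries over to sparklyr: write the data partitioned by product once, then point spark_read_parquet at the partitioned path and filter on the partition column.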

    Depending on the requirements:

    • A proper database.
    • A data warehouse on Hadoop (e.g., Hive).

    might be a better choice (a minimal JDBC sketch follows below).
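
    With a relational database, Spark's JDBC source can push the product filter down to the database, so only matching rows ever leave it. A minimal sketch, assuming a PostgreSQL instance with its JDBC driver on the classpath; the host, table, and credentials are hypothetical:

        # Hypothetical connection details; the subquery runs inside the
        # database, so Spark receives only rows for the requested product.
        df = (spark.read
                   .format("jdbc")
                   .option("url", "jdbc:postgresql://db-host:5432/sales")
                   .option("dbtable", "(SELECT * FROM sales WHERE product = 'p1') AS t")
                   .option("user", "analyst")
                   .option("driver", "org.postgresql.Driver")
                   .load())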

    You should also consider choosing a better storage format (like Parquet).
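
    A one-time conversion to Parquet helps even without partitioning: Parquet is columnar and stores per-row-group statistics, so Spark can skip unneeded columns and many non-matching row groups. A sketch under the same assumed paths as above:

        from pyspark.sql import functions as F

        # One-time conversion from CSV to Parquet (assumed paths).
        csv_df = spark.read.csv("s3://my-bucket/sales/", header=True, inferSchema=True)
        csv_df.write.parquet("s3://my-bucket/sales_parquet/")

        # Subsequent reads benefit from column pruning and predicate pushdown.
        subset = (spark.read
                       .parquet("s3://my-bucket/sales_parquet/")
                       .filter(F.col("product").isin("p1", "p2")))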