Tags: r, hive, pyspark, amazon-emr, sparklyr

Filtering data while reading from S3 to Spark


We are moving to AWS EMR/S3 and using R (the sparklyr library) for analysis. We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze data for a couple of products and read only that subset of the file into EMR.

So far my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for the products we are interested in?


Solution

  • In short, your current format (CSV) sits at the inefficient end of the spectrum for this kind of selective read.

    Using data that is:

    • Partitioned by the column of interest (the partitionBy option of the DataFrameWriter, or an equivalent directory structure).
    • Clustered by the column of interest (the bucketBy option of the DataFrameWriter plus a persistent metastore).

    can help narrow the scan to particular partitions in some cases (see the sketch after this list), but if filter(product == p1) is highly selective, then you're likely looking at the wrong tool.
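
    For illustration, a minimal PySpark sketch of both layouts. The bucket name, paths, and the product column are assumptions, not taken from the original question:

        # Assumed paths and column names; adjust to your layout.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("partition-example").getOrCreate()
        sales = spark.read.csv("s3://my-bucket/sales/", header=True, inferSchema=True)

        # Partitioned layout: one directory per product value. A later filter
        # on the partition column prunes directories, so only the matching
        # partitions are scanned.
        sales.write.partitionBy("product").parquet("s3://my-bucket/sales_by_product/")
        subset = (spark.read
                       .parquet("s3://my-bucket/sales_by_product/")
                       .filter("product = 'p1'"))

        # Bucketed layout: bucketBy requires a persistent metastore,
        # hence saveAsTable rather than a plain path write.
        (sales.write
              .bucketBy(16, "product")
              .sortBy("product")
              .saveAsTable("sales_bucketed", format="parquet"))

    The same idea carries over to sparklyr: write the data partitioned by product once, then point spark_read_parquet at the partitioned path and filter on the partition column.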

    Depending on the requirements:

    • A proper database.
    • A data warehouse on Hadoop (e.g., Hive).

    might be a better choice (a minimal JDBC sketch follows below).
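
    With a relational database, Spark's JDBC source can push the product filter down to the database, so only matching rows ever leave it. A minimal sketch, assuming a PostgreSQL instance with its JDBC driver on the classpath; the host, table, and credentials are hypothetical:

        # Hypothetical connection details; the subquery runs inside the
        # database, so Spark receives only rows for the requested product.
        df = (spark.read
                   .format("jdbc")
                   .option("url", "jdbc:postgresql://db-host:5432/sales")
                   .option("dbtable", "(SELECT * FROM sales WHERE product = 'p1') AS t")
                   .option("user", "analyst")
                   .option("driver", "org.postgresql.Driver")
                   .load())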

    You should also consider choosing a better storage format (like Parquet).
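
    A one-time conversion to Parquet helps even without partitioning: Parquet is columnar and stores per-row-group statistics, so Spark can skip unneeded columns and many non-matching row groups. A sketch under the same assumed paths as above:

        from pyspark.sql import functions as F

        # One-time conversion from CSV to Parquet (assumed paths).
        csv_df = spark.read.csv("s3://my-bucket/sales/", header=True, inferSchema=True)
        csv_df.write.parquet("s3://my-bucket/sales_parquet/")

        # Subsequent reads benefit from column pruning and predicate pushdown.
        subset = (spark.read
                       .parquet("s3://my-bucket/sales_parquet/")
                       .filter(F.col("product").isin("p1", "p2")))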