We are moving to AWS EMR/S3 and using R (the sparklyr library) for analysis. We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze data for a couple of products and want to read only a subset of the file into EMR.
So far my understanding is that `spark_read_csv` will pull in all the data. Is there a way in R/Python/Hive to read data only for the products we are interested in?
In short, your choice of format (CSV) is at the opposite end of the efficiency spectrum: there is no way to push a filter down into the CSV reader, so a scan has to touch the whole file.
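What you *can* do within sparklyr is register the CSV lazily (`memory = FALSE`) and filter before caching, so only the matching rows are materialized in the cluster; Spark still scans all 500 GB once. A minimal sketch — the bucket path and column name are assumptions:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")  # on the EMR cluster

# memory = FALSE registers the table without caching the full 500 GB
sales <- spark_read_csv(
  sc, name = "sales",
  path = "s3a://my-bucket/sales/",    # hypothetical path
  memory = FALSE
)

# The filter runs when the query executes; Spark still reads every CSV
# byte once, but only the matching rows are kept and cached.
p1 <- sales %>%
  filter(product == "p1") %>%
  compute("sales_p1")                 # cache just the subset
```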
Using data:

- partitioned by the column of interest (the `partitionBy` option of the `DataFrameWriter`, or the corresponding directory structure), or
- bucketed by the column of interest (the `bucketBy` option of the `DataFrameWriter` and a persistent metastore),

can help to narrow down the search to particular partitions in some cases, but if `filter(product == p1)` is highly selective, then you're likely looking at the wrong tool.
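As a sketch of the partitioning route: a one-time rewrite partitioned by `product` puts each product's rows in its own directory, so later reads of one product touch only that directory (partition pruning). Paths below are hypothetical; sparklyr exposes `partitionBy` via the `partition_by` argument:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")

sales <- spark_read_csv(sc, "sales", "s3a://my-bucket/sales/",  # hypothetical
                        memory = FALSE)

# One-time rewrite: one subdirectory per product value (partitionBy)
spark_write_parquet(sales, "s3a://my-bucket/sales_by_product/",
                    partition_by = "product")

# Later reads with a filter on product scan only the product=p1/ directory
p1 <- spark_read_parquet(sc, "sales_p", "s3a://my-bucket/sales_by_product/",
                         memory = FALSE) %>%
  filter(product == "p1")
```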
Depending on the requirements, a storage system built for selective lookups (a proper database, or a data warehouse on Hadoop) might be a better choice.
You should also consider choosing a better storage format (like Parquet).
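Even without partitioning, a one-time conversion to Parquet pays off: Parquet stores per-column statistics, so filters can skip whole row groups (predicate pushdown), and only the columns you reference are read from S3 (column pruning). A sketch with hypothetical paths and column names:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")

# One-time conversion from CSV to Parquet
spark_read_csv(sc, "sales_csv", "s3a://my-bucket/sales/",  # hypothetical
               memory = FALSE) %>%
  spark_write_parquet("s3a://my-bucket/sales_parquet/")

# Filters are pushed down to the Parquet reader, and only the
# selected columns are fetched from S3.
subset <- spark_read_parquet(sc, "sales", "s3a://my-bucket/sales_parquet/",
                             memory = FALSE) %>%
  filter(product %in% c("p1", "p2")) %>%
  select(product, date, amount)       # hypothetical column names
```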