Tags: scala, apache-spark, amazon-s3, hive, emr

Data partitioning in S3


We have our data in a relational database, in a single table with columns productid and date, like this:

productid   date        value1  value2
1           2005-10-26  24      27
1           2005-10-27  22      28
2           2005-10-26  12      18

We are trying to load the data into S3 as Parquet and create metadata in Hive so we can query it using Athena and Redshift. Our most frequent queries will filter on product id, day, month, and year, so we are trying to partition the data in a way that gives better query performance.

From what I understand, I can create the partitions like this:

s3://my-bucket/my-dataset/dt=2017-07-01/   
...
s3://my-bucket/my-dataset/dt=2017-07-09/   
s3://my-bucket/my-dataset/dt=2017-07-10/

or like this,

s3://mybucket/year=2017/month=06/day=01/
s3://mybucket/year=2017/month=06/day=02/
...
s3://mybucket/year=2017/month=08/day=31/
  1. Which will be faster in terms of query performance, given that I have 7 years of data?
  2. Also, how can I add partitioning on product id here so that queries are faster?
  3. How can I create this folder structure (s3://mybucket/year=2017/month=06/day=01/) with key=value directories using Spark Scala? Any examples?

Solution

  • We partitioned like this,

    s3://bucket/year/month/day/hour/minute/product/region/availabilityzone/

    s3://bucketname/2018/03/01/11/30/nest/e1/e1a

    Minutes are rounded to 30-minute buckets. If your traffic is high, you can go for a higher resolution on minutes, or you can reduce the granularity to the hour or even the day.
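
    For illustration, the 30-minute rounding can be derived as a Spark column expression. This is a sketch, not the original code; the column name ts is an assumption:

        import org.apache.spark.sql.Column
        import org.apache.spark.sql.functions.{col, floor, lit, minute}

        // Round a timestamp's minute component down to a 30-minute
        // bucket: 11:07 -> 0, 11:42 -> 30.
        def minuteBucket(ts: Column): Column =
          floor(minute(ts) / lit(30)) * lit(30)

        // Usage, assuming a DataFrame df with a timestamp column "ts":
        // df.withColumn("minute", minuteBucket(col("ts")))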

    This helped a lot, because queries (via Athena or Redshift Spectrum) only scan the partitions that match the data and time range being asked for.
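
    To answer question 3 from the post: Spark creates the key=value directories itself when you use partitionBy on the DataFrame writer. The sketch below derives the partition columns from the date column and writes Parquet to S3; the JDBC options, bucket name, and table name are placeholders, not the asker's actual values:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{col, dayofmonth, month, year}

        object PartitionedParquetWriter {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("partitioned-parquet-writer")
              .getOrCreate()

            // Load the source table (placeholder JDBC settings).
            val df = spark.read
              .format("jdbc")
              .option("url", "jdbc:postgresql://dbhost:5432/mydb")
              .option("dbtable", "product_values")
              .load()

            // Derive partition columns from the date column; partitionBy
            // writes each one as a key=value directory, e.g.
            // s3://mybucket/my-dataset/year=2017/month=6/day=1/
            df.withColumn("year", year(col("date")))
              .withColumn("month", month(col("date")))
              .withColumn("day", dayofmonth(col("date")))
              .write
              .partitionBy("year", "month", "day") // prepend "productid" to also partition by product
              .mode("overwrite")
              .parquet("s3://mybucket/my-dataset/")

            spark.stop()
          }
        }

    Note that Spark writes unpadded partition values (month=6, not month=06); if zero-padded keys matter, build the partition columns as strings with format_string("%02d", month(col("date"))). A Hive-style key=value layout also lets Athena discover partitions with MSCK REPAIR TABLE.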

    Hope it helps.