We have our data in relational database in single table with columns id and date as this.
productid date value1 value2
1 2005-10-26 24 27
1 2005-10-27 22 28
2 2005-10-26 12 18
Trying to load them to s3 as parquet and create metadata in hive to query them using athena and redshift. Our most frequent queries will be filtering on product id, day, month and year. So trying to load the data partitions in a way to have better query performance.
From what i understood, I can create the partitions like this
s3://my-bucket/my-dataset/dt=2017-07-01/
...
s3://my-bucket/my-dataset/dt=2017-07-09/
s3://my-bucket/my-dataset/dt=2017-07-10/
or like this,
s3://mybucket/year=2017/month=06/day=01/
s3://mybucket/year=2017/month=06/day=02/
...
s3://mybucket/year=2017/month=08/day=31/
We partitioned like this,
s3://bucket/year/month/year/day/hour/minute/product/region/availabilityzone/
s3://bucketname/2018/03/01/11/30/nest/e1/e1a
minute is rounded to 30 mins. If you traffic is high, you can go for higher resolution for minutes or you can reduce by hour or even by day.
It helped a lot based on what data we want to query (using Athena or Redshift Spectrum) and for what time duration.
Hope it helps.