I have a dataset stored in AWS S3 and query using AWS Athena. The data is currently structured in two columns in S3 as follows:
Dataset
Customer_ID | Products_csv
C1 | P1, P2, P3
C2 | P2, P4
With additional requirement, one more field needs to be added, which is quantity of product and this is how I planned on storing it :
Customer_ID | Product:Quantity(csv)
C1 | P1:3, P2:1, P3:5
C2 | P2:2, P4:7
If the above format is followed for table in athena, it would not be simpler to perform the required operations on data. For example, if only products with quantity(>=3) has to be retrieved per customer (which is P1,P3 for customer1 and P4 for customer2), the query would be complex with above structure or might not scale for huge amount of data.
Data size
Any ideas on how to store the product & quantity column data in a alternative format that is optimum for querying would be greatly appreciated.
A more-traditional format would be separate rows for each relationship.
For example:
Customer_ID, Product, Quantity
C1, P1, 3
C1, P2, 1
C1, P3, 5
C2, P2, 2
C2, P4, 7
Millions of rows are no problem. The simpler format would be easier for Athena to process.
You would get much better performance if the data is stored in Parquet format. It is possible to use Athena to convert the data into Parquet, but start by using CSV to prove that everything works.