I have 100 tables in the AWS Glue Data Catalog, named in the format:
user_1
user_2
...
user_100
Each table has this format:
| id | user_id | date | expense |
|----|---------|----------|---------|
| 1 | 1 | 20200521 | 200 |
| 2 | 2 | 20200601 | 100 |
| 3 | 1 | 20200603 | 90 |
Every table has the same schema, where one column is `expense` (int type). Each user's rows get randomly inserted into one of these 100 tables.
What I want to do is compute the sum of the `expense` column across all of the tables for a given user.
What is the most efficient way of doing this, rather than creating 100 DynamicFrames and joining them? I suppose I could create 100 data sources, one for each table, but is there an easier way to select the rows for a given user from all 100 tables and get the sum?
Thank you.
You are using the Glue catalog, so the data must be stored in S3. You can build a list of all the tables' S3 paths and read them as one DataFrame; this way you can also apply pushdown conditions.
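To build that path list, one option is to enumerate the catalog tables with boto3 and collect each table's storage location. A minimal sketch, assuming the tables live in a Glue database named `users_db` (hypothetical name):

```python
import boto3

glue = boto3.client("glue")
paths = []
# get_tables is paginated, so walk every page of the database
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="users_db"):  # "users_db" is a placeholder
    for table in page["TableList"]:
        if table["Name"].startswith("user_"):  # keep only the user_1..user_100 tables
            paths.append(table["StorageDescriptor"]["Location"])
```

With the paths in hand, read and aggregate: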
```python
from pyspark.sql import functions as F

paths = [path1, path2, ..., path100]  # S3 paths of the 100 tables
# csv() accepts a list of paths, so all tables are read as one DataFrame;
# header/inferSchema assume the files carry a header row
df = spark.read.csv(paths, header=True, inferSchema=True)
df.groupBy("user_id").agg(F.sum("expense")).show()
```
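And since the question asks for a single user, you can filter before aggregating. With a columnar format like Parquet the predicate would be pushed down to the scan; with CSV, Spark still reads the files in full. A sketch, assuming the target user id is 1 (hypothetical value):

```python
# Total expense for one user across all 100 tables; user id 1 is a placeholder
df.filter(F.col("user_id") == 1) \
  .agg(F.sum("expense").alias("total_expense")) \
  .show()
```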