apache-spark, apache-spark-sql, aws-glue, pyspark

Selecting data from 100 tables in AWS Glue/PySpark


I have 100 tables in the AWS Glue catalog, named in the format:

user_1
user_2
...
user_100

Each table has this format:

| id | user_id | date     | expense |
|----|---------|----------|---------|
| 1  | 1       | 20200521 | 200     |
| 2  | 2       | 20200601 | 100     |
| 3  | 1       | 20200603 | 90      |

Every table has the same schema, where one column is expense (int type). Each user's rows get randomly inserted into one of these 100 tables.

What I want to do is compute the sum of the expense column across all of the tables for a given user.

What is the most efficient way of doing this, rather than creating 100 DynamicFrames and joining them? I suppose I could create 100 data sources, one for each table, but is there an easier way to select the rows for a given user from all 100 tables and get the sum?

Thank you.


Solution

  • You are using the Glue catalog, so the data must be stored in S3. You can build a list of all the tables' S3 paths and read them as a single DataFrame; this way you can also apply push-down predicates. A minimal sketch:

    from pyspark.sql import functions as F

    # S3 paths of all 100 tables; read.csv accepts a list of paths
    paths = [path1, path2, ..., path100]
    df = spark.read.csv(paths, header=True, inferSchema=True)  # adjust options to match your files

    df.groupBy("user_id").agg(F.sum("expense")).show()
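
If you only need the total for a single user rather than per-user sums, you can filter before aggregating. A minimal sketch under the same assumptions as above (the user_id value 1 is just a placeholder for the user you are interested in):

    # Total expense for one user across all 100 tables
    df.filter(F.col("user_id") == 1) \
      .agg(F.sum("expense").alias("total_expense")) \
      .show()

To avoid hard-coding the 100 paths, one option is to pull them from the Glue catalog itself with boto3. A sketch, assuming the tables are registered in a Glue database named my_database (a placeholder name):

    import boto3

    glue = boto3.client("glue")
    paths = []
    # Iterate over every table in the database and collect its S3 location
    for page in glue.get_paginator("get_tables").paginate(DatabaseName="my_database"):
        for table in page["TableList"]:
            paths.append(table["StorageDescriptor"]["Location"])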