apache-spark, apache-spark-sql, aws-glue, pyspark

Selecting data from 100 tables in AWS Glue/PySpark


I have 100 tables in the AWS Glue catalog, named in the format:

user_1
user_2
...
user_100

Each table has this format:

| id | user_id | date     | expense |
|----|---------|----------|---------|
| 1  | 1       | 20200521 | 200     |
| 2  | 2       | 20200601 | 100     |
| 3  | 1       | 20200603 | 90      |

Every table has the same schema, where one column is expense (int type). Each user's rows get randomly inserted into one of these 100 tables.

What I want to do is compute the sum of the expense column across all of the tables for a given user.

What is the most efficient way of doing this, rather than creating 100 DynamicFrames and joining them? I suppose I could create 100 data sources, one for each table, but is there an easier way to select the rows for a given user from all 100 tables and get the sum?

Thank you.


Solution

  • You are using the Glue catalog, so the data must be stored in S3. You can build a list of all the tables' S3 paths and read them as a single DataFrame; this way you can also apply push-down predicates. A minimal sketch:

    from pyspark.sql import functions as F

    # S3 paths of all 100 tables; read.csv accepts a list of paths
    paths = [path1, path2, ..., path100]
    df = spark.read.csv(paths, header=True, inferSchema=True)  # adjust options to match your files

    df.groupBy("user_id").agg(F.sum("expense")).show()
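
If you only need the total for a single user rather than per-user sums, you can filter before aggregating. A minimal sketch under the same assumptions as above (the user_id value 1 is just a placeholder for the user you are interested in):

    # Total expense for one user across all 100 tables
    df.filter(F.col("user_id") == 1) \
      .agg(F.sum("expense").alias("total_expense")) \
      .show()

To avoid hard-coding the 100 paths, one option is to pull them from the Glue catalog itself with boto3. A sketch, assuming the tables are registered in a Glue database named my_database (a placeholder name):

    import boto3

    glue = boto3.client("glue")
    paths = []
    # Iterate over every table in the database and collect its S3 location
    for page in glue.get_paginator("get_tables").paginate(DatabaseName="my_database"):
        for table in page["TableList"]:
            paths.append(table["StorageDescriptor"]["Location"])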