I have two source tables, call them `table1` and `table2`, and a destination table, `table3`. Information needs to be extracted from columns of one table and columns of the other, then combined to produce the column entries of the new table.

Think of it as a complex transformation. For example: a value extracted from `column1` of `table1` and the complete text in `column1` of `table2` are combined into 4 rows of `column1` in the new transformed table (the row count depends on the JSON stored in `column1` of `table1`). So it's not a 1-to-1 mapping between one table and another, but a 1-to-many mapping, where one row from each of the two source tables translates to many rows of the destination table.

Is this something that Glue jobs can accomplish, or am I better off just writing a throwaway Python script? You can assume that table size is not a concern.
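To make the shape of the mapping concrete, here is some made-up data (the schemas, keys, and values are all invented, purely for illustration):

```python
# One row from each source table (hypothetical schemas).
table1_row = {"id": 1, "column1": '["a", "b", "c", "d"]'}  # JSON array as text
table2_row = {"id": 1, "column1": "full text"}

# Desired result in table3: one row per JSON element, each element
# combined with the text from table2 -- a 1-to-many mapping.
table3_rows = [
    {"id": 1, "column1": "a full text"},
    {"id": 1, "column1": "b full text"},
    {"id": 1, "column1": "c full text"},
    {"id": 1, "column1": "d full text"},
]
```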
Provided you plan to run this process at some frequency, this is a perfect use case for Glue. If it's just a one-off, Glue is also a fine choice, but Glue is primarily designed for repeated use.
In your Glue script I expect you will end up joining the two tables, then deriving the new result columns and rows by combining your existing columns. The typical pattern is to convert the dynamic frames (created by Glue) into PySpark data frames, do the transformation work in PySpark, and convert back to a dynamic frame before writing to the destination, as in the sketch below.
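Here is a minimal sketch of that pattern. The specifics are assumptions on my part: the catalog database name (`my_db`), the join key (`id`), that `table1.column1` holds a JSON array of strings, how the values get combined, and DynamoDB as the sink (swap in whatever destination you actually use):

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

glue_context = GlueContext(SparkContext.getOrCreate())

# Load both source tables from the Glue Data Catalog and convert the
# dynamic frames to Spark DataFrames for the transformation work.
df1 = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="table1").toDF()
df2 = glue_context.create_dynamic_frame.from_catalog(
    database="my_db", table_name="table2").toDF()

# Join on whatever key relates the two tables ("id" is assumed here).
joined = df1.alias("t1").join(df2.alias("t2"), on="id")

# Parse the JSON array stored in table1.column1, then explode it so each
# array element becomes its own output row -- the 1-to-many step.
exploded = (
    joined
    .withColumn("items", F.from_json(F.col("t1.column1"),
                                     ArrayType(StringType())))
    .withColumn("item", F.explode("items"))
)

# Combine each exploded element with the full text from table2.column1.
result = exploded.select(
    "id",
    F.concat_ws(" - ", F.col("item"), F.col("t2.column1")).alias("column1"),
)

# Convert back to a dynamic frame and write out to the destination.
out = DynamicFrame.fromDF(result, glue_context, "out")
glue_context.write_dynamic_frame_from_options(
    frame=out,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "table3"},
)
```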
Note that depending on your design you may not need to add rows at all; it depends on the outcome you are seeking, but DynamoDB does have support for some nifty hierarchical approaches that may remove your need for multiple rows.
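For example (again with invented keys and values), instead of writing four separate rows you could keep the four combined values as a single list attribute on one item:

```python
import boto3

# Hypothetical single-item alternative to four rows: store the pieces as
# a nested list attribute on one DynamoDB item.
table = boto3.resource("dynamodb").Table("table3")
table.put_item(Item={
    "id": "record-1",  # partition key (assumed)
    "column1": ["a full text", "b full text", "c full text", "d full text"],
})
```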
If you have more specific examples of schema and the outcomes you are seeking, I could show you a bit of example code.