Tags: python, pyspark, aws-glue

AWS Glue add column value as a column in another DynamicFrame


I'm new to AWS Glue and PySpark, so I'm having some trouble with a transformation job. I have two DynamicFrames. One of them has a column whose distinct values need to become separate columns in the other frame, and each of those new columns should hold the value from the first table's `value` column for the row with the matching id. Here's how it looks:

Table 1             Table2
+--+-----+-----+    +--+-----+-----+
|id|name |value|    |id|col1 |col2 |
+--+-----+-----+    +--+-----+-----+
| 1|name1| 10  |    | 1|str1 |val1 |
+--+-----+-----+    +--+-----+-----+
| 2|name2| 20  |    | 2|str2 |val2 |
+--+-----+-----+    +--+-----+-----+

I need the new format to be:

Table2
+--+-----+-----+-----+-----+
|id|col1 |col2 |name1|name2|
+--+-----+-----+-----+-----+
| 1|str1 |val1 | 10  |     |  <--- add 10 only here because the id from the row in the first table must match the id from the second table
+--+-----+-----+-----+-----+
| 2|str2 |val2 |     | 20  |  <--- add 20 only here because the id from the row in the first table must match the id from the second table
+--+-----+-----+-----+-----+

Solution

  • Suppose the two DynamicFrames have been converted to Spark DataFrames named df1 and df2 (in Glue you can convert a DynamicFrame with .toDF()).

    # Pivot table 1: one row per id, one column per distinct name,
    # each cell holding the summed value for that (id, name) pair
    df3 = df1.groupBy('id').pivot('name').sum('value')
    # Attach the pivoted columns to table 2 by matching id
    df4 = df2.join(df3, on='id', how='inner')
    df4.show(truncate=False)
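If you want to sanity-check the pivot-then-join shape without spinning up a Spark session, here is a minimal plain-Python sketch of the same logic. The dict-based tables mirror the example data above and are only for illustration:

```python
# Table 1: (id, name, value) rows; Table 2: (id, col1, col2) rows
table1 = [
    {"id": 1, "name": "name1", "value": 10},
    {"id": 2, "name": "name2", "value": 20},
]
table2 = [
    {"id": 1, "col1": "str1", "col2": "val1"},
    {"id": 2, "col1": "str2", "col2": "val2"},
]

# Emulate groupBy('id').pivot('name').sum('value'): one row per id,
# one column per distinct name, summing values per (id, name) cell.
names = sorted({r["name"] for r in table1})
pivoted = {}
for r in table1:
    row = pivoted.setdefault(r["id"], {n: None for n in names})
    row[r["name"]] = (row[r["name"]] or 0) + r["value"]

# Emulate the inner join on id: keep table2 rows whose id appears
# in the pivoted table, and merge in the pivoted columns.
result = [{**r, **pivoted[r["id"]]} for r in table2 if r["id"] in pivoted]
```

Each row of `result` now has the original table2 columns plus one column per name, filled only where the id matched (None elsewhere), which matches the desired output table.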