Tags: dataframe, amazon-s3, pyspark, aws-glue, aws-glue-spark

Issues using mergeDynamicFrame on AWS Glue


I need to merge two dynamic frames in Glue. I tried to use the mergeDynamicFrame function, but I keep getting the same error:

AnalysisException: "cannot resolve 'id' given input columns: [];;\n'Project ['id]\n+- LogicalRDD false\n"

Right now, I have two dynamic frames: df_1(id, col1, salary_src) and df_2(id, name, salary).

I want to merge df_2 into df_1 by the "id" column.

df_1 = glueContext.create_dynamic_frame.from_catalog(......)
df_2 = glueContext.create_dynamic_frame.from_catalog(....)

merged_frame = df_1.mergeDynamicFrame(df_2, ["id"]) 

applymapping1 = ApplyMapping.apply(frame = merged_frame, mappings = [("id", "long", "id", "long"), ("col1", "string", "name", "string"), ("salary_src", "long", "salary", "long")], transformation_ctx = "applymapping1")

datasink2 = glueContext.write_dynamic_frame.from_options(....)

As a test, I tried passing one column from each frame (salary and salary_src) as the keys, and the error was:

AnalysisException: "cannot resolve 'salary_src' given input columns: [id, name, salary];;\n'Project [salary#2, 'salary_src]\n+- LogicalRDD [id#0, name#1, salary#2], false\n"

In this case, it seems to recognize the columns from df_2 (id, name, salary), but whether I pass just one of the columns or all three, it keeps failing.


Solution

  • It doesn't appear to be a mergeDynamicFrame issue.

    Based on the information you provided, it looks like df_1, df_2, or both are not reading any data and are returning an empty DynamicFrame, which is why the first error shows an empty list of input columns ("input columns: []").

    If you are reading from S3, the data must already be registered as a table in the Glue Data Catalog (for example, by running a crawler over it) before you can use glueContext.create_dynamic_frame.from_catalog.

    As a troubleshooting step, you can also call df_1.show() or df_1.printSchema() (and the same for df_2) right after you create each dynamic frame, to make sure the data is being read correctly before merging.
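
The check suggested above can be sketched as a small helper that looks at the column names of each frame before mergeDynamicFrame is called. This is only an illustration: check_merge_keys is a hypothetical name, not part of the Glue API, and it assumes you obtain the column lists yourself inside the job (for instance from df_1.toDF().columns). The lists are hard-coded here from the question so the snippet runs without a Glue environment.

```python
def check_merge_keys(frames, keys):
    """Return a list of problems found; an empty list means the keys look usable.

    frames: dict mapping a frame name to its list of column names
    keys:   the key column names you intend to pass to mergeDynamicFrame
    (Hypothetical helper for illustration, not a Glue function.)
    """
    problems = []
    for name, columns in frames.items():
        if not columns:
            # No columns at all usually means nothing was read from the source.
            problems.append(f"{name} has no columns (empty frame?)")
            continue
        missing = [k for k in keys if k not in columns]
        if missing:
            problems.append(f"{name} is missing key column(s): {', '.join(missing)}")
    return problems


# Inside a Glue job the lists would come from the frames themselves, e.g.
#   frames = {"df_1": df_1.toDF().columns, "df_2": df_2.toDF().columns}
# Assumption: column names below are copied from the question.
frames = {
    "df_1": ["id", "col1", "salary_src"],
    "df_2": ["id", "name", "salary"],
}

print(check_merge_keys(frames, ["id"]))
print(check_merge_keys(frames, ["salary", "salary_src"]))
print(check_merge_keys({"df_1": [], "df_2": frames["df_2"]}, ["id"]))
```

With the columns from the question, "id" passes for both frames, while the salary/salary_src test reports a missing key on each side, which matches the second error message. The last call shows the empty-frame case that corresponds to "input columns: []".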