Search code examples
amazon-web-servicesamazon-s3aws-glueaws-glue-data-catalog

Read CSV into AWS Glue and join with data from Data Catalogue


I'm fairly new to AWS Glue and would like to understand how to do the following:

  1. Pull a CSV file from a URL in AWS Glue
  2. Join the dataset with a column from a table I have in the Data Catalogue.
  3. Write this back to the Data Catalogue as a new table.

So far I have this:

  tableA_DF = pandas.read_cv("https://example.com/file.csv")
  tableB = glueContext.create_dynamic_frame.from_catalog(database=Z, table_name = Y)
  tableB_DF = TableB.toDF()

However, I'm not sure on how to join the two. Would like to simply add one column from tableB to tableA, and then store the result in the Data catalogue.


Solution

  • You can read directly into Dynamicframe from S3 using create_dynamic_frame_from_options.

    Lookupdata  = GlueContext.create_dynamic_frame_from_options(connection_type="s3",connection_options = {"paths":[InputLookupDir]},format="csv",format_options={"withHeader": True,"separator": ",","quoteChar": '"',"escaper": '"'})
    

    Then read the data from catalog using as shown below :

    datasource0 = GlueContext.create_dynamic_frame.from_catalog(database = Glue_catalog_database, table_name = Glue_table_name, transformation_ctx = "datasource0")
    

    Then you can join two frames using join and then write this back to catalogue.

    [joined_dyf][1] = Join.apply(Lookupdata,Join.apply(datasource0, Lookupdata,'someidfrom_datasource0','someidfrom_Lookupdata')
    

    Now write this back to Catalogue as table refer to this for more information:

    sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
                               enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
                               partitionKeys=["partition_key0", "partition_key1"])
    sink.setFormat("<format>")
    sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
    sink.writeFrame(last_transform)