amazon-web-services amazon-s3 aws-glue

Using Glue to transform CSV files from an S3 bucket and save the transformed data to another S3 bucket


The objective is to transform CSV files from one S3 bucket and write the result to another S3 bucket, using Glue.

What I already tried:

  1. I created a CSV classifier.
  2. I created a crawler that scans the data arriving in the S3 bucket.

Where I am stuck:

  1. I cannot find out how to store the output in S3 again without saving it in RDS or another database service.

The Glue output configuration asks for a database target, which I don't have and don't want to use.

Is there any way to achieve this without another database system, using only S3 and Glue?

More Information

Sample single CSV file that I am trying to merge: [image]

Classifier with a delimiter of ";": [image]

Crawler configuration: [image]

Crawler result (no schema detected): [image]


Solution

  • I'm assuming that all the CSV files you want to merge have the same schema. You can write the same code in Glue that you would write in a local Spark deployment.

    Step 1: Read the data from the Data Catalog table

    # PySpark; glueContext is the GlueContext created at the start of the job script
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "database_name", table_name = "table_name", transformation_ctx = "datasource0")
    

    Step 2: Convert the datasource0 DynamicFrame to a Spark DataFrame

    df = datasource0.toDF()
    

    Step 3: Write the DataFrame to the target S3 bucket

    # "append" adds new files under the target path instead of overwriting existing output
    df.write.format("csv").mode("append").save("s3://target-s3-path/Output")
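
  • The steps above rely on a catalog table. Since the question reports that the crawler detected no schema, one alternative might be to skip the catalog and read the CSV files straight from S3, passing the ";" separator explicitly. Below is a minimal PySpark sketch of that idea, not a definitive implementation. The helper names (csv_source_args, csv_sink_args, copy_csv) are hypothetical and only build the arguments for glueContext.create_dynamic_frame.from_options and glueContext.write_dynamic_frame.from_options. It also assumes that the files have a header row, which the question does not confirm.

```python
# Sketch only: the helper names are hypothetical, not part of the Glue API.
# Assumptions: the CSV files have a header row and use ";" as the separator.


def csv_source_args(source_path, separator=";"):
    """Keyword arguments for glueContext.create_dynamic_frame.from_options."""
    return {
        "connection_type": "s3",
        # "recurse" also picks up files in sub-folders of source_path
        "connection_options": {"paths": [source_path], "recurse": True},
        "format": "csv",
        "format_options": {"separator": separator, "withHeader": True},
    }


def csv_sink_args(target_path, separator=";"):
    """Keyword arguments for glueContext.write_dynamic_frame.from_options."""
    return {
        "connection_type": "s3",
        "connection_options": {"path": target_path},
        "format": "csv",
        "format_options": {"separator": separator},
    }


def copy_csv(glue_context, source_path, target_path, separator=";"):
    """Read every CSV under source_path and write the rows under target_path."""
    frame = glue_context.create_dynamic_frame.from_options(
        **csv_source_args(source_path, separator)
    )
    # Any transformation of `frame` would go here, before the write.
    glue_context.write_dynamic_frame.from_options(
        frame=frame, **csv_sink_args(target_path, separator)
    )
    return frame
```

    In a Glue job script, the context is typically created with GlueContext(SparkContext.getOrCreate()), after which the call could look like copy_csv(glueContext, "s3://source-s3-path/Input/", "s3://target-s3-path/Output/"). The source path here is a placeholder. Neither side needs a database: both the source and the sink are plain S3 paths.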