Search code examples
aws-glue

AWS Glue simple custom transformation with desired output column names


Suppose I have s3://mybucket/mydata/ that has csv files that have the following columns:

color,shape,quantity,cost

and the types are:

string,string,double,double

As a contrived example, suppose I want to transform the data and dump it to s3://mybucket/mydata-transformed/ by turning the strings into upper case, and adding 2 to the doubles. So a row of, say, red,circle,2,21.7 would get turned into RED,CIRCLE,4,23.7 in the output. The following code kinda does what I want (boiler plate code omitted), where the table "mydata" has already been created for the source bucket:

DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "my database", table_name = "mydata", transformation_ctx = "DataSource0")
ds_df = DataSource0.toDF()
ds_df.select("color","shape","quantity","cost").show()
ds_df1 = ds_df.select(upper(col('color')),upper(col('shape')),col('quantity')+2,col('cost')+2)
Transform0 = DynamicFrame.fromDF(ds_df1, glueContext, "Transform0")
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform0, connection_type = "s3", format = "json", connection_options = {"path": "s3://mybucket/mydata-transformed/", "partitionKeys": []}, transformation_ctx = "DataSink0")
job.commit()

Below is the resulting json for the sample data above:

{"upper(color)":"RED","upper(shape)":"CIRCLE","(quantity + 2)":4.0,"(cost + 2)":23.7}

The data has been correctly transformed. However, the column names are now "upper(color)","upper(shape)","(quantity + 2)","(cost + 2)". How do I make it so that the resulting column names are color,shape,quantity,cost?


Solution

  • To solve the problem you can use alias. Check the full example below:

    import pyspark.sql.functions as f
    
    jsonStr = """{  "color": "red", "shape": "square","quantity":4,"cost":"11.11" }"""
    df = spark.read.json(sc.parallelize([jsonStr]))
    df.show()
    
    +-----+-----+--------+------+
    |color| cost|quantity| shape|
    +-----+-----+--------+------+
    |  red|11.11|       4|square|
    +-----+-----+--------+------+
    
    
    ds_df1 = df.select(upper(col('color')).alias('color'),upper(col('shape')).alias('shape'),'quantity','cost')
    
    ds_df1.show()
    
    +-----+------+--------+-----+
    |color| shape|quantity| cost|
    +-----+------+--------+-----+
    |  RED|SQUARE|       4|11.11|
    +-----+------+--------+-----+