Suppose I have s3://mybucket/mydata/ that has csv files that have the following columns:
color,shape,quantity,cost
and the types are:
string,string,double,double
As a contrived example, suppose I want to transform the data and dump it to s3://mybucket/mydata-transformed/ by turning the strings into upper case, and adding 2 to the doubles. So a row of, say, red,circle,2,21.7 would get turned into RED,CIRCLE,4,23.7 in the output. The following code kinda does what I want (boiler plate code omitted), where the table "mydata" has already been created for the source bucket:
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "my database", table_name = "mydata", transformation_ctx = "DataSource0")
ds_df = DataSource0.toDF()
ds_df.select("color","shape","quantity","cost").show()
ds_df1 = ds_df.select(upper(col('color')),upper(col('shape')),col('quantity')+2,col('cost')+2)
Transform0 = DynamicFrame.fromDF(ds_df1, glueContext, "Transform0")
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = Transform0, connection_type = "s3", format = "json", connection_options = {"path": "s3://mybucket/mydata-transformed/", "partitionKeys": []}, transformation_ctx = "DataSink0")
job.commit()
Below is the resulting json for the sample data above:
{"upper(color)":"RED","upper(shape)":"CIRCLE","(quantity + 2)":4.0,"(cost + 2)":23.7}
The data has been correctly transformed. However, the column names are now "upper(color)","upper(shape)","(quantity + 2)","(cost + 2)". How do I make it so that the resulting column names are color,shape,quantity,cost?
To solve the problem you can use alias
. Check the full example below:
import pyspark.sql.functions as f
jsonStr = """{ "color": "red", "shape": "square","quantity":4,"cost":"11.11" }"""
df = spark.read.json(sc.parallelize([jsonStr]))
df.show()
+-----+-----+--------+------+
|color| cost|quantity| shape|
+-----+-----+--------+------+
| red|11.11| 4|square|
+-----+-----+--------+------+
ds_df1 = df.select(upper(col('color')).alias('color'),upper(col('shape')).alias('shape'),'quantity','cost')
ds_df1.show()
+-----+------+--------+-----+
|color| shape|quantity| cost|
+-----+------+--------+-----+
| RED|SQUARE| 4|11.11|
+-----+------+--------+-----+