from datetime import datetime
from pyspark.sql.functions import lit

abv_data = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://{}/{}".format(bucket, prefix)], "recurse": True, "groupFiles": "inPartition"},
    "csv",
    {"withHeader": True, "separator": "\t"})  # tab separator passed as a CSV format option
abv_df_1 = abv_data.toDF()
abv_df_2 = abv_df_1.withColumn("save_date", lit(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")))
conparms_r = glueContext.extract_jdbc_conf("reporting", catalog_id=None)
abv_df_2.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://rs_cluster:8192/rptg") \
    .option("dbtable", redshift_schema_table_output) \
    .option("user", conparms_r['user']) \
    .option("password", conparms_r['password']) \
    .option("aws_iam_role", "arn:aws:iam::123456789:role/redshift_admin_role") \
    .option("tempdir", args["TempDir"]) \
    .option("extracopyoptions", "DELIMITER '\t' IGNOREHEADER 1 DATEFORMAT AS 'YYYY-MM-DD'") \
    .mode("append") \
    .save()
The CSV is read with a tab delimiter, but when I add the column to the DataFrame it uses a comma delimiter, which causes the Redshift load to fail.
Is there a way to add the column with a tab delimiter, OR to change the delimiter on the entire DataFrame?
This isn't necessarily the way to do this, but here is what I ended up doing:
Bring the CSV in with a ',' separator:
glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://{}/{}".format(bucket, prefix)], "recurse": True, "groupFiles": "inPartition"},
    "csv",
    {"withHeader": True, "separator": ","})
Then split the first column on the tab character, put each piece of the split into its own column, and add the extra column at the same time (see the sketch below).
Drop the original first column, since it still holds the combined, tab-delimited values.
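A minimal sketch of that step, assuming the comma-read DynamicFrame above has been converted to a DataFrame (called raw_df here) so each tab-delimited row lands in its first column; raw_df, n_fields, and the generated column names are placeholders, not the original job's identifiers:

from datetime import datetime
from pyspark.sql.functions import col, lit, split

first_col = raw_df.columns[0]           # the combined, tab-delimited column
parts = split(col(first_col), "\t")     # array of the individual field values

# Hypothetical output column names; substitute the real header names here.
field_names = ["col_{}".format(i) for i in range(n_fields)]

split_df = raw_df
for i, name in enumerate(field_names):
    split_df = split_df.withColumn(name, parts.getItem(i))

# Add the extra save_date column and drop the original combined column.
split_df = split_df \
    .withColumn("save_date", lit(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"))) \
    .drop(first_col)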
This gives you a comma-separated DataFrame to load.