amazon-web-services amazon-s3 pyspark aws-glue aws-glue-spark

Remove the trailing delimiter from a .TXT file in PySpark

I have a file in S3, generated by a different system, that looks like this:

A1|~|B1|~|C1|~|D1|~|

A2|~|B2|~|C2|~|D2|~|

A3|~|B3|~|C3|~|D3|~|

A4|~|B4|~|C4|~|D4|~|

When reading this file in an AWS Glue PySpark script, I want to remove the trailing delimiter from each row. How can I do that?

The issue: when I convert this .TXT file to Parquet and specify '|~|' as the delimiter, an extra column is added at the end. This happens because every row in the source file ends with an extra |~| delimiter.

That is why I want to remove the trailing |~| delimiter from each row and then convert the file to Parquet.

Code:

input = sc.textFile("filename.TXT").map(lambda x: x.split('|~|')) 
df=spark.createDataFrame(input,list_of_colun_names)

Solution

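Because every line ends with |~|, split('|~|') returns an empty string as its last element, and that empty string becomes the extra column. All you have to do is drop that last element from the list your lambda function produces.

So you could change your lambda function to something like this: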

input = sc.textFile("filename.TXT").map(lambda x: x.split('|~|')[:-1])
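
Note that [:-1] drops the last field unconditionally, so a row that happens to lack the trailing delimiter would lose a real value. A slightly more defensive variant is sketched below in plain Python so it can be checked without a Spark session. The names split_row and DELIMITER are made up for this illustration and are not part of PySpark or Glue.

```python
# Hypothetical helper, not a PySpark or Glue API: it only wraps str.split.
DELIMITER = "|~|"


def split_row(line, delimiter=DELIMITER):
    """Split one line into fields, ignoring a single trailing delimiter."""
    # Drop any line ending that may still be attached to the row.
    line = line.rstrip("\r\n")
    # Strip the delimiter only when the row actually ends with it.
    if line.endswith(delimiter):
        line = line[: -len(delimiter)]
    return line.split(delimiter)


# Why the extra column appears: the plain split leaves an empty last element.
raw = "A1|~|B1|~|C1|~|D1|~|"
print(raw.split(DELIMITER))   # ['A1', 'B1', 'C1', 'D1', '']
print(split_row(raw))         # ['A1', 'B1', 'C1', 'D1']
```

In the script above, the helper would take the place of the lambda, i.e. sc.textFile("filename.TXT").map(split_row), assuming the function is defined in the same Glue script so Spark can ship it to the executors.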