I'm trying to read a text file into a PySpark dataframe. The text file has a varying number of spaces between fields. So a row could be something like:
Ryan A. Smith>>>Welder>>>>>>3200 Smith Street>>>>>99999
With spaces instead of arrows.
I need to delimit this, but I don't know the command to do so. I know the fields are always separated by at least 2 spaces, so a regex seems perfect. However, I can't find a way to do this in PySpark.
We can try using split from pyspark.sql.functions here to generate the columns you want. Since the fields are always separated by at least 2 spaces, the regex ' {2,}' splits on those runs while leaving single spaces (e.g. inside the name) intact:

from pyspark.sql.functions import split

# Split the raw line on runs of 2 or more spaces
parts = split(df['col'], ' {2,}')

df_new = (df
    .withColumn('name', parts.getItem(0))
    .withColumn('occupation', parts.getItem(1))
    .withColumn('address', parts.getItem(2))
    .withColumn('number', parts.getItem(3)))

This assumes the text you showed above is in a column named col.
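As a quick sanity check outside Spark, you can try the same pattern with Python's re module, since Spark's split uses a Java regex that behaves the same way for this pattern (a plain-Python sketch on a made-up row, not Spark code):

import re

# Sample row where fields are separated by runs of 2+ spaces
row = "Ryan A. Smith   Welder      3200 Smith Street     99999"

# Split on two or more consecutive spaces; the single spaces inside
# "Ryan A. Smith" and "3200 Smith Street" are preserved
fields = re.split(r' {2,}', row)
print(fields)  # ['Ryan A. Smith', 'Welder', '3200 Smith Street', '99999']
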