Tags: regex, apache-spark, pyspark, delimiter

Delimiting pyspark .read.text() with regex


I'm trying to read a text file into a PySpark DataFrame. The file uses a varying number of spaces between fields, so a row could look something like:

Ryan A. Smith>>>Welder>>>>>>3200 Smith Street>>>>>99999

(The arrows stand in for runs of spaces.)

I need to split this into columns, but I don't know the right way to do it. The fields are always separated by at least two spaces, so a regex seems perfect. However, I can't find a way to apply one as a delimiter in PySpark.
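
For reference, here is roughly how I'm loading the file (the path people.txt is just a placeholder); each line ends up in a single string column, which I've renamed to col:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # read.text loads each line of the file into one string column named "value";
    # "people.txt" is a placeholder path
    df = spark.read.text("people.txt").withColumnRenamed("value", "col")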


Solution

  • We can try using split here to generate the columns you want. Since the fields are always separated by at least two spaces, we can split on the regex ' {2,}':

    from pyspark.sql.functions import split

    # split on runs of two or more spaces
    # (use '>+' instead to match the arrow-marked example above)
    parts = split(df['col'], r' {2,}')

    df_new = (df.withColumn('name', parts.getItem(0))
                .withColumn('occupation', parts.getItem(1))
                .withColumn('address', parts.getItem(2))
                .withColumn('number', parts.getItem(3)))


    This assumes the text you showed above is in a column named col.
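
  • As a quick end-to-end check, here is a self-contained sketch on the sample row (with runs of spaces in place of the arrows; the column name col and the four output names match the assumption above):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split

    spark = SparkSession.builder.getOrCreate()

    # sample row, with two or more spaces between fields
    df = spark.createDataFrame(
        [("Ryan A. Smith   Welder      3200 Smith Street     99999",)],
        ["col"],
    )

    parts = split(df["col"], r" {2,}")
    df_new = (df.withColumn("name", parts.getItem(0))
                .withColumn("occupation", parts.getItem(1))
                .withColumn("address", parts.getItem(2))
                .withColumn("number", parts.getItem(3)))

    df_new.select("name", "occupation", "address", "number").show(truncate=False)
    # +-------------+----------+-----------------+------+
    # |name         |occupation|address          |number|
    # +-------------+----------+-----------------+------+
    # |Ryan A. Smith|Welder    |3200 Smith Street|99999 |
    # +-------------+----------+-----------------+------+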