Tags: regex, apache-spark, pyspark, delimiter

Delimiting pyspark .read.text() with regex


I'm trying to read a text file into a PySpark DataFrame. The file uses a varying number of spaces between fields, so a row could look something like:

Ryan A. Smith>>>Welder>>>>>>3200 Smith Street>>>>>99999

(The arrows stand in for runs of spaces.)

I need to split this into columns, but I don't know the right way to do it. The fields are always separated by at least two spaces, so a regex seems perfect. However, I can't find a way to apply one as a delimiter in PySpark.
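
For reference, here is roughly how I'm loading the file (the path people.txt is just a placeholder); each line ends up in a single string column, which I've renamed to col:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # read.text loads each line of the file into one string column named "value";
    # "people.txt" is a placeholder path
    df = spark.read.text("people.txt").withColumnRenamed("value", "col")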


Solution

  • We can try using split here to generate the columns you want. Since the fields are always separated by at least two spaces, we can split on the regex ' {2,}':

    from pyspark.sql.functions import split

    # split on runs of two or more spaces
    # (use '>+' instead to match the arrow-marked example above)
    parts = split(df['col'], r' {2,}')

    df_new = (df.withColumn('name', parts.getItem(0))
                .withColumn('occupation', parts.getItem(1))
                .withColumn('address', parts.getItem(2))
                .withColumn('number', parts.getItem(3)))


    This assumes the text you showed above is in a column named col.
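
  • As a quick end-to-end check, here is a self-contained sketch on the sample row (with runs of spaces in place of the arrows; the column name col and the four output names match the assumption above):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split

    spark = SparkSession.builder.getOrCreate()

    # sample row, with two or more spaces between fields
    df = spark.createDataFrame(
        [("Ryan A. Smith   Welder      3200 Smith Street     99999",)],
        ["col"],
    )

    parts = split(df["col"], r" {2,}")
    df_new = (df.withColumn("name", parts.getItem(0))
                .withColumn("occupation", parts.getItem(1))
                .withColumn("address", parts.getItem(2))
                .withColumn("number", parts.getItem(3)))

    df_new.select("name", "occupation", "address", "number").show(truncate=False)
    # +-------------+----------+-----------------+------+
    # |name         |occupation|address          |number|
    # +-------------+----------+-----------------+------+
    # |Ryan A. Smith|Welder    |3200 Smith Street|99999 |
    # +-------------+----------+-----------------+------+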