Tags: python, regex, date, parsing, pyspark

Pyspark - Parse dates between multiple forward slashes


I have a Spark dataframe with multiple columns, one of which contains URL strings that I want to parse dates from into a separate column. Given the following two rows:

'www.freelancer/hello/there/I/am/2024/01/03/every/woijf123oijroa.fiow.com'
'www.freelancer/camping/fun/2024/02/14/foijaoijf83747199.1.com'

Expected date output:

2024/01/03
2024/02/14
  • df.withColumn('date', split(col('website'), '/')[5]) doesn't work because the forward slashes don't follow a set pattern, and even if they did, indexing the result of split() returns only the single segment between two slashes rather than the full date, which spans multiple segments.

  • I tried using locate() to find the index at which the date starts and then pulling 10 characters from that index, but it didn't work reliably.


Solution

  • You could use the following regular expression:

    20[012]\d/\d{2}/\d{2}
    

    See a demo on regex101.com.
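    Applied in PySpark, the pattern can be passed to regexp_extract, which pulls the first match of a regex out of a string column (group index 0 returns the whole match). A minimal sketch using the two sample rows from the question (the session setup is illustrative):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract, col

    spark = SparkSession.builder.master("local[1]").appName("date-extract").getOrCreate()

    df = spark.createDataFrame(
        [("www.freelancer/hello/there/I/am/2024/01/03/every/woijf123oijroa.fiow.com",),
         ("www.freelancer/camping/fun/2024/02/14/foijaoijf83747199.1.com",)],
        ["website"],
    )

    # 20[012]\d matches years 2000-2029; group 0 is the entire matched date.
    df = df.withColumn("date", regexp_extract(col("website"), r"20[012]\d/\d{2}/\d{2}", 0))
    df.show(truncate=False)
    ```

    Unlike split(), this is indifferent to how many slashes precede the date, since the regex anchors on the shape of the date itself rather than on its position in the string.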