Search code examples
pythonregexpyspark

Regx pattern for Pyspark: match start and middle of a text and extract the middle


I have text in a pyspark column called TEXT that look like below:

The sky is red. I have 2 apples and I am fine.
----------------------------------------------
The sky is back. I have 8 apples or I am fine.
----------------------------------------------
The sky is back. she has 8 apples and I am fine.
----------------------------------------------
The hill is red. I have 3 apples and I am fine
.
.
.

I want to create Regx pattern and use F.regexp_extract function to extract the number of apples in this column. However, I am restricted in several ways:

  1. Only interested in sentences that start with The sky
  2. Only interested in the number of apples after I have
  3. Only interested in the number of apples that continues by and.

Thus, I would like my Regx pattern to only extract the number of apples from the first sentence (i.e., 2 apples) This is the Regx pattern that I came up with: regex_pattern = (?<=The sky.*?)(?<=I have )(.*?)(?= apples and) It perfectly works via this website http://regexstorm.net/tester.

However, when I use in my Pyspark database as: df.withColumn('Number_Apples', F.regexp_extract(F.col("TEXT"), regex_pattern, 0))

It raise the following Error: SparkRuntimeException: [INVALID_PARAMETER_VALUE] The value of parameter(s) 'regexp' in regexp_extract is invalid

Can anyone let me know please how can I modify the Regx pattern to compile with Pyspark functionality?

This is my code example which raise the Error:

import pyspark.sql.functions as F
from pyspark.sql import DataFrame as SparkDataFrame


regex_pattern = '(?<=The sky.*?)(?<=I have )(.*?)(?= apples and)'

df = spark.createDataFrame(
    [
        ('The sky is red. I have 2 apples and I am fine.', 2),
        ('The sky is back. I have 8 apples or I am fine.', 8),
        ('The sky is back. she has 8 apples and I am fine.', 8), 
        ('The hill is red. I have 3 apples and I am fine', 3),
    ],
    ["TEXT", "APPLES"]  # add your column names here
)

df = df.withColumn('Number_Apples', F.regexp_extract(F.col("TEXT"), regex_pattern, 0))

df.display()

Solution

  • First, https://pythex.org/ works better for me when I am working on Python.
    If you put your regex on this site, you see your regex is invalid because look behind must have the fixed length, so you cannot have ".*" in look behind.

    Also, regexp_extract is 1-indexed. If you want to extract the 1st group, you need to pass 1 for the last argument.

    regex_pattern = '(?<=The sky).*(?<=I have )(.*)(?= apples and)'
    
    df = df.withColumn('Number_Apples', F.regexp_extract(F.col("TEXT"), regex_pattern, 1))