I have text in a pyspark column called TEXT
that look like below:
The sky is red. I have 2 apples and I am fine.
----------------------------------------------
The sky is back. I have 8 apples or I am fine.
----------------------------------------------
The sky is back. she has 8 apples and I am fine.
----------------------------------------------
The hill is red. I have 3 apples and I am fine
.
.
.
I want to create Regx pattern and use F.regexp_extract
function to extract the number of apples in this column. However, I am restricted in several ways:
The sky
I have
and
.Thus, I would like my Regx pattern to only extract the number of apples from the first sentence (i.e., 2 apples)
This is the Regx pattern that I came up with: regex_pattern = (?<=The sky.*?)(?<=I have )(.*?)(?= apples and)
It perfectly works via this website http://regexstorm.net/tester.
However, when I use in my Pyspark database as:
df.withColumn('Number_Apples', F.regexp_extract(F.col("TEXT"), regex_pattern, 0))
It raise the following Error:
SparkRuntimeException: [INVALID_PARAMETER_VALUE] The value of parameter(s) 'regexp' in regexp_extract is invalid
Can anyone let me know please how can I modify the Regx pattern to compile with Pyspark functionality?
This is my code example which raise the Error:
import pyspark.sql.functions as F
from pyspark.sql import DataFrame as SparkDataFrame
regex_pattern = '(?<=The sky.*?)(?<=I have )(.*?)(?= apples and)'
df = spark.createDataFrame(
[
('The sky is red. I have 2 apples and I am fine.', 2),
('The sky is back. I have 8 apples or I am fine.', 8),
('The sky is back. she has 8 apples and I am fine.', 8),
('The hill is red. I have 3 apples and I am fine', 3),
],
["TEXT", "APPLES"] # add your column names here
)
df = df.withColumn('Number_Apples', F.regexp_extract(F.col("TEXT"), regex_pattern, 0))
df.display()
First, https://pythex.org/ works better for me when I am working on Python.
If you put your regex on this site, you see your regex is invalid because look behind must have the fixed length, so you cannot have ".*" in look behind.
Also, regexp_extract
is 1-indexed. If you want to extract the 1st group, you need to pass 1
for the last argument.
regex_pattern = '(?<=The sky).*(?<=I have )(.*)(?= apples and)'
df = df.withColumn('Number_Apples', F.regexp_extract(F.col("TEXT"), regex_pattern, 1))