python regex pandas pyspark spark-koalas

PySpark based approach to inline regex matching like Pandas

I have a code snippet which works great in Pandas, however my data size quite high and Pandas consumes a lot of memory. This is where I am trying to have a solution based on either PySpark or Koalas since both are Spark based and Highly scalable. Since I am new to Spark, I'm not sure how regex and replace strings can be optimized of this scale.

My Code snippet:

pd_dataset['details_trunc'] = pd_dataset['details'].str.replace(r'[0-9]+GB? ', '', regex=True, flags=re.IGNORECASE).str.replace(r'[0-9]+MB?P?S? ', '', regex=True, flags=re.IGNORECASE).str.replace(r'[0-9]+\s?mins? ', '',regex=True,flags=re.IGNORECASE).str.replace(r"\(.*\)","").str.split("$").str[0].str.split('-').str[0].str.replace(r"\b[0-9]+\b", '', regex=True).str.split('fr').str[0].str.split('ends').str[0].str.split(':').str[0].str.strip()
pd_dataset['details_trunc'].replace(to_replace =r'Apple App Store.*$', value = 'Apple App Store', regex = True, inplace=True)
pd_dataset['details_trunc'].replace(to_replace =r'Google Play.*$', value = 'Google Play', regex = True, inplace=True)
pd_dataset['details_trunc'].replace('', 'NA', inplace=True)

EDIT 1

In Below table, details is the input and details_trunc is output

details_trunc	details	Class
Local Airtime Call Charge	Local Airtime Call Charge	AAB
Local Airtime Call Charge	Local Airtime Call Charge	AAB
Local Calls	Local Calls - Incoming 0.00	AAB
Local Calls	Local Calls - Outgoing 0.00	AAB
STD Call	STD Call - E STD 020 Call Mobile No.	AAB
v019 Call	v019 Call - $0.66	AAB
v019 Call	v019 Call - $8.80	AAB
v019 Call	v019 Call - Mobile No. $0.92	AAB
v019 Call	v019 Call - Mobile No. $0.25	AAB
v019 Call	v019 Call - Mobile No. $1.84	AAB
IDD Call	IDD 001 Call - E Mobile No.	AAB
IDD Call	IDD 001 Call - IDD 001 Call - S	AAB
Roaming Incoming Call	Roaming Incoming Call 193813	RRE
Roaming Incoming Call	Roaming Incoming Call 204459	RRE
Roaming Incoming Call	Roaming Outgoing Call 000911 Int'l Call	ILL
Roaming Incoming Call	Roaming Outgoing Call 000954 Int'l Call(S'pore)	INL
Roaming Incoming Call	Roaming Outgoing Call 001447 Int'l Call(S'pore)	INL
AutoRoam Rerouted IDD/STD Call	AutoRoam Rerouted IDD/STD Call - AutoRoam Rerouted IDD Call - D -(TSM: quantity set to 0 as counted under corresponding AutoRoam Call)	AAB
Local Mobile Data/GPRS Data	Local Mobile Data/GPRS Data (1GB = 1024MB; 1MB = 1024KB)	AAB
Local MMS	Local MMS (M1/StarHub)	BRQ
SmartMessage	SmartMessage (Local)	BRQ
Global SMS	Global SMS	AKK
Global SMS	Global SMS	AKK

Solution

If I correctly understood the first par replacement you could write it this way in pyspark :

from pyspark.sql import functions as F

df1 = df.withColumn(
    "details_trunc",
    F.split(
        F.regexp_replace(
            "details",
            r"[0-9]+MB?P?S? |\(.*\)|[0-9]+\s?mins? |[0-9]+GB? |\b[0-9]+\b",
            ""
        ),
        r"\-|fr|ends|:|\$"
    )[0]
)

Basically, here the function regexp_replace removes all corresponding matches. Then using split by regex \-|fr|ends|:|\$ and get first element.

For the second part, you can use rlike within when expression:

df2 = df1.withColumn(
    "details_trunc",
    F.when(
        F.col("details_trunc").rlike(r'Apple App Store.*$'), 'Apple App Store'
    ).when(
        F.col("details_trunc").rlike(r'Google Play.*$'), 'Google Play'
    ).when(
        F.col("details_trunc") == '', 'NA'
    ).otherwise(F.col("details_trunc"))
)

I haven't tested it for all cases but this is a good starting point on how you can rewrite your pandas logic into Pyspark.