Search code examples
pythonregexpandaspysparkspark-koalas

PySpark based approach to inline regex matching like Pandas


I have a code snippet which works great in Pandas, however my data size quite high and Pandas consumes a lot of memory. This is where I am trying to have a solution based on either PySpark or Koalas since both are Spark based and Highly scalable. Since I am new to Spark, I'm not sure how regex and replace strings can be optimized of this scale.

My Code snippet:

pd_dataset['details_trunc'] = pd_dataset['details'].str.replace(r'[0-9]+GB? ', '', regex=True, flags=re.IGNORECASE).str.replace(r'[0-9]+MB?P?S? ', '', regex=True, flags=re.IGNORECASE).str.replace(r'[0-9]+\s?mins? ', '',regex=True,flags=re.IGNORECASE).str.replace(r"\(.*\)","").str.split("$").str[0].str.split('-').str[0].str.replace(r"\b[0-9]+\b", '', regex=True).str.split('fr').str[0].str.split('ends').str[0].str.split(':').str[0].str.strip()
pd_dataset['details_trunc'].replace(to_replace =r'Apple App Store.*$', value = 'Apple App Store', regex = True, inplace=True)
pd_dataset['details_trunc'].replace(to_replace =r'Google Play.*$', value = 'Google Play', regex = True, inplace=True)
pd_dataset['details_trunc'].replace('', 'NA', inplace=True)

EDIT 1

In Below table, details is the input and details_trunc is output

details_trunc details Class
Local Airtime Call Charge Local Airtime Call Charge AAB
Local Airtime Call Charge Local Airtime Call Charge AAB
Local Calls Local Calls - Incoming 0.00 AAB
Local Calls Local Calls - Outgoing 0.00 AAB
STD Call STD Call - E STD 020 Call Mobile No. AAB
v019 Call v019 Call - $0.66 AAB
v019 Call v019 Call - $8.80 AAB
v019 Call v019 Call - Mobile No. $0.92 AAB
v019 Call v019 Call - Mobile No. $0.25 AAB
v019 Call v019 Call - Mobile No. $1.84 AAB
IDD Call IDD 001 Call - E Mobile No. AAB
IDD Call IDD 001 Call - IDD 001 Call - S AAB
Roaming Incoming Call Roaming Incoming Call 193813 RRE
Roaming Incoming Call Roaming Incoming Call 204459 RRE
Roaming Incoming Call Roaming Outgoing Call 000911 Int'l Call ILL
Roaming Incoming Call Roaming Outgoing Call 000954 Int'l Call(S'pore) INL
Roaming Incoming Call Roaming Outgoing Call 001447 Int'l Call(S'pore) INL
AutoRoam Rerouted IDD/STD Call AutoRoam Rerouted IDD/STD Call - AutoRoam Rerouted IDD Call - D -(TSM: quantity set to 0 as counted under corresponding AutoRoam Call) AAB
Local Mobile Data/GPRS Data Local Mobile Data/GPRS Data (1GB = 1024MB; 1MB = 1024KB) AAB
Local MMS Local MMS (M1/StarHub) BRQ
SmartMessage SmartMessage (Local) BRQ
Global SMS Global SMS AKK
Global SMS Global SMS AKK

Solution

  • If I correctly understood the first par replacement you could write it this way in pyspark :

    from pyspark.sql import functions as F
    
    df1 = df.withColumn(
        "details_trunc",
        F.split(
            F.regexp_replace(
                "details",
                r"[0-9]+MB?P?S? |\(.*\)|[0-9]+\s?mins? |[0-9]+GB? |\b[0-9]+\b",
                ""
            ),
            r"\-|fr|ends|:|\$"
        )[0]
    )
    

    Basically, here the function regexp_replace removes all corresponding matches. Then using split by regex \-|fr|ends|:|\$ and get first element.

    For the second part, you can use rlike within when expression:

    df2 = df1.withColumn(
        "details_trunc",
        F.when(
            F.col("details_trunc").rlike(r'Apple App Store.*$'), 'Apple App Store'
        ).when(
            F.col("details_trunc").rlike(r'Google Play.*$'), 'Google Play'
        ).when(
            F.col("details_trunc") == '', 'NA'
        ).otherwise(F.col("details_trunc"))
    )
    

    I haven't tested it for all cases but this is a good starting point on how you can rewrite your pandas logic into Pyspark.