I have a code snippet which works great in Pandas, however my data size quite high and Pandas consumes a lot of memory. This is where I am trying to have a solution based on either PySpark or Koalas since both are Spark based and Highly scalable. Since I am new to Spark, I'm not sure how regex and replace strings can be optimized of this scale.
My Code snippet:
pd_dataset['details_trunc'] = pd_dataset['details'].str.replace(r'[0-9]+GB? ', '', regex=True, flags=re.IGNORECASE).str.replace(r'[0-9]+MB?P?S? ', '', regex=True, flags=re.IGNORECASE).str.replace(r'[0-9]+\s?mins? ', '',regex=True,flags=re.IGNORECASE).str.replace(r"\(.*\)","").str.split("$").str[0].str.split('-').str[0].str.replace(r"\b[0-9]+\b", '', regex=True).str.split('fr').str[0].str.split('ends').str[0].str.split(':').str[0].str.strip()
pd_dataset['details_trunc'].replace(to_replace =r'Apple App Store.*$', value = 'Apple App Store', regex = True, inplace=True)
pd_dataset['details_trunc'].replace(to_replace =r'Google Play.*$', value = 'Google Play', regex = True, inplace=True)
pd_dataset['details_trunc'].replace('', 'NA', inplace=True)
EDIT 1
In Below table, details
is the input and details_trunc
is output
details_trunc | details | Class |
---|---|---|
Local Airtime Call Charge | Local Airtime Call Charge | AAB |
Local Airtime Call Charge | Local Airtime Call Charge | AAB |
Local Calls | Local Calls - Incoming 0.00 | AAB |
Local Calls | Local Calls - Outgoing 0.00 | AAB |
STD Call | STD Call - E STD 020 Call Mobile No. | AAB |
v019 Call | v019 Call - $0.66 | AAB |
v019 Call | v019 Call - $8.80 | AAB |
v019 Call | v019 Call - Mobile No. $0.92 | AAB |
v019 Call | v019 Call - Mobile No. $0.25 | AAB |
v019 Call | v019 Call - Mobile No. $1.84 | AAB |
IDD Call | IDD 001 Call - E Mobile No. | AAB |
IDD Call | IDD 001 Call - IDD 001 Call - S | AAB |
Roaming Incoming Call | Roaming Incoming Call 193813 | RRE |
Roaming Incoming Call | Roaming Incoming Call 204459 | RRE |
Roaming Incoming Call | Roaming Outgoing Call 000911 Int'l Call | ILL |
Roaming Incoming Call | Roaming Outgoing Call 000954 Int'l Call(S'pore) | INL |
Roaming Incoming Call | Roaming Outgoing Call 001447 Int'l Call(S'pore) | INL |
AutoRoam Rerouted IDD/STD Call | AutoRoam Rerouted IDD/STD Call - AutoRoam Rerouted IDD Call - D -(TSM: quantity set to 0 as counted under corresponding AutoRoam Call) | AAB |
Local Mobile Data/GPRS Data | Local Mobile Data/GPRS Data (1GB = 1024MB; 1MB = 1024KB) | AAB |
Local MMS | Local MMS (M1/StarHub) | BRQ |
SmartMessage | SmartMessage (Local) | BRQ |
Global SMS | Global SMS | AKK |
Global SMS | Global SMS | AKK |
If I correctly understood the first par replacement you could write it this way in pyspark :
from pyspark.sql import functions as F
df1 = df.withColumn(
"details_trunc",
F.split(
F.regexp_replace(
"details",
r"[0-9]+MB?P?S? |\(.*\)|[0-9]+\s?mins? |[0-9]+GB? |\b[0-9]+\b",
""
),
r"\-|fr|ends|:|\$"
)[0]
)
Basically, here the function regexp_replace
removes all corresponding matches. Then using split
by regex \-|fr|ends|:|\$
and get first element.
For the second part, you can use rlike
within when
expression:
df2 = df1.withColumn(
"details_trunc",
F.when(
F.col("details_trunc").rlike(r'Apple App Store.*$'), 'Apple App Store'
).when(
F.col("details_trunc").rlike(r'Google Play.*$'), 'Google Play'
).when(
F.col("details_trunc") == '', 'NA'
).otherwise(F.col("details_trunc"))
)
I haven't tested it for all cases but this is a good starting point on how you can rewrite your pandas logic into Pyspark.