I have an email column in a dataframe and I want to replace part of it with asterisks. I am unable to figure it out using PySpark functions.
My email column could be something like this:
What I want to achieve is this:
So essentially, apart from the first 2 characters and the last 2 characters before the @, I want the remaining part to be replaced by asterisks.
This is what I tried:
from pyspark.sql import functions as F
split_email = F.split(df.email_address, "@")
df = df.withColumn('email_part', split_email.getItem(0))
df = df.withColumn('start', df.email_part.substr(0,2))
df = df.withColumn('end', df.email_part.substr(-2,2))
F.expr("regexp_replace(email_part, email_part[email_part.index(start)+len(start):email_part.index(end)], '*')")
I think you can achieve this with the help of the following regular expression: (?<=.{2})\w+(?=.{2}@)
(?<=.{2}): Positive lookbehind for 2 characters
\w+: Any word characters
(?=.{2}@): Positive lookahead for 2 characters followed by a literal @
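As a quick sanity check of what this pattern captures, here is a minimal sketch that uses Python's built-in re module rather than Spark. That is an assumption made purely for illustration: Spark evaluates Java regular expressions, which handle these particular constructs the same way. The helper name middle_part is hypothetical.

```python
import re

# Same pattern as above: word characters that have two characters before them
# and exactly two characters between them and the "@".
PATTERN = re.compile(r"(?<=.{2})\w+(?=.{2}@)")

def middle_part(email):
    """Return the part that would be masked, or "" when nothing matches."""
    match = PATTERN.search(email)
    return match.group(0) if match else ""

for email in ["abc123@gmail.com", "123abc123@yahoo.com", "abcd@test.com"]:
    print(email, "->", repr(middle_part(email)))
```

Returning an empty string for a non-match mirrors what regexp_extract does when the pattern is not found.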
First use regexp_extract to extract this pattern from your string.
from pyspark.sql.functions import regexp_extract, regexp_replace
df = df.withColumn(
    "pattern",
    regexp_extract("email", r"(?<=.{2})\w+(?=.{2}@)", 0)
)
df.show()
#|              email|pattern|
#|   abc123@gmail.com|     c1|
#|123abc123@yahoo.com|  3abc1|
#|      abcd@test.com|       |
Then use regexp_replace to create a replacement of * of the same length.
df = df.withColumn(
    "replacement",
    regexp_replace("pattern", r"\w", "*")
)
df.show()
#|              email|pattern|replacement|
#|   abc123@gmail.com|     c1|         **|
#|123abc123@yahoo.com|  3abc1|      *****|
#|      abcd@test.com|       |           |
Next use regexp_replace again on the original email column using the derived pattern and replacement columns. To be safe, concat the lookbehind/lookaheads from the original pattern when doing the replacement. To do this, we will have to use expr in order to pass the column values as parameters.
from pyspark.sql.functions import expr
df = df.withColumn(
    "mod_email_col",
    expr("regexp_replace(email, concat('(?<=.{2})', pattern, '(?=.{2}@)'), replacement)")
)
df.show()
#|              email|pattern|replacement|      mod_email_col|
#|   abc123@gmail.com|     c1|         **|   ab**23@gmail.com|
#|123abc123@yahoo.com|  3abc1|      *****|12*****23@yahoo.com|
#|      abcd@test.com|       |           |      abcd@test.com|
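The reason for re-attaching the lookbehind/lookahead becomes visible when the middle part also occurs elsewhere in the address. Here is a minimal sketch with Python's re module (not Spark); the sample address abab12@gmail.com is made up for illustration.

```python
import re

email = "abab12@gmail.com"   # hypothetical address; its middle part is "ab"
pattern = "ab"               # what the extraction step would produce here
replacement = "**"

# Replacing the bare pattern also hits the leading "ab".
naive = re.sub(pattern, replacement, email)

# Anchoring with the original lookarounds restricts the match to the middle.
anchored = re.sub("(?<=.{2})" + pattern + "(?=.{2}@)", replacement, email)

print(naive)     # ****12@gmail.com
print(anchored)  # ab**12@gmail.com
```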
Finally drop the intermediate columns:
df = df.drop("pattern", "replacement")
df.show()
#|              email|      mod_email_col|
#|   abc123@gmail.com|   ab**23@gmail.com|
#|123abc123@yahoo.com|12*****23@yahoo.com|
#|      abcd@test.com|      abcd@test.com|
Note: I added one test case to show that this does nothing if the email address part is 4 characters or less.
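As an aside, the same masking can probably be done in a single replacement by matching one character at a time, which would avoid the intermediate columns. This is an alternative to the approach above, not part of it, sketched with Python's re module; the per-character pattern and the name mask_middle are my own assumptions. The pattern relies only on lookarounds that Java regular expressions also support, so it should carry over to regexp_replace, but that is untested here.

```python
import re

# One word character that has at least two characters before it and at least
# two characters between it and the "@".
PER_CHAR = re.compile(r"(?<=.{2})\w(?=.{2,}@)")

def mask_middle(email):
    """Replace every matching character with a single asterisk."""
    return PER_CHAR.sub("*", email)

for email in ["abc123@gmail.com", "123abc123@yahoo.com", "abcd@test.com"]:
    print(email, "->", mask_middle(email))
```

On the three sample addresses this gives the same results as the multi-step version. It may behave differently when the address part contains non-word characters such as dots.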
Update: Here are some ways you can handle edge cases where the email address part is less than 4 characters.
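The rules I am using, as reflected in the length checks in the code below:
- Address part longer than 5 characters: keep the first 2 and last 2 characters and mask the rest (the approach above).
- Address part of 4 or 5 characters: keep only the first and last characters.
- Address part of 3 or fewer characters: mask only the character immediately before the @.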
patA = "regexp_replace(email, concat('(?<=.{2})', pattern, '(?=.{2}@)'), replacement)"
patB = "regexp_replace(email, concat('(?<=.{1})', pattern, '(?=.{1}@)'), replacement)"
from pyspark.sql.functions import regexp_extract, regexp_replace
from pyspark.sql.functions import concat, expr, length, lit, split, when
df.withColumn("address_part", split("email", "@").getItem(0))\
    .withColumn(
        "pattern",
        when(
            length("address_part") > 5,
            regexp_extract("email", r"(?<=.{2})\w+(?=.{2}@)", 0)
        ).otherwise(
            regexp_extract("email", r"(?<=.{1})\w+(?=.{1}@)", 0)
        )
    ).withColumn(
        "replacement", regexp_replace("pattern", r"\w", "*")
    ).withColumn(
        "mod_email_col",
        when(
            length("address_part") > 5, expr(patA)
        ).when(
            length("address_part") > 3, expr(patB)
        ).otherwise(regexp_replace('email', r'\w(?=@)', '*'))
    ).drop("pattern", "replacement", "address_part").show()
#|           email|   mod_email_col|
#|abc123@gmail.com|ab**23@gmail.com|
#|  abcde@test.com|  a***e@test.com|
#|   abcd@test.com|   a**d@test.com|
#|     ab@test.com|     a*@test.com|
#|      a@test.com|      *@test.com|
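The length tiers can also be checked outside Spark. Below is a minimal plain-Python sketch that follows the same extract, build replacement, anchored replace sequence using the re module. The function name mask_email and the keep variable are hypothetical, and this is only a way to reason about the rules, not a replacement for the column expressions.

```python
import re

def mask_email(email):
    """Mask the part before the "@" using the same length tiers as above."""
    address_part = email.split("@")[0]
    if len(address_part) > 5:
        keep = 2  # keep the first two and last two characters
    elif len(address_part) > 3:
        keep = 1  # keep only the first and last character
    else:
        # Too short to keep both ends: mask the character before the "@".
        return re.sub(r"\w(?=@)", "*", email)

    lookbehind = "(?<=.{%d})" % keep
    lookahead = "(?=.{%d}@)" % keep

    # Step 1: extract the middle part.
    match = re.search(lookbehind + r"\w+" + lookahead, email)
    if match is None:
        return email
    pattern = match.group(0)

    # Step 2: build a run of asterisks of the same length.
    replacement = re.sub(r"\w", "*", pattern)

    # Step 3: replace, anchored by the same lookarounds.
    return re.sub(lookbehind + pattern + lookahead, replacement, email)

for email in ["abc123@gmail.com", "abcde@test.com", "abcd@test.com",
              "ab@test.com", "a@test.com"]:
    print(email, "->", mask_email(email))
```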