I have written PySpark code to validate email addresses, but an invalid address is not being rejected. Instead, a valid-looking part of it is returned.
Input Email Address
alcaraz@[email protected]
Actual output
[email protected]
Expected output
"invalid email address"
Code tried
df1 = df.withColumn(df.columns[0], regexp_replace(lower(df.columns[0]), "^a-zA-Z0-9@\._\-| ", ""))
extract_expr = expr(
"regexp_extract_all(emails, '(\\\w+([\\\.-]?\\\w+)*@\\[A-Za-z\-\.]+([\\\.-]?\\\w+)*(\\\.\\\w{2,3})+)', 0)")
df2 = df1.withColumn(df.columns[0], extract_expr) \
.select(df.columns[0])
There are numerous "solutions" that claim to be a definitive regular expression for RFC 5322 conformance. Here's the one I use. It may not handle 100% of cases.
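import re
# RFC 5322-style pattern. It is lowercase only, so lowercase the input first.
expr = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
p = re.compile(expr)
for ema in ['[email protected]', 'alcaraz@[email protected]']:
    # fullmatch() requires the whole string to match; match() would accept a valid prefix
    v = 'valid' if p.fullmatch(ema) else 'invalid'
    print(f'{ema} is {v}')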
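To get the "invalid email address" text the question asks for, the check can be wrapped in a small helper. This is only a sketch based on the pattern above: the function name `validate_email`, the `re.IGNORECASE` flag, and the sample addresses are my own assumptions, not part of the original code.

```python
import re

# Pattern based on the one above, compiled case-insensitively (an assumption).
EMAIL_RE = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
    re.IGNORECASE,
)


def validate_email(value):
    """Return the trimmed address if the whole string matches, else a marker text."""
    if value is not None:
        candidate = value.strip()
        # fullmatch() rejects strings that merely contain or start with a valid address
        if EMAIL_RE.fullmatch(candidate):
            return candidate
    return "invalid email address"


# Hypothetical sample addresses, for illustration only
for sample in ["user@example.com", "user@example.com@example.org"]:
    print(sample, "->", validate_email(sample))
```

`re.fullmatch` requires the entire string to match, whereas `re.match` only anchors at the start, so a valid prefix of a malformed address would otherwise pass. The same idea should carry over to PySpark by anchoring the pattern with `^` and `$` and using `Column.rlike` inside `when(...).otherwise(...)` rather than extracting matches, although I have not tested this exact pattern against Spark's Java regex engine.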