I have a file where each line contains a string that represents one or more email addresses. Multiple addresses can be grouped inside curly braces as follows:
{name.surname, name2.surnam2}@something.edu
Which means both addresses name.surname@something.edu
and name2.surname2@something.edu
are valid (this format is commonly used in scientific papers).
Moreover, a single line can also contain curly brackets multiple times. Example:
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
results in:
a.b@uni.somewhere
c.d@uni.somewhere
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com
Any suggestion on how I can parse this format to extract all email addresses? I'm trying with regexes but I'm currently struggling.
Pyparsing is a PEG parser that gives you an embedded DSL to build up parsers that can read through expressions like this, with resulting code that is more readable (and maintainable) than regular expressions, and flexible enough to add afterthoughts (wait, some parts of the email can be in quotes?).
pyparsing uses '+' and '|' operators to build up your parser from smaller bits. It also supports named fields (similar to regex named groups) and parse-time callbacks. See how this all rolls together below:
import pyparsing as pp
LBRACE, RBRACE = map(pp.Suppress, "{}")
email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=',{}@')
# define a compressed email, and assign names to the separate parts
# for easier processing - luckily the default delimitedList delimiter is ','
compressed_email = (LBRACE
+ pp.Group(pp.delimitedList(email_part))('names')
+ RBRACE
+ '@'
+ email_part('trailing'))
# add a parse-time callback to expand the compressed emails into a list
# of constructed emails - note how the names are used
def expand_compressed_email(t):
return ["{}@{}".format(name, t.trailing) for name in t.names]
compressed_email.addParseAction(expand_compressed_email)
# some lists will just contain plain old uncompressed emails too
# Combine will merge the separate tokens into a single string
plain_email = pp.Combine(email_part + '@' + email_part)
# the complete list parser looks for a comma-delimited list of compressed
# or plain emails
email_list_parser = pp.delimitedList(compressed_email | plain_email)
pyparsing parsers come with a runTests
method to test your parser against various test strings:
tests = """\
# original test string
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
# a tricky email containing a quoted string
{x.y, z.k}@edu.com, "{a, b}"@domain.com
# just a plain email
plain_old_bob@uni.elsewhere
# mixed list of plain and compressed emails
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
"""
email_list_parser.runTests(tests)
Prints:
# original test string
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com']
# a tricky email containing a quoted string
{x.y, z.k}@edu.com, "{a, b}"@domain.com
['x.y@edu.com', 'z.k@edu.com', '"{a, b}"@domain.com']
# just a plain email
plain_old_bob@uni.elsewhere
['plain_old_bob@uni.elsewhere']
# mixed list of plain and compressed emails
{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com', 'plain_old_bob@uni.elsewhere']
DISCLOSURE: I am the author of pyparsing.