Tags: python, parsing, email-address

Extract email addresses from academic curly braces format


I have a file where each line contains a string that represents one or more email addresses. Multiple addresses can be grouped inside curly braces as follows:

{name.surname, name2.surname2}@something.edu

This means that both name.surname@something.edu and name2.surname2@something.edu are valid addresses (this format is commonly used in scientific papers).

A single line can also contain several brace groups. For example:

{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com

results in:

a.b@uni.somewhere 
c.d@uni.somewhere 
e.f@uni.somewhere
x.y@edu.com
z.k@edu.com

Any suggestions on how to parse this format and extract all the email addresses? I have been trying with regexes, but I'm struggling.


Solution

  • Pyparsing is a PEG parsing library that gives you an embedded DSL for building parsers that can read expressions like this. The resulting code is more readable (and maintainable) than regular expressions, and flexible enough to accommodate afterthoughts (wait, some parts of the email can be in quotes?).

    Pyparsing uses the '+' (sequence) and '|' (alternative) operators to build up your parser from smaller pieces. It also supports named fields (similar to regex named groups) and parse-time callbacks. Here is how it all fits together:

    import pyparsing as pp
    
    LBRACE, RBRACE = map(pp.Suppress, "{}")
    email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=',{}@')
    
    # define a compressed email, and assign names to the separate parts
    # for easier processing - luckily the default delimitedList delimiter is ','
    compressed_email = (LBRACE 
                        + pp.Group(pp.delimitedList(email_part))('names')
                        + RBRACE
                        + '@' 
                        + email_part('trailing'))
    
    # add a parse-time callback to expand the compressed emails into a list
    # of constructed emails - note how the names are used
    def expand_compressed_email(t):
        return ["{}@{}".format(name, t.trailing) for name in t.names]
    compressed_email.addParseAction(expand_compressed_email)
    
    # some lists will just contain plain old uncompressed emails too
    # Combine will merge the separate tokens into a single string
    plain_email = pp.Combine(email_part + '@' + email_part)
    
    # the complete list parser looks for a comma-delimited list of compressed 
    # or plain emails
    email_list_parser = pp.delimitedList(compressed_email | plain_email)
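
    To call the parser on a single line from your own code, something like the following should work. It is only a minimal sketch: the parser definition is repeated so the snippet runs on its own, extract_emails is a hypothetical helper name (not part of pyparsing), and passing parseAll=True is an assumption that a line with unparseable trailing text should raise a ParseException rather than be silently cut short.

```python
import pyparsing as pp

# Same parser as above, repeated so this snippet is self-contained.
LBRACE, RBRACE = map(pp.Suppress, "{}")
email_part = pp.quotedString | pp.Word(pp.printables, excludeChars=',{}@')

compressed_email = (LBRACE
                    + pp.Group(pp.delimitedList(email_part))('names')
                    + RBRACE
                    + '@'
                    + email_part('trailing'))

def expand_compressed_email(t):
    # Build one address per name in the brace group.
    return ["{}@{}".format(name, t.trailing) for name in t.names]

compressed_email.addParseAction(expand_compressed_email)

plain_email = pp.Combine(email_part + '@' + email_part)
email_list_parser = pp.delimitedList(compressed_email | plain_email)

def extract_emails(line):
    """Hypothetical helper: return the expanded addresses of one line as a plain list."""
    # parseAll=True (an assumption) makes the parser reject leftover input.
    return email_list_parser.parseString(line, parseAll=True).asList()

emails = extract_emails("{a.b, c.d}@uni.somewhere, bob@edu.com")
print(emails)
```

    For a whole file, the same helper could be applied line by line, catching pp.ParseException for lines that do not fit the format.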
    

    Pyparsing parsers come with a runTests method for trying your parser against a batch of test strings:

    tests = """\
        # original test string
        {a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
    
        # a tricky email containing a quoted string
        {x.y, z.k}@edu.com, "{a, b}"@domain.com
    
        # just a plain email
        plain_old_bob@uni.elsewhere
    
        # mixed list of plain and compressed emails
        {a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
    """
    
    email_list_parser.runTests(tests)
    

    Prints:

    # original test string
    {a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com
    ['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com']
    
    # a tricky email containing a quoted string
    {x.y, z.k}@edu.com, "{a, b}"@domain.com
    ['x.y@edu.com', 'z.k@edu.com', '"{a, b}"@domain.com']
    
    # just a plain email
    plain_old_bob@uni.elsewhere
    ['plain_old_bob@uni.elsewhere']
    
    # mixed list of plain and compressed emails
    {a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com, plain_old_bob@uni.elsewhere
    ['a.b@uni.somewhere', 'c.d@uni.somewhere', 'e.f@uni.somewhere', 'x.y@edu.com', 'z.k@edu.com', 'plain_old_bob@uni.elsewhere']
    

    DISCLOSURE: I am the author of pyparsing.
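
    For comparison, since the question mentions regexes, here is a rough standard-library sketch of the simple case. The pattern and the expand_line name are illustrative assumptions, not something from the question or from pyparsing. It assumes that names and domains never contain commas, braces, '@' or whitespace, and it does not handle quoted parts such as "{a, b}"@domain.com, which is the kind of afterthought that is easier to add to the pyparsing version.

```python
import re

# Either a brace group "{n1, n2}@domain" or a plain "name@domain".
# Assumption: no commas, braces, '@' or whitespace inside a name or domain.
TOKEN = re.compile(
    r"\{(?P<names>[^{}]*)\}@(?P<domain>[^\s,{}@]+)"
    r"|(?P<plain>[^\s,{}@]+@[^\s,{}@]+)"
)

def expand_line(line):
    """Hypothetical helper: expand every brace group on a line into full addresses."""
    emails = []
    for match in TOKEN.finditer(line):
        if match.group("plain"):
            # A plain address is kept unchanged.
            emails.append(match.group("plain"))
        else:
            # Attach the shared domain to each comma-separated name.
            domain = match.group("domain")
            for name in match.group("names").split(","):
                name = name.strip()
                if name:
                    emails.append(f"{name}@{domain}")
    return emails

result = expand_line("{a.b, c.d, e.f}@uni.somewhere, {x.y, z.k}@edu.com")
print(result)
```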