python, parsing

Parse annotated file


I'm trying to parse a file that contains annotations of person names in the form:

<name> James Gold 

</name> said to meet with <name> Mable Helen  </name> tomorrow night

I'm trying to do this with a Python regex, but it isn't working. I'm using:

annotation = re.findall(' <name>(.*)</name>', lines)

I want to recover all entries within the <name> tags, but the opening and closing tags can be on different lines. I tried concatenating all lines and removing the newline characters, but to no avail. Any ideas?


Solution

  • Assuming that it is just an annotated file and not an XML file (use Acorn's solution in that case), pass the re.DOTALL flag so that . also matches newlines, and make the quantifier non-greedy:

    >>> import re
    >>> src = """<name> James Gold
    ... </name> said to meet with <name> Mable Helen  </name> tomorrow night"""
    >>>
    >>> [s.strip() for s in re.findall(r'<name>(.*?)</name>', src, re.DOTALL)]
    ['James Gold', 'Mable Helen']
    

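    Then strip each result to drop the surrounding whitespace, including any newline captured inside the tag. Also, your regex used the greedy .* instead of the non-greedy .*?, so it consumed everything up to the last </name> tag. Note too that the leading space in ' <name>' prevents a match whenever <name> is not preceded by a space, for example at the very start of the text.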
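As a rough sketch of the same idea wrapped in a reusable function, something like the following might work. The names `NAME_RE` and `extract_names` are made up for illustration, and the whitespace collapsing goes slightly beyond the plain `strip()` above. The last lines show the greedy pattern for comparison.

```python
import re

# Non-greedy body; re.DOTALL lets "." match newlines as well.
NAME_RE = re.compile(r"<name>(.*?)</name>", re.DOTALL)


def extract_names(text):
    """Return the body of every <name>...</name> pair, with runs of
    whitespace (including newlines) collapsed to single spaces."""
    return [" ".join(body.split()) for body in NAME_RE.findall(text)]


src = """<name> James Gold

</name> said to meet with <name> Mable Helen  </name> tomorrow night"""

print(extract_names(src))  # ['James Gold', 'Mable Helen']

# For comparison: the greedy form runs on to the last closing tag,
# so it yields one long match instead of two names.
greedy = re.findall(r"<name>(.*)</name>", src, re.DOTALL)
print(len(greedy))  # 1
```

To use it on a file, read the whole file into one string first (for example with `open(path).read()`, where `path` is your file) rather than matching line by line, since a tag pair can span lines.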