How can I extract the text in an XML tag in Perl?


I'm trying to parse an XML file and extract the data I need from it.

For example:

<about>
    This is an XML file
    that I want to
    extract data from
</about>
<message>Hello, this is a message.</message>
<this>Blah</this>
<that>Blahh</that>
<person> 
    <name>Jack</name>
    <age>27</age>
    <email>[email protected]</email>
</person>

I'm having trouble getting the content within the <about> tags.

This is what I have so far:

/(<\w*>)[\s*]?([\s*]?.*)(<\/\w*>)/m

I'm simply trying to extract the tag name and the content, which is why the parentheses are there: ($tag = $1) =~ s/[<>]//g gets the tag name, and $tagcontent = $2 gets the tag's contents. I'm using \s for the whitespace characters (space, tab, newline), and the ? because the whitespace may or may not occur any number of times.

I tested this at http://www.regexe.com/ but had no luck getting it to match.


Solution

  • XML is not a regular language, so regular expressions cannot parse it reliably in the general case: nested or repeated elements, attributes, comments, and CDATA sections all break simple patterns. Use an XML parser instead. A parser handles any well-formed input and keeps working if the layout of the markup changes in the future.
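As a rough sketch of the parser route, the snippet below uses XML::LibXML. This assumes the module is installed from CPAN (it is not part of core Perl). The question's sample has no single root element, so it is wrapped here in a hypothetical <root> element to make it well-formed. The e-mail line is left out, and the hash name %text_of is chosen only for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;    # assumption: installed from CPAN, not a core module

# The sample from the question, wrapped in a hypothetical <root> element
# so that the document is well-formed.
my $xml = <<'XML';
<root>
<about>
    This is an XML file
    that I want to
    extract data from
</about>
<message>Hello, this is a message.</message>
<person>
    <name>Jack</name>
    <age>27</age>
</person>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Walk the element children of the root, recording tag name and text.
my %text_of;
for my $node ($doc->documentElement->findnodes('./*')) {
    my $content = $node->textContent;
    $content =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    $text_of{ $node->nodeName } = $content;
}

print "about: $text_of{about}\n";
print "name: ", $doc->findvalue('/root/person/name'), "\n";
```

Nested elements such as <person> are better addressed with an XPath expression, as in the last line, than by reading their combined text.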

    However, if you're absolutely sure of the format, you could get away with the following regex:

    /<(\w+)>\s*(.*?)\s*<\/\1>/s
    

    Explanation:

    • / - Starting delimiter
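    • <(\w+)> - The opening tag; the tag name is captured in group 1
    • \s* - Optional whitespace after the opening tag
    • (.*?) - The contents of the tag, matched lazily so the match stops at the first matching closing tag
    • \s* - Optional whitespace before the closing tag
    • <\/\1> - The closing tag. \1 is a backreference to whatever the first capturing group matched (the tag name).
    • /s - Ending delimiter, followed by the s modifier, which lets . match newlines so the content can span several lines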
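To show how the pattern might be applied, here is a minimal sketch that runs it over the flat elements of the question's sample. The g modifier is added so that every element is found, and the array name @pairs is only illustrative.

```perl
use strict;
use warnings;

# A subset of the sample input from the question (flat elements only).
my $xml = <<'XML';
<about>
    This is an XML file
    that I want to
    extract data from
</about>
<message>Hello, this is a message.</message>
<this>Blah</this>
<that>Blahh</that>
XML

# Collect [tag, content] pairs.
# /g finds every match in turn; /s lets . cross newlines.
my @pairs;
while ($xml =~ /<(\w+)>\s*(.*?)\s*<\/\1>/sg) {
    push @pairs, [$1, $2];
}

for my $pair (@pairs) {
    my ($tag, $content) = @$pair;
    print "$tag: $content\n";
}
```

Because $1 already holds the bare tag name, no extra substitution is needed to strip the angle brackets. For a nested element such as <person>, the captured content would be the raw inner markup, which is one reason a parser is the safer choice.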

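    Note that the s and m modifiers do different things. The s modifier makes . match newline characters as well, while the m modifier makes ^ and $ match at the start and end of every line rather than only at the start and end of the string. The pattern in the question uses m, which has no effect there because the pattern contains neither ^ nor $.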
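The difference between the two modifiers can be seen in a small sketch like the following. The test string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /s, . does not match "\n", so the match cannot span both lines.
my $plain = $text =~ /first.*second/ ? 1 : 0;
# With /s, . also matches "\n".
my $dotall = $text =~ /first.*second/s ? 1 : 0;

# Without /m, ^ anchors only at the very start of the string.
my $start_only = $text =~ /^second/ ? 1 : 0;
# With /m, ^ also anchors right after each newline.
my $multiline = $text =~ /^second/m ? 1 : 0;

print "dot without /s: $plain, with /s: $dotall\n";
print "^ without /m: $start_only, with /m: $multiline\n";
```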
