regex, regex-greedy

How to extract the data from a consecutive xml tag attribute based on the previous tag value


I have trouble getting my regex right for the below use case.

<LOB>
    <LOBStatus>
        <LOBStatusInfo>
            <LOB>Mobile</LOB>
            <Status>Active</Status>
        </LOBStatusInfo>
        <LOBStatusInfo>
            <LOB>Voice</LOB>
            <Status>Active</Status>
        </LOBStatusInfo>
        <LOBStatusInfo>
            <LOB>Internet</LOB>
            <Status>Disconnect</Status>
        </LOBStatusInfo>
    </LOBStatus>
</LOB>

In the above XML, I'm looking to extract only the status corresponding to Voice (which is active).

So far, I was able to get the LOB itself, but not the corresponding status.

P.S. I'm a newbie, so please pardon me if the details aren't enough.


Solution

  • Don't parse XML with regex; see: Using regular expressions with HTML tags. Instead, use an XPath query with a proper XML parser. What is your environment and language?

    Test:

    Input file

     <LOB>
        <LOBStatus>
            <LOBStatusInfo>
                <LOB>Mobile</LOB>
                <Status>Active</Status>
            </LOBStatusInfo>
            <LOBStatusInfo>
                <LOB>Voice</LOB>
                <Status>Active</Status>
            </LOBStatusInfo>
            <LOBStatusInfo>
                <LOB>Internet</LOB>
                <Status>Disconnect</Status>
            </LOBStatusInfo>
        </LOBStatus>
    </LOB>
    

    Command

    (just an example, here in the shell; the same XPath query can be used in any language of your choice)

    xmllint --xpath '//LOB[text()="Voice"]/../Status/text()' file.xml
    

    or

    xmllint --xpath '//LOB[text()="Voice"]/following-sibling::Status/text()' file.xml
    

    Output:

    Active
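
Since the answer notes that the query can be used from any language, here is a minimal Python sketch of the same lookup using the standard library's xml.etree.ElementTree. This is an illustration, not part of the original answer: the helper name status_for and the inline XML constant are hypothetical. ElementTree implements only a subset of XPath, so the expression is rephrased as "the LOBStatusInfo whose LOB child is Voice, then its Status child" rather than copied verbatim from the xmllint commands.

```python
import xml.etree.ElementTree as ET

# Same document as the input file above, inlined to keep the sketch self-contained.
XML = """\
<LOB>
    <LOBStatus>
        <LOBStatusInfo>
            <LOB>Mobile</LOB>
            <Status>Active</Status>
        </LOBStatusInfo>
        <LOBStatusInfo>
            <LOB>Voice</LOB>
            <Status>Active</Status>
        </LOBStatusInfo>
        <LOBStatusInfo>
            <LOB>Internet</LOB>
            <Status>Disconnect</Status>
        </LOBStatusInfo>
    </LOBStatus>
</LOB>
"""


def status_for(xml_text, lob_name):
    """Return the Status text for the given LOB name, or None if absent.

    Hypothetical helper for illustration; lob_name is assumed not to
    contain quote characters, since it is interpolated into the path.
    """
    root = ET.fromstring(xml_text)
    # [LOB='...'] matches a LOBStatusInfo element that has a LOB child
    # with exactly that text; /Status then selects its Status child.
    node = root.find(f".//LOBStatusInfo[LOB='{lob_name}']/Status")
    return node.text if node is not None else None


if __name__ == "__main__":
    print(status_for(XML, "Voice"))
```

Matching on the enclosing LOBStatusInfo element ties the status to its LOB by document structure, not by text position, which is what a regex cannot guarantee when the tag order or whitespace changes.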