Tags: bash, shell, tags, xmllint, extract-value

How to search for a specific term within XML and extract a value from the matching tag


I am trying to figure out how to first search for a term within a specific tag (the article tag) and then retrieve the value of another tag inside that same article tag.

I can already retrieve the value of a specific tag. Given this XML, the xmllint command below it works:

<article>
    <author>
        <name>Example Name 1</name>
        <title>example title 2</title>
    </author>
    <title>article title 1</title>
    <publicationDate>2022-02-12</publicationDate>
    <text>blah1 blah1 blah1</text>
    <reference>10000</reference>
</article>
<article>
    <author>
        <name>Example Name 2</name>
        <title>example title 2</title>
    </author>
    <title>article title 1</title>
    <publicationDate>2022-02-13</publicationDate>
    <text>blah1 blah1 blah1</text>
    <reference>10001</reference>
</article>

xmllint --xpath "string(//title)" file.xml

But how can I search for a reference and then retrieve a value from within the matching article tags? The reference number will be different each time, and I need to extract the value from the article that has that specific reference.

Thank you for your help


Solution

  • If I understand your intention correctly, you should be able to parameterize your XPath search string using a bash variable containing the reference number that you are interested in. Note that I modified your example XML to be wrapped in <articles> tags, so you will need to adjust the XPath to match your actual XML structure.

    Script contents:

    #!/bin/bash
    
    ref_no=${1:-10001}
    src_xml=${2:-/tmp/foo/s.xml}
    
    title=$(xmllint --xpath "string(/articles/article[reference=${ref_no}]/title)" "${src_xml}")
    printf "Reference: %s, Title: %s\n" "${ref_no}" "${title}"
    

    Output:

    $ ./script 10000
    Reference: 10000, Title: article title 1
    
    $ ./script 10001
    Reference: 10001, Title: article title 2
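
    One case the script above does not handle: with the string(...) form, a reference that matches no article should yield an empty string rather than an error, so the script would print an empty title. A minimal sketch of a guard follows; print_title is a hypothetical helper name, not part of the original script, and the values passed to it here are stand-ins for the result of the xmllint call.

```shell
#!/bin/bash

# Hypothetical helper: print the result line, or complain on stderr
# and return 1 when the lookup came back empty.
print_title() {
    local ref_no=$1 title=$2
    if [ -z "${title}" ]; then
        printf 'No title found for reference %s\n' "${ref_no}" >&2
        return 1
    fi
    printf 'Reference: %s, Title: %s\n' "${ref_no}" "${title}"
}

# Stand-in values; in the script above, the title comes from xmllint.
print_title 10001 "article title 2"
print_title 99999 "" || echo "lookup failed"
```

    In the real script you would call print_title "${ref_no}" "${title}" in place of the final printf and exit with its return status.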
    

    For clarity, here is the test XML that I utilized:

    <articles>
    <article>
        <author>
            <name>Example Name 1</name>
            <title>example title 2</title>
        </author>
        <title>article title 1</title>
        <publicationDate>2022-02-12</publicationDate>
        <text>blah1 blah1 blah1</text>
        <reference>10000</reference>
    </article>
    <article>
        <author>
            <name>Example Name 2</name>
            <title>example title 2</title>
        </author>
        <title>article title 2</title>
        <publicationDate>2022-02-13</publicationDate>
        <text>blah1 blah1 blah1</text>
        <reference>10001</reference>
    </article>
    </articles>
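
    Since the question asks for "the value" belonging to a given reference, the same pattern can be extended to any child element, not just the title. The sketch below assumes numeric references and the <articles> wrapper shown above; xpath_for_field is a made-up helper name, and the file path is simply the default from the script.

```shell
#!/bin/bash

# Hypothetical helper: build an XPath that returns the text of child
# element $2 from the article whose <reference> equals $1.
# Assumes numeric references and an <articles> root element.
xpath_for_field() {
    printf 'string(/articles/article[reference=%s]/%s)' "$1" "$2"
}

src_xml=${1:-/tmp/foo/s.xml}   # same default path as the script above

if command -v xmllint >/dev/null 2>&1 && [ -r "${src_xml}" ]; then
    # Look up two different fields for the same reference.
    for field in title publicationDate; do
        value=$(xmllint --xpath "$(xpath_for_field 10001 "${field}")" "${src_xml}")
        printf '%s: %s\n' "${field}" "${value}"
    done
else
    # Without xmllint or the file, just show the expression that would run.
    xpath_for_field 10001 publicationDate
    echo
fi
```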
    

    Per the OP's question in the comments below, here is a variation for when the <reference> value is a string:

    Script contents:

    #!/bin/bash
    
    ref_no=${1:-a10001}
    src_xml=${2:-/tmp/s.xml}
    
    title=$(xmllint --xpath "//*[reference=\"${ref_no}\"]/title/text()" "${src_xml}")
    printf "Reference: %s, Title: %s\n" "${ref_no}" "${title}"
    

    Note that you have to escape the double quotes surrounding the ${ref_no} variable so that XPath sees it as a string literal, and then use the text() node test to extract the text from the element.

    Further, note that the source XML's second <reference> tag value was updated to 'a10001':

    <reference>a10001</reference>
    

    Output:

    $ ./script a10001
    Reference: a10001, Title: article title 2
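
    The backslash escaping in the string variation is only needed because the XPath sits inside a double-quoted shell string. XPath also accepts single-quoted string literals, so a variant like the following should behave the same without the escapes. This is a sketch rather than the answer's original code: xpath_for_string_ref is a made-up helper name, and it assumes the reference value itself contains no single quote.

```shell
#!/bin/bash

# Hypothetical helper: wrap the reference in single quotes so it is an
# XPath string literal; single quotes need no escaping inside a bash
# double-quoted string.
xpath_for_string_ref() {
    printf "//*[reference='%s']/title/text()" "$1"
}

ref_no=${1:-a10001}
src_xml=${2:-/tmp/s.xml}

if command -v xmllint >/dev/null 2>&1 && [ -r "${src_xml}" ]; then
    title=$(xmllint --xpath "$(xpath_for_string_ref "${ref_no}")" "${src_xml}")
    printf 'Reference: %s, Title: %s\n' "${ref_no}" "${title}"
else
    # Without xmllint or the file, print the expression for inspection.
    xpath_for_string_ref "${ref_no}"
    echo
fi
```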