bash awk sed grep cut

Extract a specific tag from the HTML output of a Python script


I have a program whose output I want to pipe through grep. The output looks something like this:

<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href=

and so on...

I run a python script:

./python.py | grep -Po '(?<=<cite>)([^</cite>])'

in order to grep everything between the <cite> tags...

Can you help me?


Solution

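  • You need to use the lookaround features properly. Your lookbehind is fine, but [^</cite>] is not a lookahead: it is a negated character class that matches a single character other than <, /, c, i, t, e or >. Use a non-greedy match followed by a real lookahead instead: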

    grep -Po "(?<=<cite>).*?(?=</cite>)"
    

    Example:

     echo '<cite>www.site.com/sdds/ass</cite>A-"><div Class="sa_mc"><div class="sb_tlst"><h3><a href=' | grep -Po "(?<=<cite>).*?(?=</cite>)"
     www.site.com/sdds/ass
    

    Disclaimer: Parsing XML/HTML with regular expressions is bad practice. You should probably use a parser such as xmllint instead.
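To show why the non-greedy quantifier in the pattern above matters, here is a minimal sketch that wraps the same grep call in a helper and runs it on a line containing two cite tags. The function name extract_cite and the sample line are made up for illustration, and the sketch assumes GNU grep built with PCRE support (the -P option).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text inside every <cite>...</cite> pair read from stdin.
# Assumes GNU grep with PCRE support (-P); -o prints only the matched part.
extract_cite() {
    grep -Po '(?<=<cite>).*?(?=</cite>)'
}

# Made-up sample line with two cite tags.
sample='<cite>a.com/x</cite> junk <cite>b.org/y</cite>'

# Non-greedy .*? stops at the first closing tag, so each tag is reported separately.
printf '%s\n' "$sample" | extract_cite
# a.com/x
# b.org/y

# Greedy .* runs to the last closing tag on the line and swallows the markup in between.
printf '%s\n' "$sample" | grep -Po '(?<=<cite>).*(?=</cite>)'
# a.com/x</cite> junk <cite>b.org/y
```

With the helper defined, the original pipeline becomes ./python.py | extract_cite. Because grep works line by line, a cite element whose opening and closing tags are on different lines will not be matched.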