Sed - How to Print Regex Groups in Multi-Line?


Input file (test):

123456<a id="id1" name="name1" href="link1">This is link1</a>789<a id="id2"
href="link2">This is link2</a>0123

Desired output:

link1
link2

What I have tried:

$ sed -e '/<a/{:begin;/<\/a>/!{N;b begin};s/<a\([^<]*\)<\/a>/QQ/;/<a/b begin}' test
123456QQ789QQ0123

Question: How do I print the regex capture groups in sed when a match spans multiple lines?


Solution

  • If you use sed like this:

    sed -e '/<a/{:begin;/<\/a>/!{N;b begin};s/<a\([^<]*\)<\/a>/\n/;/<a/b begin}' test
    

    then each <a ...>...</a> element is replaced by a newline, so the remaining text is printed on separate lines:

    123456
    789
    0123
    

    But is this what you are trying to print? Or do you want to print the text in the href attributes?

    Update 1: To get the href values from well-formed <a ...>...</a> elements (GNU sed; -r enables extended regular expressions):

    sed -r '$!N; s~\n~~; s~(<a )~\n\1~ig; s~[^<]*<a[^>]*href\s*=\s*"([^"]*)"[^\n]*~\1\n~ig' test
    

    Output:

    link1
    link2
    

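    Update 2: Getting the same output using bash regex matching ([[ =~ ]] and BASH_REMATCH):

    regex='href="([^"]*)"'
    # -r keeps backslashes literal; IFS= preserves leading and trailing whitespace.
    # The extra test also handles a last line with no trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
       # Skip lines that contain no href attribute
       [[ $line =~ $regex ]] || continue
       # BASH_REMATCH[1] holds the text matched by the first capture group
       printf '%s\n' "${BASH_REMATCH[1]}"
    done < test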
    

    Output:

    link1
    link2
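
Both updates have limits: Update 1 joins lines only in pairs ($!N), and Update 2 prints at most one href per input line. For input where a tag may be split across any number of lines, one possible sketch is to read the whole file into the pattern space first and then print capture group 1 for each tag. The function name extract_hrefs is hypothetical. The sketch assumes GNU sed (it uses \n in the replacement, as the commands above do), a space after <a, and double-quoted href values with no spaces around the = sign.

```shell
#!/bin/sh
# extract_hrefs FILE (hypothetical helper name): print the href value of
# every <a ...> tag in FILE, one per line. Assumes GNU sed.
#
# Pass 1: append every line to the pattern space, turn the embedded
#         newlines into spaces, then start a new line before each "<a ".
# Pass 2: each tag now starts its own line; -n plus the p flag prints
#         only the text captured by \( ... \), referenced as \1.
extract_hrefs() {
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
        -e 's/\n/ /g' \
        -e 's/<a /\n&/g' "$1" |
    sed -n 's/^<a [^>]*href="\([^"]*\)".*/\1/p'
}
```

Calling extract_hrefs test on the sample file should print link1 and link2 on separate lines. Matching here is case-sensitive, unlike the i flag used in Update 1.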