bash sed file-descriptor xmllint

Send command output back to previous subshell in pipe for processing


Given this HTML5 page, I want to process it with xmllint interactively, feeding xmllint's output back to the previous subshell in the pipe through a file descriptor.
This is meant to be applied to the xml2xpath open-source project.

How to reproduce: run the script snippet in "The problem" section.

Base command is:

(echo 'xpath //*'; echo "bye") | xmllint --shell html5.html

Which gives the source output to be processed:

/ > xpath //*
Object is a Node Set :
Set contains 346 nodes:
1  ELEMENT html
    default namespace href=http://www.w3.org/1999/xhtml
    ATTRIBUTE lang
      TEXT
        content=en
    ATTRIBUTE dir
      TEXT
        content=ltr
2  ELEMENT head
3  ELEMENT title
...
202  ELEMENT div
    default namespace href=http://www.w3.org/1999/xhtml
203  ELEMENT p
204  ELEMENT code
205  ELEMENT math
    default namespace href=http://www.w3.org/1998/Math/MathML
...
345  ELEMENT mo
346  ELEMENT mn
/ > bye

The goal is to join each line containing a namespace to the preceding line, show "n ELEMENT name" as "n name", and ignore the rest (and then send more commands to xmllint).
The following command produces the lines that are expected to reach the previous subshell:

(echo 'xpath //*' )| xmllint --shell $proj/git/xml2xpath/tests/resources/html5.html | \
sed -nEe '{ :a; $!N;s/^([0-9]{1,5}) *ELEMENT *([^ ]*)\n +(default)? ?namespace ([a-z]+)? ?href=([^=]+)/\1 \2 \3\4=\5/;ta; s/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/; /^[1-9]/ P;D }'

1 html default=http://www.w3.org/1999/xhtml
2 head
3 title
4 link
5 link
6 link
7 link
8 body
9 h1
10 h2
...
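To try the join without the project file, the same sed program can be run on a few hand-written lines. This is only a sketch: the sample input is a shortened stand-in for the xmllint output above, the sample function and result variable are illustrative names, and GNU sed is assumed (the program relies on GNU syntax for labels and braces).

```shell
#!/bin/bash
# Hand-written stand-in for the xmllint output above (indentation preserved).
sample() {
cat <<'EOF'
Set contains 2 nodes:
1  ELEMENT html
    default namespace href=http://www.w3.org/1999/xhtml
    ATTRIBUTE lang
2  ELEMENT head
/ > bye
EOF
}

# Same sed program as in the command above, applied to the sample lines.
result=$(sample | sed -nEe '{ :a; $!N;s/^([0-9]{1,5}) *ELEMENT *([^ ]*)\n +(default)? ?namespace ([a-z]+)? ?href=([^=]+)/\1 \2 \3\4=\5/;ta; s/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/; /^[1-9]/ P;D }')
printf '%s\n' "$result"
```

Note that the first substitution only matches when the namespace line starts with at least one space after the newline, so it depends on the indentation being intact.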

The problem
Sending the lines back to the subshell through a file descriptor does not join them correctly: the namespace info ends up as its own item in the arrns array (see the next code sample).
So reading from the file descriptor and processing with sed to fill an array is not working as expected. I am also trying to avoid post-processing, or parsing the file more than once, at this stage.

Best approach so far is:

#!/bin/bash
wget --no-clobber "https://www.w3.org/TR/XHTMLplusMathMLplusSVG/sample.xhtml" -O html5.html

fname='xff'
[ ! -p "$fname" ] && mkfifo "$fname"
exec 3<>"$fname"

cat /dev/null > tmp.log

stop='dir xxxxxxx'

function parse_line(){
    while read -r -u 3 xline; do 
        printf "%s\n" "$xline"
        if [ "$xline" == "/ > $stop" ]; then 
            break 
        fi
    done | sed -nEe '{ :a; $!N;s/^([0-9]{1,5}) *ELEMENT *([^ ]*)\n +(default)? ?namespace ([a-z]+)? ?href=([^=]+)/\1 \2 \3\4=\5/;ta; s/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/; /^[1-9]|namespace/ P;D }'
}

(
    echo 'xpath //*'
    echo "$stop"
    IFS=$'\n' read -r -d '' -a arrns < <(parse_line && printf '\0')
    
    # print to file temporarily for debugging and avoid sending to xmllint shell 
    printf "%s\n" "${arrns[@]}" >> tmp.log
    echo "OUT OF LOOP 1 ${#arrns[@]}" >> tmp.log
    echo "bye"
) | xmllint --shell html5.html >&3

exec 3>&-
rm xff
cat tmp.log
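
The line that fills arrns in the script above relies on read stopping at a NUL byte (the -d '' option), with the trailing printf '\0' supplying that terminator so that read returns success. Below is a minimal sketch of that idiom on fixed sample data, next to mapfile -t (bash 4.0 or later), which stores one line per element. The sample function and the array names are illustrative only.

```shell
#!/bin/bash
# Sample data standing in for the output of parse_line (illustrative only).
sample() {
    printf '%s\n' '1 html' 'default namespace href=http://www.w3.org/1999/xhtml' '2 head'
}

# Idiom from the script: split on newlines, read up to the NUL terminator.
IFS=$'\n' read -r -d '' -a arr_read < <(sample && printf '\0')

# Alternative: mapfile stores one line per element and drops the newline.
mapfile -t arr_map < <(sample)

echo "read:    ${#arr_read[@]} items"
echo "mapfile: ${#arr_map[@]} items"
```

Both arrays hold the same three items here. One difference to keep in mind is that the read form skips empty lines, because consecutive newlines count as a single separator, while mapfile keeps them as empty elements.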

Reading all lines from fd 3 into a variable and then applying sed gave the same result.

Contents of arrns as written to tmp.log (almost correct):

1 html
default namespace href=http://www.w3.org/1999/xhtml
2 head
3 title
4 link
5 link
...
239 math
default namespace href=http://www.w3.org/1998/Math/MathML
...
OUT OF LOOP 1 354

Lines 1 and 239 in the sample should look like this:

1 html default=http://www.w3.org/1999/xhtml
239 math default=http://www.w3.org/1998/Math/MathML

With a bit of processing, that would allow forwarding the following command to xmllint from the same subshell, setting namespaces as they appear in the document:

setns default=http://www.w3.org/1998/Math/MathML
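
The "bit of processing" could look roughly like the sketch below, which assumes the items are already in the joined "n name prefix=URI" form. The ns_commands helper is a hypothetical name, and the array contents are sample data copied from the expected output above.

```shell
#!/bin/bash
# Hypothetical helper: print one setns command for each item that carries a
# "prefix=URI" field after the index and the element name.
ns_commands() {
    local item nsdef
    for item in "$@"; do
        # Fields: index, element name, optional namespace definition.
        read -r _ _ nsdef <<< "$item"
        if [ -n "$nsdef" ]; then
            printf 'setns %s\n' "$nsdef"
        fi
    done
}

# Sample items in the joined format shown above.
arrns=(
    '1 html default=http://www.w3.org/1999/xhtml'
    '2 head'
    '239 math default=http://www.w3.org/1998/Math/MathML'
)
ns_commands "${arrns[@]}"
```

Inside the subshell that feeds xmllint, anything printed to stdout goes to the xmllint shell, so calling the helper there would send the setns lines as commands.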

Solution

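  • The while read loop does not reproduce the xmllint output exactly, which made the original sed command fail. The likely cause is that read, when IFS is not emptied, strips leading whitespace from each line, so the indented namespace lines no longer match the "\n +" part of the original expression.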

    while read -r -u 3 xline; do ... ; done
    

    I fixed the sed command after the loop, and now the output is correct:

    sed -E -e :a -e '/^[1-9]/,/^(default|namespace)/ { $!N;s/\n(default|namespace)/ \1/;ta }' \
    -e 's/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/' \
    -e 's/(default)? ?namespace( [a-z0-9]+)? ?href=([^=]+)/\1\2=\3/g' \
    -e '/^[1-9]/ P;D'
    

    The order of the expressions seems to matter:

    1- Join lines as expected
    sed -E -e :a -e '/^[1-9]/,/^(default|namespace)/ { $!N;s/\n(default|namespace)/ \1/;ta }'

    1  ELEMENT html
    default namespace href=http://www.w3.org/1999/xhtml
    

    Becomes

    1  ELEMENT html default namespace href=http://www.w3.org/1999/xhtml
    

    2- Handle *ELEMENT* lines
    -e 's/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/'

    Before: 1 ELEMENT html
    After : 1 html

    3- Handle *namespace* part of step 1 result
    -e 's/(default)? ?namespace( [a-z0-9]+)? ?href=([^=]+)/\1\2=\3/g'

    Before: 1 html default namespace href=http://www.w3.org/1999/xhtml
    After : 1 html default=http://www.w3.org/1999/xhtml

    4- Print lines starting with a number
    -e '/^[1-9]/ P;D'

    Wrong result:

    1 html
    default namespace href=http://www.w3.org/1999/xhtml
    2 head
    ...
    191 svg:svg
    namespace svg href=http://www.w3.org/2000/svg
    ...
    

    Correct result after fix:

    1 html default=http://www.w3.org/1999/xhtml
    2 head
    ...
    191 svg:svg svg=http://www.w3.org/2000/svg
    ...
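
The difference between the loop output and the direct xmllint output can be checked in isolation. The sketch below runs read on one indented sample line, once with the default IFS and once with an empty IFS for that command only. The variable names are illustrative.

```shell
#!/bin/bash
# One indented line in the form xmllint prints it (sample data).
line='    default namespace href=http://www.w3.org/1999/xhtml'

# Default IFS: leading and trailing whitespace is stripped.
read -r trimmed <<< "$line"

# Empty IFS for this read only: the line is kept verbatim.
IFS= read -r kept <<< "$line"

printf '[%s]\n' "$trimmed"
printf '[%s]\n' "$kept"
```

If that is indeed the cause, writing the loop as "while IFS= read -r -u 3 xline" should keep the indentation, so an expression that expects spaces after the newline could still match. The fixed command above avoids the issue from the other side by not requiring the indentation at all.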