Given this html5 page, I want to process it with xmllint interactively, feeding its output back to the preceding subshell through a file descriptor.
To be applied on the xml2xpath OS project.
How to reproduce: run the script snippet in the "The problem" section.
Base command is:
(echo 'xpath //*'; echo "bye") | xmllint --shell html5.html
Which gives the source output to be processed:
/ > xpath //*
Object is a Node Set :
Set contains 346 nodes:
1 ELEMENT html
default namespace href=http://www.w3.org/1999/xhtml
ATTRIBUTE lang
TEXT
content=en
ATTRIBUTE dir
TEXT
content=ltr
2 ELEMENT head
3 ELEMENT title
...
202 ELEMENT div
default namespace href=http://www.w3.org/1999/xhtml
203 ELEMENT p
204 ELEMENT code
205 ELEMENT math
default namespace href=http://www.w3.org/1998/Math/MathML
...
345 ELEMENT mo
346 ELEMENT mn
/ > bye
The goal is to join lines containing namespace to the previous line, show n ELEMENT name as n name, and ignore the rest (and then send more commands to xmllint).
The following command produces the lines I expect to see in the preceding subshell:
(echo 'xpath //*' )| xmllint --shell $proj/git/xml2xpath/tests/resources/html5.html | \
sed -nEe '{ :a; $!N;s/^([0-9]{1,5}) *ELEMENT *([^ ]*)\n +(default)? ?namespace ([a-z]+)? ?href=([^=]+)/\1 \2 \3\4=\5/;ta; s/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/; /^[1-9]/ P;D }'
1 html default=http://www.w3.org/1999/xhtml
2 head
3 title
4 link
5 link
6 link
7 link
8 body
9 h1
10 h2
...
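For comparison, the transformation described above (join namespace lines to their element, reduce n ELEMENT name to n name, drop everything else) can also be sketched with awk. This is not the sed approach used in this question, only an alternative for cross-checking results. The sample input is hard-coded, its indentation is an assumption about xmllint's output, and `join_ns` is a hypothetical helper name.

```shell
# Hypothetical sample of `xmllint --shell` output (indentation is assumed).
sample='1 ELEMENT html
    default namespace href=http://www.w3.org/1999/xhtml
    ATTRIBUTE lang
      TEXT
        content=en
2 ELEMENT head
191 ELEMENT svg:svg
    namespace svg href=http://www.w3.org/2000/svg'

# join_ns: hypothetical helper, not part of xml2xpath.
join_ns() {
  awk '
    # Print the pending element line, if any.
    function emit() { if (cur != "") print cur; cur = "" }

    # "n ELEMENT name" starts a new pending line "n name".
    $2 == "ELEMENT" && $1 ~ /^[0-9]+$/ { emit(); cur = $1 " " $3; next }

    # "default namespace href=URI" becomes " default=URI".
    cur != "" && $1 == "default" && $2 == "namespace" {
      sub(/^href=/, "", $3); cur = cur " default=" $3; next
    }

    # "namespace PREFIX href=URI" becomes " PREFIX=URI".
    cur != "" && $1 == "namespace" {
      sub(/^href=/, "", $3); cur = cur " " $2 "=" $3; next
    }

    # Any other line (ATTRIBUTE, TEXT, content=...) is ignored.
    END { emit() }
  '
}

result=$(printf '%s\n' "$sample" | join_ns)
printf '%s\n' "$result"
```

Because awk splits fields on any run of whitespace, this sketch does not depend on whether the indentation survives the trip through the file descriptor.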
The problem
Sending lines back to the subshell through a file descriptor does not join the lines correctly: the namespace info appears as its own item inside the arrns array (see the next code sample).
So reading from the file descriptor and processing with sed to fill an array is not working as expected. I am also trying to avoid post-processing, or parsing the file more than once, at this stage.
Best approach so far is:
#!/bin/bash
wget --no-clobber "https://www.w3.org/TR/XHTMLplusMathMLplusSVG/sample.xhtml" -O html5.html
fname='xff'
[ ! -p "$fname" ] && mkfifo "$fname"
exec 3<>"$fname"
cat /dev/null > tmp.log
stop='dir xxxxxxx'
function parse_line(){
    while read -r -u 3 xline; do
        printf "%s\n" "$xline"
        if [ "$xline" == "/ > $stop" ]; then
            break
        fi
    done | sed -nEe '{ :a; $!N;s/^([0-9]{1,5}) *ELEMENT *([^ ]*)\n +(default)? ?namespace ([a-z]+)? ?href=([^=]+)/\1 \2 \3\4=\5/;ta; s/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/; /^[1-9]|namespace/ P;D }'
}
(
    echo 'xpath //*'
    echo "$stop"
    IFS=$'\n' read -r -d '' -a arrns < <(parse_line && printf '\0')
    # print to file temporarily for debugging and avoid sending to xmllint shell
    printf "%s\n" "${arrns[@]}" >> tmp.log
    echo "OUT OF LOOP 1 ${#arrns[@]}" >> tmp.log
    echo "bye"
) | xmllint --shell html5.html >&3
exec 3>&-
rm xff
cat tmp.log
Parsing all lines from fd 3 into a variable and then applying sed gave the same result.
Contents of arrns as written to tmp.log (almost correct):
1 html
default namespace href=http://www.w3.org/1999/xhtml
2 head
3 title
4 link
5 link
...
239 math
default namespace href=http://www.w3.org/1998/Math/MathML
...
OUT OF LOOP 1 354
Lines 1 and 239 in the sample should be joined with their namespace, like this:
239 math default=http://www.w3.org/1998/Math/MathML
With a bit of processing, that would allow forwarding a command like this one to xmllint from the same subshell, to set namespaces as they appear in the document:
setns default=http://www.w3.org/1998/Math/MathML
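As a rough sketch of that "bit of processing", the joined lines could be turned into setns commands by taking everything after the element name. The input lines are copied from the expected results in this question; `to_setns` is a hypothetical helper name, and the sketch assumes each binding is a single whitespace-free prefix=URI token.

```shell
# Joined lines as the sed step is expected to produce them (sample values).
joined='1 html default=http://www.w3.org/1999/xhtml
2 head
191 svg:svg svg=http://www.w3.org/2000/svg
239 math default=http://www.w3.org/1998/Math/MathML'

# to_setns: hypothetical helper. Fields: number, element name, then bindings.
to_setns() {
  while read -r num name bindings; do
    # Unquoted on purpose: split several bindings on whitespace.
    # Assumes the URIs contain no glob characters.
    for b in $bindings; do
      printf 'setns %s\n' "$b"
    done
  done
}

cmds=$(printf '%s\n' "$joined" | to_setns)
printf '%s\n' "$cmds"
```

Lines without a namespace, such as 2 head, produce no command, so the output could be echoed straight into the xmllint shell from the same subshell.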
For some reason this while read loop does not reproduce the xmllint output exactly, which made the original sed command fail:
while read -r -u 3 xline; do ... ; done
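One likely cause, offered as an assumption and not a confirmed diagnosis: with the default IFS, read strips leading and trailing whitespace from each line. If xmllint indents the namespace lines (the " +" after "\n" in the original sed pattern suggests it does), the lines reaching sed no longer start with spaces, so the join can never match. A minimal sketch with a hard-coded sample line:

```shell
# Sample line; the four-space indentation is an assumption about xmllint's output.
line='    default namespace href=http://www.w3.org/1999/xhtml'

# Default IFS: read trims leading and trailing whitespace.
read -r stripped <<EOF
$line
EOF

# Empty IFS for this read only: the line is kept verbatim.
IFS= read -r kept <<EOF
$line
EOF

printf '[%s]\n[%s]\n' "$stripped" "$kept"
```

If that is what happens here, writing the loop as `while IFS= read -r -u 3 xline` should preserve the indentation and may let the original sed command work unchanged.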
I fixed the sed command after the loop and now the output is correct:
sed -E -e :a -e '/^[1-9]/,/^(default|namespace)/ { $!N;s/\n(default|namespace)/ \1/;ta }' \
-e 's/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/' \
-e 's/(default)? ?namespace( [a-z0-9]+)? ?href=([^=]+)/\1\2=\3/g' \
-e '/^[1-9]/ P;D'
The order of the expressions seems to matter here:
1- Join lines as expected
sed -E -e :a -e '/^[1-9]/,/^(default|namespace)/ { $!N;s/\n(default|namespace)/ \1/;ta }'
1 ELEMENT html
default namespace href=http://www.w3.org/1999/xhtml
Becomes
1 ELEMENT html default namespace href=http://www.w3.org/1999/xhtml
2- Handle *ELEMENT* lines
-e 's/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/'
Before: 1 ELEMENT html
After : 1 html
3- Handle the *namespace* part of the step 1 result
-e 's/(default)? ?namespace( [a-z0-9]+)? ?href=([^=]+)/\1\2=\3/g'
Before: 1 html default namespace href=http://www.w3.org/1999/xhtml
After : 1 html default=http://www.w3.org/1999/xhtml
4- Print lines starting with a number
-e '/^[1-9]/ P;D'
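Steps 2 and 3 can be checked in isolation on lines that are already joined. The sketch below reuses the two substitutions exactly as written above. The svg input line is an assumed step 1 result for element 191, and the variable names are only illustrative.

```shell
# Step 2: reduce "n ELEMENT name" to "n name".
step2=$(printf '%s\n' '1 ELEMENT html' |
  sed -E 's/^([0-9]{1,5}) *ELEMENT *([^ ]*)/\1 \2/')

# Step 3: rewrite the namespace part of an already joined line.
ns_expr='s/(default)? ?namespace( [a-z0-9]+)? ?href=([^=]+)/\1\2=\3/g'

# Default namespace: "default namespace href=URI" becomes "default=URI".
step3_default=$(printf '%s\n' \
  '1 html default namespace href=http://www.w3.org/1999/xhtml' |
  sed -E "$ns_expr")

# Prefixed namespace: "namespace svg href=URI" becomes "svg=URI".
step3_prefix=$(printf '%s\n' \
  '191 svg:svg namespace svg href=http://www.w3.org/2000/svg' |
  sed -E "$ns_expr")

printf '%s\n' "$step2" "$step3_default" "$step3_prefix"
```

In the prefixed case the match starts at the space before namespace, and the space captured in the second group puts it back, which is why the result still has exactly one space before svg=.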
Wrong result:
1 html
default namespace href=http://www.w3.org/1999/xhtml
2 head
...
191 svg:svg
namespace svg href=http://www.w3.org/2000/svg
...
Correct result after fix:
1 html default=http://www.w3.org/1999/xhtml
2 head
...
191 svg:svg svg=http://www.w3.org/2000/svg
...