I have mirrored a website with HTTrack, which generated a lot of files on different directory levels. The website uses picture / source tags with srcset attributes, which HTTrack does not handle, so none of those pictures work properly offline.
HTTrack does see the links if I use the option "Attempt to detect all links (even in unknown tags/javascript code)" (in WinHTTrack), and it copied all images to local storage. But it did not change the paths to relative ones.
Now I need a script (PowerShell / GNU bash) that edits all the HTML files and adapts the paths in the srcset attributes to the correct relative path.
My idea is a recursion over the folders that passes one additional ../ per level as a parameter, to be inserted with sed.
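The "one ../ per folder level" idea can also be expressed without recursion, by counting the directory levels in each file's path. The following is only a minimal sketch: prefix_for is a hypothetical helper name, and it assumes the paths are given relative to the mirror root.

```shell
#!/usr/bin/env bash
# prefix_for is a hypothetical helper: it prints one "../" for every
# directory level in a path given relative to the mirror root.
prefix_for() {
    local path=${1#./}    # drop a leading "./" as printed by find
    local prefix=""
    # Each "/" still left in the path is one level to climb.
    while [[ $path == */* ]]; do
        prefix+="../"
        path=${path#*/}   # drop the leading directory component
    done
    printf '%s\n' "$prefix"
}

for f in index.html cat1/product1.html cat2/option3/product5.html; do
    printf '%s -> "%s"\n' "$f" "$(prefix_for "$f")"
done
# index.html -> ""
# cat1/product1.html -> "../"
# cat2/option3/product5.html -> "../../"
```

The same function could be fed from find . -name '*.html' instead of the fixed list, which would make a recursive shell function unnecessary.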
example files:
index.html
cat1/product1.html
cat2/option3/product5.html
each contains some picture tags like:
<picture>
<source srcset="/images/img1_low.jpg, /images/img1_low_ret.jpg x2" media="(max-width: 470px)">
<source srcset="/images/img1_med.jpg, /images/img1_med_ret.jpg x2" media="(max-width: 960px)">
<source srcset="/images/img1_hi.jpg, /images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
<img src="../images/img1_hi.jpg" />
</picture>
Inside the img tag, HTTrack always sets the path correctly (images/img1_hi.jpg, ../images/img1_hi.jpg, ../../images/img1_hi.jpg),
but the source tags must also contain the matching paths:
in index.html:
<picture>
<source srcset="images/img1_low.jpg, images/img1_low_ret.jpg x2" media="(max-width: 470px)">
<source srcset="images/img1_med.jpg, images/img1_med_ret.jpg x2" media="(max-width: 960px)">
<source srcset="images/img1_hi.jpg, images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
<img src="images/img1_hi.jpg" />
</picture>
in cat1/product1.html:
<picture>
<source srcset="../images/img1_low.jpg, ../images/img1_low_ret.jpg x2" media="(max-width: 470px)">
<source srcset="../images/img1_med.jpg, ../images/img1_med_ret.jpg x2" media="(max-width: 960px)">
<source srcset="../images/img1_hi.jpg, ../images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
<img src="../images/img1_hi.jpg" />
</picture>
in cat2/option3/product5.html:
<picture>
<source srcset="../../images/img1_low.jpg, ../../images/img1_low_ret.jpg x2" media="(max-width: 470px)">
<source srcset="../../images/img1_med.jpg, ../../images/img1_med_ret.jpg x2" media="(max-width: 960px)">
<source srcset="../../images/img1_hi.jpg, ../../images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
<img src="../../images/img1_hi.jpg" />
</picture>
my attempt:
#!/usr/bin/bash
function workfolder {
    # $1 = current folder
    # $2 = prefix upfolders
    pushd $PWD
    cd $1
    for i in $( ls ) ; do
        if [ -d $i ] ; then
            workfolder $i ../$2
        fi
    done
    for i in $( ls *.html ) ; do
        sed -i 's/srcset="images/srcset="$2images/g' $i
        sed -i 's/, images/, $2images/g' $i
    done
    popd
}
workfolder .
Apart from many other errors, the $2 in the sed replacement is not expanded but inserted literally.
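The literal $2 comes from the single quotes around the sed expression: inside single quotes the shell expands nothing. A small sketch of the difference, using a variable named prefix as a stand-in for $2 (the variable names and the sample line are only for illustration):

```shell
#!/usr/bin/env bash
prefix='..\/'   # example value for $2, slash already escaped for sed
line='<source srcset="images/img1_low.jpg" media="(max-width: 470px)">'

# Single quotes: the shell leaves $prefix alone, so sed inserts it literally.
literal=$(printf '%s\n' "$line" | sed 's/srcset="images/srcset="$prefiximages/')

# Double quotes: the shell expands ${prefix} before sed sees the command.
expanded=$(printf '%s\n' "$line" | sed "s/srcset=\"images/srcset=\"${prefix}images/")

printf '%s\n' "$literal" "$expanded"
# <source srcset="$prefiximages/img1_low.jpg" media="(max-width: 470px)">
# <source srcset="../images/img1_low.jpg" media="(max-width: 470px)">
```

Closing the single quotes around the variable, as in 's/.../'"$2"'.../', has the same effect as the double-quoted form.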
Updated attempt:
#!/usr/bin/bash
function workfolder {
    # $1 = folder to work on
    # $2 = prefix of upfolders, slashes escaped for sed (e.g. ..\/..\/)
    pushd "$1" > /dev/null || return
    echo "=====^ $PWD ====="
    for i in */ ; do
        if [ -d "$i" ] ; then
            workfolder "$i" "..\\/$2"
        fi
    done
    for i in *.html ; do
        # the glob stays literal if the folder contains no .html file
        [ -f "$i" ] || continue
        echo " working on: $PWD/$i with $2"
        sed -i 's/srcset="image/srcset="'"$2"'image/g' "$i"
        sed -i 's/, image/, '"$2"'image/g' "$i"
    done
    popd > /dev/null
    echo "=====v $PWD ====="
}
workfolder .
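The traps were: getting $2 expanded inside the sed command at all (in the first attempt it was not expanded, because it sat inside single quotes), and escaping ../ correctly when passing it as the second parameter, so that it arrives in a form usable in the sed commands.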
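The escaping trap can probably be avoided altogether, because sed accepts a delimiter other than / for the s command. A minimal sketch, assuming the prefix contains no | or & characters; rewrite_srcset is a hypothetical helper name, not part of the script above:

```shell
#!/usr/bin/env bash
# rewrite_srcset is a hypothetical helper: it reads HTML on stdin and
# prepends the prefix given as $1 to the image paths inside srcset.
rewrite_srcset() {
    local prefix=$1
    # "|" is the s-command delimiter here, so "/" is an ordinary character.
    sed -e "s|srcset=\"images/|srcset=\"${prefix}images/|g" \
        -e "s|, images/|, ${prefix}images/|g"
}

rewrite_srcset '../../' <<'EOF'
<source srcset="images/img1_hi.jpg, images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
EOF
# <source srcset="../../images/img1_hi.jpg, ../../images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
```

With this form the prefix can be passed around as plain ../ and built up as ../$2, without any backslashes. Like the script above, the second expression would also rewrite a ", images/" that happens to appear outside a srcset attribute.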