wrong srcset attributes from httrack


I have spidered a website with httrack, which generated a lot of files on different directory levels. But the website uses picture / source tags with srcset attributes, which httrack does not handle, so none of those pictures work properly offline.

httrack can see the links if I use the option Attempt to detect all links (even in unknown tags/javascript code) (in WinHTTrack), and it copied all images to local storage. But it did not change the paths to relative ones.

Now I need a script (PowerShell / GNU bash) that edits all the HTML files and adapts the paths in the srcsets to the correct relative path.

My idea would be to recurse through the folders, passing one additional ../ per level as a parameter, and to insert/replace it with sed.

what to do:

example files:

index.html
cat1/product1.html
cat2/option3/product5.html

each contains some picture tags like:

<picture>
     <source srcset="/images/img1_low.jpg, /images/img1_low_ret.jpg x2" media="(max-width: 470px)">
     <source srcset="/images/img1_med.jpg, /images/img1_med_ret.jpg x2" media="(max-width: 960px)">
     <source srcset="/images/img1_hi.jpg, /images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
     <img src="../images/img1_hi.jpg" />
</picture>

Inside the img tag the path is always set correctly by httrack (images/img1_hi.jpg, ../images/img1_hi.jpg, ../../images/img1_hi.jpg),

but the source tags must contain the matching paths as well:

in index.html:

<picture>
     <source srcset="images/img1_low.jpg, images/img1_low_ret.jpg x2" media="(max-width: 470px)">
     <source srcset="images/img1_med.jpg, images/img1_med_ret.jpg x2" media="(max-width: 960px)">
     <source srcset="images/img1_hi.jpg, images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
     <img src="images/img1_hi.jpg" />
</picture>

in cat1/product1.html:

<picture>
     <source srcset="../images/img1_low.jpg, ../images/img1_low_ret.jpg x2" media="(max-width: 470px)">
     <source srcset="../images/img1_med.jpg, ../images/img1_med_ret.jpg x2" media="(max-width: 960px)">
     <source srcset="../images/img1_hi.jpg, ../images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
     <img src="../images/img1_hi.jpg" />
</picture>

in cat2/option3/product5.html:

<picture>
     <source srcset="../../images/img1_low.jpg, ../../images/img1_low_ret.jpg x2" media="(max-width: 470px)">
     <source srcset="../../images/img1_med.jpg, ../../images/img1_med_ret.jpg x2" media="(max-width: 960px)">
     <source srcset="../../images/img1_hi.jpg, ../../images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
     <img src="../../images/img1_hi.jpg" />
</picture>
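
The rule behind these three expected results is one ../ per directory between the mirror root and the file. A minimal sketch of that rule, assuming the path is given relative to the mirror root without a leading ./ (the function name prefix_for is made up for this illustration and is not part of httrack or of the scripts in this thread):

```shell
#!/usr/bin/env bash
# prefix_for is a hypothetical helper: it prints one "../" for every
# directory component of a path given relative to the mirror root.
prefix_for() (
    path=$1
    prefix=""
    # strip one leading directory per iteration until only the file name is left
    while case $path in */*) true ;; *) false ;; esac; do
        prefix="../$prefix"
        path=${path#*/}
    done
    printf '%s\n' "$prefix"
)

prefix_for index.html                   # prints an empty line
prefix_for cat1/product1.html           # prints ../
prefix_for cat2/option3/product5.html   # prints ../../
```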

my attempt:

#!/usr/bin/bash

function workfolder {
    # $1 = current folder
    # $2 = prefix upfolders

    pushd $PWD
    cd $1

    for i in $( ls ) ; do
        if [ -d $i ] ; then
            workfolder $i ../$2
        fi
    done

    for i in $( ls *.html ) ; do
        sed -i 's/srcset="images/srcset="$2images/g' $i
        sed -i 's/, images/, $2images/g' $i
    done

    popd

}

workfolder .

Apart from too many other errors, the $2 in the sed replacement is not expanded but inserted literally.
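
The literal $2 comes from the single quotes around the sed script: the shell expands nothing inside them. A small demonstration on one sample line, using a made-up variable named prefix in place of $2 (its value is ../ with the slash escaped, because / is also the sed delimiter here):

```shell
#!/usr/bin/env bash
prefix='..\/'    # stands in for $2; the backslash keeps sed from reading / as a delimiter
line='<source srcset="images/img1_low.jpg, images/img1_low_ret.jpg x2">'

# single quotes only: sed receives the text $prefix, the shell expands nothing
literal=$(printf '%s\n' "$line" | sed 's/srcset="images/srcset="$prefiximages/g')

# close the single quotes, insert "$prefix" in double quotes, reopen the single quotes
expanded=$(printf '%s\n' "$line" | sed 's/srcset="images/srcset="'"$prefix"'images/g')

printf '%s\n' "$literal" "$expanded"
```

The first result contains the text $prefiximages, the second contains ../images in the first srcset entry.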


Solution

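#!/usr/bin/bash
function workfolder {
    # $1 = current folder
    # $2 = prefix of "..\/" segments (slash escaped for sed), one per folder level
    local i

    pushd "$1" > /dev/null
    echo "=====^ $PWD ====="
    # handle the subfolders first, each with one more "..\/" in front of the prefix
    for i in */ ; do
        if [ -d "$i" ] ; then
            workfolder "$i" "..\\/$2"
        fi
    done
    for i in *.html ; do
        [ -f "$i" ] || continue    # the glob matched nothing in this folder
        echo " working on: $PWD/$i with $2"
        # leave the single quotes around $2 so that the shell expands it
        sed -i 's/srcset="image/srcset="'"$2"'image/g' "$i"
        sed -i 's/, image/, '"$2"'image/g' "$i"
    done
    popd > /dev/null
    echo "=====v $PWD ====="
}

workfolder .

The traps are: getting $2 expanded inside the sed command at all (in the first attempt it sat inside single quotes and was not expanded), and escaping ../ in the second parameter so that it is usable inside the sed commands.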
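
A non-recursive variant is also possible. The following is only a sketch under stated assumptions, not the method used above: it lets find list the HTML files, derives the prefix from the number of directories in each path, and uses | as the sed delimiter so that ../ needs no escaping. The function name fix_srcsets is made up for this illustration, sed -i without a backup suffix assumes GNU sed, and file names containing newlines are not handled.

```shell
#!/usr/bin/env bash
# fix_srcsets ROOT: hypothetical helper that prefixes srcset entries starting
# with "images" by one "../" per directory level below ROOT.
fix_srcsets() (
    cd "$1" || exit 1
    find . -type f -name '*.html' | while IFS= read -r file; do
        rest=${file#./}                 # e.g. cat1/product1.html
        prefix=""
        # one "../" per directory component in front of the file name
        while case $rest in */*) true ;; *) false ;; esac; do
            prefix="../$prefix"
            rest=${rest#*/}
        done
        [ -n "$prefix" ] || continue    # files directly in ROOT need no change
        # "|" as delimiter: the slashes in $prefix are plain text to sed
        sed -i -e "s|srcset=\"images|srcset=\"${prefix}images|g" \
               -e "s|, images|, ${prefix}images|g" "$file"
    done
)
```

It would be called as fix_srcsets . from the mirror root, ideally on a copy of the mirror first. Like the script above, it only matches entries that start with images; if the mirrored files still carry the leading slash shown in the question's first example (/images/...), the patterns would need that slash as well.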