linux bash awk wget xmllint

Finding Specific Date from a webpage?


Ok, so I have been tackling this for a few days now. I have tried multiple things, but I believe this current implementation is the closest I have come. I am looking to retrieve the last "Update:" date from the following URL: https://steamcommunity.com/sharedfiles/filedetails/changelog/2016338122

The link won't always be the same: the number at the end changes, and I will loop through multiple such pages to retrieve the same date from each.

This is what I have currently:

#!/usr/bin/env bash

## this is just a list of the mods i want to check.
activeModList=($(echo "$mods" | tr ',' '\n'))

for mod in "${activeModList[@]}"
do
   :
   modDirectory="modHTML/$mod.html"
   steamLink="https://steamcommunity.com/sharedfiles/filedetails/changelog/$mod"
   wget -O $modDirectory $steamLink
done

for mod in "${activeModList[@]}"
do
    :

    modDirectory="modHTML/$mod"
    modHTML="xmllint --nowarning --html --xpath "/html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1]" $modDirectory.html"

    lastUpdateTime=$(awk '/Update: /{p=1}p' "$modHTML")
    echo "$mod last updated: $lastUpdateTime"
done

To make things clearer: activeModList is an array of mod numbers to iterate through. Currently the script saves the HTML files to a specific folder.

I then attempt to use xmllint and awk to parse the date from the webpage.

It is worth noting that when I call the xmllint command I receive:

modHTML/928102085.html:294: HTML parser error : Unexpected end tag : b
re you sure you want to revert changes to your Workshop item back to <b>%1$s</b>
                                                                               ^
modHTML/928102085.html:426: HTML parser error : htmlParseEntityRef: no name
s item has been removed from the community because it violates Steam Community &
                                                                               ^
<div class="changelog headline">&#13;
                                                        Update: 15 Aug, 2021 @ 5:10am

I can't guarantee I won't get warnings/errors like this every time, as I will potentially iterate through hundreds of similar webpages. So I am wondering if I can parse the output of xmllint to retrieve just the update date and time at the end.

Many thanks in advance guys.

Edit:

The output of lastUpdateTime creates these syntax errors:

awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/928102085.html' for reading (No such file or directory)
928102085 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/731604991.html' for reading (No such file or directory)
731604991 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/1404697612.html' for reading (No such file or directory)
1404697612 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/618916953.html' for reading (No such file or directory)
618916953 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/566885854.html' for reading (No such file or directory)
566885854 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/924933745.html' for reading (No such file or directory)
924933745 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/1609138312.html' for reading (No such file or directory)

Solution

  • A couple of issues with the current code:

    • awk treats an argument that follows the program text as the name of a file to read, but modHTML is a (string) variable, so awk tries to open a file with that name; to have awk process the variable's contents you can use a here-string to feed the string to awk on stdin, e.g.: awk '/Update: /{p=1}p' <<< "$modHTML"
    • modHTML="xmllint --nowarning ..." assigns the literal string xmllint --nowarning ... to modHTML, when what you really want is to run the xmllint call and store its output in the modHTML variable via command substitution, e.g.: modHTML=$(xmllint --nowarning ...)
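    Both points can be seen in isolation with a small sketch. Here fake_xmllint is a made-up stand-in for the real xmllint call (so the snippet runs without the downloaded pages), and asString, asOutput and fromHere are names chosen only for this illustration:

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the xmllint call; prints two lines of sample text.
fake_xmllint() { printf '%s\n' 'header' 'Update: 15 Aug, 2021 @ 5:10am'; }

asString="fake_xmllint"     # plain assignment: stores the literal text, runs nothing
asOutput=$(fake_xmllint)    # command substitution: runs the command, stores its stdout

echo "asString: $asString"

# The here-string feeds the variable's contents to awk on stdin,
# so awk no longer tries to open a file named after the string.
fromHere=$(awk '/Update: /{p=1}p' <<< "$asOutput")
echo "fromHere: $fromHere"
```

    The first echo shows the command text itself, which is what the OP's awk was being handed as a file name; the second shows the matched line once awk reads the captured output instead.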

    Rolling these changes into OP's current code:

    for mod in "${activeModList[@]}"
    do
        modDirectory="modHTML/$mod"
    
        modHTML=$(xmllint --nowarning --html --xpath "/html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1]" "$modDirectory.html")
    
        lastUpdateTime=$(awk '/Update: /{p=1}p' <<< "$modHTML")
    
        # uncomment following line to assist with debugging; this will
        # show you exactly what's stored in the variables thus allowing
        # you to verify if your code is doing what you think it's doing
    
        # typeset -p modHTML lastUpdateTime
    
        echo "$mod last updated: $lastUpdateTime"
    done
    

    NOTES:

    • I don't use xmllint, so I can't comment on whether or not this is a valid call, but the proposed code changes should at least get the OP a bit closer to the desired result
    • the awk call can probably be tweaked to provide a more compact answer, but I'll leave that up to the OP to work on (once we get past the current errors and start generating actual output)
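
As one possible way of tightening the awk call, the sketch below prints only the text that follows the first "Update: " and then stops reading. The helper name extract_update is made up for this example, and the sample text is modelled on the xmllint output quoted in the question rather than on a live page, so treat it as a starting point and not a tested solution for every changelog:

```shell
#!/usr/bin/env bash

# Hypothetical helper: split each line on "Update: " and print what
# follows it on the first matching line, then exit.
extract_update() {
    awk -F'Update: ' '/Update: /{ print $2; exit }'
}

# Sample input modelled on the output shown in the question (an assumption,
# not a real download).
sample=$'<div class="changelog headline">&#13;\n        Update: 15 Aug, 2021 @ 5:10am\n</div>'

lastUpdateTime=$(extract_update <<< "$sample")
echo "last updated: $lastUpdateTime"
```

In the loop this would replace the awk '/Update: /{p=1}p' call. As far as I know the HTML parser errors from xmllint are written to stderr, which $( ... ) does not capture, so adding 2>/dev/null to the xmllint call should keep them off the terminal without affecting what awk sees.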