shell web-scraping gps screen-scraping wikipedia

How to scrape Wikipedia GPS latitude/longitude?


I have been wondering how to scrape information from Wikipedia. For example, I have a list of world cities and want to obtain their approximate latitude and longitude. Take Miami as an example. When I run curl https://en.wikipedia.org/wiki/Miami | grep -E '(latitude|longitude)', somewhere in the HTML there is a tag like the one below.

<span class="latitude">25°46′31″N</span> <span class="longitude">80°12′31″W</span>

I know I can extract it with a regex, but I speak very poor regexish. Can any of you help me with this?


Solution

  • With xidel and XPath:

    $ xidel -se '
        concat(
            (//span[@class="latitude"]/text())[1],
            " ",
            (//span[@class="longitude"]/text())[1]
        )
    ' 'https://en.wikipedia.org/wiki/Miami'
    

    Output

    25°46′31″N 80°12′31″W
    

    Or

    saxon-lint --html --xpath '<XPATH EXP>' <URL>
    

    If you prefer more widely known tools:

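    # Download the page
    curl -s 'https://en.wikipedia.org/wiki/Miami' > Miami.html
    # Normalize the HTML in place (sponge is part of moreutils)
    xmlstarlet format -H Miami.html 2>/dev/null | sponge Miami.html
    # Evaluate the XPath expression against the normalized file
    xmlstarlet sel -t -v '<XPATH EXP>' Miami.html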
    

    Note that regular expressions are not the right tool for parsing HTML.
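
Since the question mentions a whole list of cities, here is a minimal sketch of how the xidel call above might be wrapped in a loop. The function names (wiki_url, city_coords, dms_to_decimal, scrape_cities) are made up for illustration, and it assumes each city name maps directly to its Wikipedia article title with spaces replaced by underscores, which will not hold for every city. The optional dms_to_decimal helper converts the extracted text (not the HTML) to signed decimal degrees.

```shell
#!/bin/sh

# Build the article URL for a city name (assumption: title == city name).
wiki_url() {
    printf 'https://en.wikipedia.org/wiki/%s\n' "$(printf '%s' "$1" | tr ' ' '_')"
}

# Print "LAT LON" for one city, using the same XPath as in the answer.
# stdin is redirected so xidel cannot consume a caller's input stream.
city_coords() {
    xidel -se '
        concat(
            (//span[@class="latitude"]/text())[1],
            " ",
            (//span[@class="longitude"]/text())[1]
        )
    ' "$(wiki_url "$1")" < /dev/null
}

# Convert a coordinate such as 25°46′31″N to signed decimal degrees.
# This works on the already extracted plain text, not on HTML.
dms_to_decimal() {
    printf '%s\n' "$1" |
        LC_ALL=C sed 's/[^0-9.NSEW]/ /g' |
        awk '{
            deg = 0
            # Fields 1..NF-1 are degrees, minutes, seconds; $NF is the hemisphere.
            for (i = 1; i < NF; i++) deg += $i / (60 ^ (i - 1))
            if ($NF == "S" || $NF == "W") deg = -deg
            printf "%.6f\n", deg
        }'
}

# Read city names from standard input, one per line,
# and print "city<TAB>latitude longitude" for each.
scrape_cities() {
    while IFS= read -r city; do
        [ -n "$city" ] || continue
        printf '%s\t%s\n' "$city" "$(city_coords "$city")"
        sleep 1   # pause between requests to go easy on the server
    done
}
```

With a hypothetical cities.txt containing one city per line, this could be run as scrape_cities < cities.txt after sourcing the functions. Cities whose article title differs from the plain name (disambiguation pages, for instance) would need their titles adjusted by hand.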