python relative-url pyquery

make_links_absolute() results in broken absolute URLs


I need to convert relative URLs in an HTML page to absolute ones. I'm using pyquery for parsing.

For instance, this page http://govp.info/o-gorode/gorozhane has relative URLs in the source code, like

<a href="o-gorode/gorozhane?page=2">2</a>

(this is the pagination link at the bottom of the page). I'm trying to use make_links_absolute():

import requests
from pyquery import PyQuery as pq

page_url = 'http://govp.info/o-gorode/gorozhane'
resp = requests.get(page_url)
page = pq(resp.text)

page.make_links_absolute(page_url)

but it seems that this breaks the relative links:

print(page.find('a[href*="?page=2"]').attr['href'])

# prints            http://govp.info/o-gorode/o-gorode/gorozhane?page=2
# expected value    http://govp.info/o-gorode/gorozhane?page=2

As you can see, o-gorode is doubled in the middle of the final URL, which will definitely produce a 404 error.

Google Chrome, on the other hand, resolves this link correctly.

Internally pyquery uses urljoin from the standard urllib.parse module, somewhat like this:

from urllib.parse import urljoin
urljoin('http://example.com/one/', 'two')

# -> 'http://example.com/one/two'

That works, but a lot of sites have, hmm, unusual relative links that contain the full path.

And in this case urljoin will give us an invalid absolute link:

urljoin('http://govp.info/o-gorode/gorozhane', 'o-gorode/gorozhane?page=2')

# -> 'http://govp.info/o-gorode/o-gorode/gorozhane?page=2'

I believe such relative links are not really valid, but Google Chrome has no problem dealing with them, so I guess this is fairly common across the web.
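To make the urljoin behaviour concrete, here is a minimal sketch using only the URLs already shown above. It illustrates that the last path segment of the base is dropped when the base does not end with a slash, which is where the doubled o-gorode comes from. The variable names are only for illustration.

```python
from urllib.parse import urljoin

# Base ends with a slash: 'one/' is kept as a directory.
with_slash = urljoin('http://example.com/one/', 'two')

# Base has no trailing slash: 'one' is treated like a file name and replaced.
without_slash = urljoin('http://example.com/one', 'two')

# Same rule on the real page: 'gorozhane' is dropped, 'o-gorode/' is kept.
page_link = urljoin('http://govp.info/o-gorode/gorozhane',
                    'o-gorode/gorozhane?page=2')

print(with_slash)     # http://example.com/one/two
print(without_slash)  # http://example.com/two
print(page_link)      # http://govp.info/o-gorode/o-gorode/gorozhane?page=2
```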

Is there any advice on how to solve this problem? I tried furl, but it does the same join.


Solution

  • In this particular case, the page in question contains

    <base href="http://govp.info/"/>
    

    which instructs the browser to use this URL as the base for resolving all relative links. The <base> element is optional, but if it's there, you must use it instead of the page's actual URL.

    In order to do as the browser does, extract the base href and use it in make_links_absolute().

    import requests
    from pyquery import PyQuery as pq
    
    page_url = 'http://govp.info/o-gorode/gorozhane'
    resp = requests.get(page_url)
    page = pq(resp.text)
    
    base = page.find('base').attr['href']
    if base is None:
        base = page_url    # the page's own URL is the fallback
    
    page.make_links_absolute(base)
    
    for a in page.find('a'):
        if 'href' in a.attrib and 'govp.info' in a.attrib['href']:
            print(a.attrib['href'])
    

    prints

    http://govp.info/assets/images/map.png
    http://govp.info/podpiska.html
    http://govp.info/
    http://govp.info/#order
    ...
    http://govp.info/o-gorode/gorozhane
    http://govp.info/o-gorode/gorozhane?page=2
    http://govp.info/o-gorode/gorozhane?page=3
    http://govp.info/o-gorode/gorozhane?page=4
    http://govp.info/o-gorode/gorozhane?page=5
    http://govp.info/o-gorode/gorozhane?page=6
    http://govp.info/o-gorode/gorozhane?page=2
    http://govp.info/o-gorode/gorozhane?page=17
    http://govp.info/bannerclick/264
    ...
    http://doska.govp.info/cat-biznes-uslugi/
    http://doska.govp.info/cat-transport/legkovye-avtomobili/
    http://doska.govp.info/
    http://govp.info/
    

    which seems to be correct.
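
The same idea can be sketched without a network request or pyquery, using only the standard library. This is an illustration under assumptions, not part of pyquery: the helper name effective_base_url and the class _BaseHrefFinder are hypothetical, and the sample HTML is a cut-down stand-in for the real page. It also covers one case the snippet above does not: a <base href> may itself be relative, in which case it should be resolved against the page's own URL first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _BaseHrefFinder(HTMLParser):
    """Hypothetical helper: records the href of the first <base> that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <base .../>.
        if tag == 'base' and self.base_href is None:
            href = dict(attrs).get('href')
            if href:
                self.base_href = href


def effective_base_url(html, page_url):
    """Return the URL that relative links in `html` should be resolved against."""
    finder = _BaseHrefFinder()
    finder.feed(html)
    if finder.base_href is None:
        return page_url  # no <base>: the page's own URL is the fallback
    # A <base href> may be relative, so resolve it against the page URL.
    return urljoin(page_url, finder.base_href)


# Cut-down stand-in for the page from the question (assumed markup).
sample_html = (
    '<html><head><base href="http://govp.info/"/></head>'
    '<body><a href="o-gorode/gorozhane?page=2">2</a></body></html>'
)
page_url = 'http://govp.info/o-gorode/gorozhane'

base = effective_base_url(sample_html, page_url)
print(urljoin(base, 'o-gorode/gorozhane?page=2'))
# http://govp.info/o-gorode/gorozhane?page=2
```

The value returned by effective_base_url() could then be passed to make_links_absolute() in place of the page URL.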