Tags: python, python-3.x, http, python-requests, http-redirect

Python - How to get the page Wikipedia will redirect me to?


I want to store a few different Wikipedia links, but I don't want to store two different links to the same page. For example, the following links are different but point to the same Wikipedia page:

https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no 
https://en.wikipedia.org/w/index.php?title=(1S)-1-methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no
__________________________________________________|___________________________________________________________

The only difference is that one uppercase character. Or the following links:

https://en.wikipedia.org/wiki/(0,1)-matrix 
https://en.wikipedia.org/wiki/(0,1)_matrix 
___________________________________|______ 

These differ only because one has '-' and the other has '_' (which stands for ' '). So what I want is to store only one link from each pair, or else the following links, which are the pages they end up on:

https://en.wikipedia.org/wiki/Tetrahydroharman 
https://en.wikipedia.org/wiki/Logical_matrix 

I have already tried the answer to this SO question, but it didn't work for me: the result is the initial URL, not the one Wikipedia redirects me to in the browser. So how can I achieve what I'm looking for?


Solution

  • Wikipedia doesn't issue a proper 301/302 redirect here. When you open the link, the server returns a normal 200 success response, and the URL is then changed using JS.

    I came up with a quick, workable solution. First, remove &redirect=no from the URL:

    In [42]: import requests
    
    In [43]: r = requests.get('https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole')
    
    In [44]: tmp = r.text.split('<link rel="canonical" href="')[-1]  # r.text is str; r.content is bytes on Python 3
    
    In [45]: idx = tmp.find('"')  # closing quote of the href value
    
    In [46]: real_link = tmp[:idx]
    
    In [47]: real_link
    Out[47]: 'https://en.wikipedia.org/wiki/Tetrahydroharman'
    

    The real URL is stored in the href attribute of the <link rel="canonical"> tag.

    The method above is good enough for your use case. Alternatively, you can use a library like bs4 to parse the page and get the link, or use a regex to extract it.
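As a rough illustration of the parsing route, here is a minimal sketch that uses only the standard library: html.parser stands in for bs4, and urllib.parse handles dropping the redirect=no parameter. The names strip_redirect_param, CanonicalLinkParser and extract_canonical are made up for this example, and it assumes the page carries a <link rel="canonical"> tag as described above.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_redirect_param(url):
    """Return url without its `redirect` query parameter (e.g. redirect=no)."""
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "redirect"]
    # Keep parentheses and commas readable, as in Wikipedia titles.
    return urlunsplit(parts._replace(query=urlencode(query, safe="(),")))


class CanonicalLinkParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel_values = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attrs.get("href")


def extract_canonical(html):
    """Return the canonical URL found in an HTML string, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


if __name__ == "__main__":
    # Stand-in HTML; with requests this would be requests.get(url).text
    sample = ('<html><head><link rel="canonical" '
              'href="https://en.wikipedia.org/wiki/Tetrahydroharman"/></head></html>')
    print(extract_canonical(sample))
```

To deduplicate, you could store the canonical URL of each page in a set, so two links that lead to the same article collapse into one entry.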