I want to store several Wikipedia links, but I don't want to store two different links that point to the same page. For example, the following links are different but point to the same Wikipedia page:
https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no
https://en.wikipedia.org/w/index.php?title=(1S)-1-methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole&redirect=no
__________________________________________________|___________________________________________________________
The only difference is one uppercase character. Or take the following links:
https://en.wikipedia.org/wiki/(0,1)-matrix
https://en.wikipedia.org/wiki/(0,1)_matrix
___________________________________|______
These differ only because one has '-' and the other has '_' (which stands for a space). So what I want is to store only one of them, or the following links instead:
https://en.wikipedia.org/wiki/Tetrahydroharman
https://en.wikipedia.org/wiki/Logical_matrix
I have already tried the answer to this SO question, but it didn't work for me: the result is the initial URL, not the one Wikipedia redirects me to in the browser. How can I get the final URL?
Wikipedia doesn't issue a proper 301/302 redirect here. When you open the link, the server returns a normal 200 success response, and the URL is then changed using JS.
I came up with a quick, workable solution. First, remove &redirect=no from the URL:
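If the links are processed in bulk, that first step can be automated. The following is only a minimal sketch using the standard library; the helper name `strip_redirect_param` is made up for illustration, and it assumes the parameter always appears as an ordinary `redirect=...` query field.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_redirect_param(url):
    """Return url without any 'redirect' query field (hypothetical helper)."""
    parts = urlsplit(url)
    # Filter the raw query fields so the remaining ones keep their exact encoding.
    kept = [field for field in parts.query.split("&")
            if field.split("=", 1)[0] != "redirect"]
    return urlunsplit(parts._replace(query="&".join(kept)))
```

URLs without a query string, such as the /wiki/ links above, pass through unchanged.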
In [42]: import requests
In [43]: r = requests.get('https://en.wikipedia.org/w/index.php?title=(1S)-1-Methyl-2,3,4,9-tetrahydro-1H-pyrido-3,4-b-indole')
In [44]: # r.text is the decoded HTML (r.content is bytes on Python 3)
    ...: # keep everything after the opening of the canonical href value
    ...: tmp = r.text.split('<link rel="canonical" href="')[-1]
In [45]: idx = tmp.find('"')  # closing quote of the href value
In [46]: real_link = tmp[:idx]
In [47]: real_link
Out[47]: 'https://en.wikipedia.org/wiki/Tetrahydroharman'
The real URL is stored in the href attribute of the <link rel="canonical" href="..."> tag.
The method above is good enough for your use case. Alternatively, you can parse the page with a library such as bs4 and read the link from it, or extract the link with a regex.
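As a rough sketch of the parser-based route, here is a version that uses Python's built-in `html.parser` instead of bs4 (a deliberate swap, to avoid a third-party dependency). The names `CanonicalLinkParser` and `canonical_url` are invented for this example, and it assumes the page contains at most one canonical link that matters.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def canonical_url(html):
    """Return the canonical URL found in html, or None if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

In practice `html` would be the `r.text` of the response fetched above. Adding each result to a `set` then gives one entry per page, no matter which spelling of the link was fetched.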