I have some links that I am collecting from a sitemap and Twitter. The problem is that some links are in Arabic, like this one:
https://www.aljazeera.net/videos/2021/3/1/ياقوتيا-مدينة-روسية-يتجمد-فيها-كل-شيء
I am trying to unshorten the shortened Twitter links and percent-encode the Arabic links so that they look like this:
https://www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1
If your goal is to take a URL with non-ASCII characters in it and convert it to the %XX format, you can use Python's built-in urllib.parse module to percent-encode the link:
>>> import urllib.parse
>>> oddlink = 'https://www.aljazeera.net/videos/2021/3/1/ياقوتيا-مدينة-روسية-يتجمد-فيها-كل-شيء'
>>> goodlink = urllib.parse.quote(oddlink)
>>> print(goodlink)
https%3A//www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1
Keep in mind that quote() also encodes the : after https as %3A. You can undo this manually:
>>> goodlink = goodlink.replace('%3A', ':', 1)  # restore only the first colon, the one after the scheme
>>> print(goodlink)
https://www.aljazeera.net/videos/2021/3/1/%D9%8A%D8%A7%D9%82%D9%88%D8%AA%D9%8A%D8%A7-%D9%85%D8%AF%D9%8A%D9%86%D8%A9-%D8%B1%D9%88%D8%B3%D9%8A%D8%A9-%D9%8A%D8%AA%D8%AC%D9%85%D8%AF-%D9%81%D9%8A%D9%87%D8%A7-%D9%83%D9%84-%D8%B4%D9%8A%D8%A1
Or, you can pass : as a 'safe' character, meaning that urllib.parse.quote will leave it unencoded:
>>> urllib.parse.quote(oddlink, safe='/:')
The / is listed among the safe characters as well, because it separates the parts of a link. Compare the result of quoting https://www.google.com/ with and without it:
https://www.google.com/ # safe='/:'
https:%2F%2Fwww.google.com%2F # safe=':'
The / character is safe by default, but passing your own safe string replaces that default, so you need to include / in it yourself.
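If you would rather not list safe characters for the whole link, another option is to split the URL first and quote only the parts that can contain Arabic text. The following is a minimal sketch of that idea, not a complete URL normalizer. The function name encode_url is made up for illustration, and the choice to keep % safe (so that links which are already encoded are not encoded twice) is an assumption about what you want. The host name is left untouched, so internationalized domain names are not handled here.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url: str) -> str:
    """Percent-encode the path, query and fragment of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # '%' is kept safe so an already-encoded link passes through unchanged.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    # Scheme and host are reassembled as they were, so "https://" survives.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


oddlink = "https://www.aljazeera.net/videos/2021/3/1/ياقوتيا-مدينة-روسية-يتجمد-فيها-كل-شيء"
goodlink = encode_url(oddlink)
print(goodlink)
```

Because the scheme and host never go through quote(), there is no %3A to repair afterwards, and calling encode_url on a link that is already encoded returns it unchanged.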