
URL elongator using libcurl: IRI/IDN and fragment issues


I'm trying to code a URL elongator using libcurl through pycURL (if you don't know pycURL, don't go away: this is a libcurl issue).
A URL elongator is the reverse of a URL shortener: the aim is to follow every redirect and get the final URL, so we can see the real domain behind the link.
Here is the code showing what I'm trying to do:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import pycurl

url = "https://t.co/0u0Jb2Pw7k"  # Wikipedia Colonne Vendôme

c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.FOLLOWLOCATION, 1)  # Follow redirects: this is what does the elongation
c.setopt(pycurl.SSL_VERIFYHOST, 0)
c.setopt(pycurl.SSL_VERIFYPEER, 0)
c.setopt(pycurl.MAXREDIRS, 25)
c.setopt(pycurl.AUTOREFERER, 1)
c.setopt(pycurl.WRITEFUNCTION, lambda x: None)  # Discard the body; only the final URL matters
c.setopt(pycurl.HEADER, 1)   # For debug only
c.setopt(pycurl.VERBOSE, 1)  # For debug only
c.setopt(pycurl.USERAGENT, "Opera/12.02 (X11; Linux i686; Opera Cqcb Style; U; fr-FR) Presto/2.9.201 Version/12.02/AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu")
c.setopt(pycurl.REFERER, url)

try:
    c.perform()
except pycurl.error:
    pass  # Keep going: the info below is still available after a failed transfer
print c.getinfo(pycurl.HTTP_CODE), c.getinfo(pycurl.EFFECTIVE_URL)

There are multiple issues:

  1. libcurl doesn't seem able to handle IRIs or IDNs. In the case given in the code above, the URL should be elongated to https://fr.wikipedia.org/wiki/Colonne_Vendôme, but libcurl returns https://fr.wikipedia.org/wiki/Colonne_Vend￴me. I think you can see the difference. I know those URLs are not RFC-compliant, but they are out in the wild, so I have to be able to handle them. So my questions are:
    Is there a way to force libcurl to understand those URLs? Is there a way to force the encoding? Is there a way to work between requests to re-encode the URL (see the first sketch after this list)?

  2. There is also an issue with URL fragments, a.k.a. anchors (#). If the final URL contains a fragment, libcurl trims it before returning the answer. That makes sense from an HTTP point of view, because the fragment should never be sent to the server, but of course I need that part. Not because the anchor matters in itself, but because if the URL http://goo.gl/I8AYpW is elongated to https://groups.google.com/forum/ it's absolutely useless. So my questions:
    Is there a way to keep the fragment at the end? Is there a way to get the last requested URL (so, with the fragment)? Once again, is there a way to work between requests to save the final fragment (see the second sketch after this list)?

  3. There are a few sites that don't work well with this kind of elongator, like these:
    http://t.co/Gej1JY3sgf returns an HTTP 301 with an empty response but works in a browser
    http://t.co/3Ek7U438Ee returns an HTTP 303 but works in a browser
    http://tinyurl.com/lvyapao doesn't get elongated (like any tinyurl link).
    Do you have any advice or hints on those?
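
To make the "work between requests" idea in point 1 concrete, here is a minimal sketch (Python 3 for brevity) of re-encoding an IRI into an RFC 3986 URI before handing it back to libcurl. The iri_to_uri helper is hypothetical, built only on the standard library, and it ignores userinfo for simplicity:

# Hypothetical helper, not a pycurl API: IDNA-encode the host and
# percent-encode the non-ASCII parts of an IRI as UTF-8.
from urllib.parse import urlsplit, urlunsplit, quote

def iri_to_uri(iri):
    parts = urlsplit(iri)
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    if parts.port:
        host = "%s:%d" % (host, parts.port)
    return urlunsplit((
        parts.scheme,
        host,
        quote(parts.path, safe="/%"),      # keep existing %-escapes and slashes
        quote(parts.query, safe="=&?/%"),
        quote(parts.fragment, safe="%"),
    ))

print(iri_to_uri("https://fr.wikipedia.org/wiki/Colonne_Vendôme"))
# -> https://fr.wikipedia.org/wiki/Colonne_Vend%C3%B4me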
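
And for point 2, a sketch of what "saving the fragment between requests" could look like: turn FOLLOWLOCATION off and follow the redirects by hand, so each Location header passes through Python before the next request. The elongate function below is one possible shape, not a pycurl feature, and it reuses the hypothetical iri_to_uri helper above:

# Manual redirect loop that keeps the last fragment seen.
from urllib.parse import urljoin
import pycurl

def elongate(url, max_redirs=25):
    fragment = ""
    for _ in range(max_redirs):
        headers = {}

        def collect(line, headers=headers):
            try:
                line = line.decode("utf-8")       # servers often send raw UTF-8 here
            except UnicodeDecodeError:
                line = line.decode("iso-8859-1")  # fall back instead of crashing
            if ":" in line:
                name, value = line.split(":", 1)
                headers[name.strip().lower()] = value.strip()

        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, lambda x: None)  # discard the body
        c.setopt(pycurl.HEADERFUNCTION, collect)
        c.perform()
        status = c.getinfo(pycurl.HTTP_CODE)
        c.close()

        location = headers.get("location")
        if not (300 <= status < 400 and location):
            break                                  # no redirect left: we are done
        location, _, new_fragment = location.partition("#")
        if new_fragment:
            fragment = new_fragment                # save what libcurl would have dropped
        url = iri_to_uri(urljoin(url, location))   # resolve relative Location + re-encode
    return url + ("#" + fragment if fragment else "")

print(elongate("http://goo.gl/I8AYpW"))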

What I'm looking for is good code, so I don't like stopgaps, but if there is no other solution I will use one. If you tell me there is a better way of doing this than libcurl, I can drop pycURL. But I can't drop Python.

So, if you have anything, I'll take it. I have no idea what to do now.

EDIT:

Finally, an update:

  1. For this one, it turned out to be a security issue at Twitter. I was trying to elongate t.co URLs, but Twitter wasn't returning the same URL to wget/curl-style clients as to a real browser (HTTP/JS). As it was a security issue, I won a bounty, but couldn't talk about it until a week ago: https://hackerone.com/reports/34084

  2. For this one, the answer below solved my issue. That's why it was accepted.

  3. There is no global solution for this one, as it has to be handled case by case.


Solution

  • This libcurl stuff does not look like it is going to do the trick. I would use the requests package:

    import requests
    
    bla = requests.head("https://t.co/0u0Jb2Pw7k", allow_redirects=True)
    
    print(bla)
    print(bla.url)
    
    >> <Response [404]>
    >> https://fr.wikipedia.org/wiki/Colonne_Vend%EF%BF%B4me
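
    If you need each hop of the chain (for instance to recover a fragment from a raw Location header yourself), requests keeps the intermediate responses around; a short usage note with the same bla as above:

    # Every intermediate redirect response is kept in .history, so the
    # raw Location headers can be inspected by hand.
    for hop in bla.history:
        print(hop.status_code, hop.headers.get("Location"))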