Python - ValueError: unknown url type


I'm trying to extract the src attribute from <iframe> elements like these:

   iframes =  [<iframe frameborder="no" height="160px" scrolling="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308197184%3Fsecret_token%3Ds-VtArH&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;visual=true" width="100%"></iframe>, <iframe allowtransparency="true" frameborder="0" scrolling="no" src="//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&amp;width=300&amp;height=62&amp;show_faces=false&amp;colorscheme=light&amp;stream=false&amp;show_border=false&amp;header=false" style="border:none; overflow:hidden; width:300px; height:62px;"></iframe>, <iframe allowfullscreen="" frameborder="0" height="169" src="//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1" width="100%"></iframe>]

but when I try to open each source:

for iframe in iframes:
    url = urllib2.urlopen(iframe.attrs['src'])
    print (url)

I get the following error:

   url = urllib2.urlopen(iframe.attrs['src'])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 423, in open
    protocol = req.get_type()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 285, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: //www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false

Why am I getting a URL with no http before the //www?

Is there a workaround for this?


Solution

  • Why am I getting a URL with no http before the //www?

    This is a scheme-relative (often called protocol-relative) URL. It is a common way to tell the user agent to use the same scheme (http, https, ftp, file, etc.) as the current page when making a subsequent request. For example, if the current page was loaded over https, URLs that omit the scheme are also fetched over https.
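As a small illustration of that rule, the standard library's urljoin resolves a scheme-relative reference against a base URL in the same way. The two page URLs below are made-up placeholders, not the page from the question; only the src value comes from the iframes above.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

# Hypothetical page URLs, used only for illustration.
https_page = 'https://example.com/articles/1'
http_page = 'http://example.com/articles/1'

# A scheme-relative src taken from the question's iframes.
src = '//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1'

# The reference keeps its own host and path but inherits the base's scheme.
resolved_https = urljoin(https_page, src)
resolved_http = urljoin(http_page, src)
print(resolved_https)
print(resolved_http)
```

If you know the URL of the page the iframes came from, joining against it like this avoids having to guess a default scheme.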

    Is there a workaround for this?

    You can use the urlparse module to handle this in Python 2 (since that's your version of Python):

    # Python 3: from urllib.parse import urlparse, urlunparse
    #           and use urllib.request.urlopen instead of urllib2.urlopen
    import urllib2
    from urlparse import urlparse, urlunparse

    for iframe in iframes:
        # Split the src into its six components; scheme is '' for '//host/...' URLs
        scheme, netloc, path, params, query, fragment = urlparse(iframe.attrs['src'])
        if not scheme:
            scheme = 'http'    # default to the scheme used to fetch the current page
        url = urlunparse((scheme, netloc, path, params, query, fragment))
        print('Fetching {}'.format(url))
        f = urllib2.urlopen(url)
        # print(f.read())    # uncomment to dump the response content


    If you run the above code you should see this output:

    Fetching https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308197184%3Fsecret_token%3Ds-VtArH&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&visual=true
    Fetching http://www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false
    Fetching http://www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1
    

    which shows that the default scheme has been applied only to the URLs that lacked one; the SoundCloud URL keeps its original https.
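The scheme-filling step can also be pulled out into a small helper that needs no network access, which makes it easy to check in isolation. This is only a sketch: the name with_default_scheme and the choice of https as the default are assumptions for illustration, not part of the original answer.

```python
try:
    from urllib.parse import urlsplit, urlunsplit   # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit       # Python 2


def with_default_scheme(url, default='https'):
    """Return url unchanged if it has a scheme, else add the default scheme.

    Hypothetical helper; 'default' should match the scheme used to fetch
    the page the URL was found on.
    """
    parts = urlsplit(url)
    if parts.scheme:
        return url
    # Rebuild the URL with the default scheme and the original
    # netloc, path, query and fragment.
    return urlunsplit((default,) + tuple(parts)[1:])


absolute = with_default_scheme('https://w.soundcloud.com/player/?visual=true')
relative = with_default_scheme(
    '//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1',
    default='http',
)
print(absolute)
print(relative)
```

Inside the loop you would then call something like urlopen(with_default_scheme(iframe.attrs['src'])).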