Python - ValueError: unknown url type

I'm trying to extract the src attribute from <iframe> elements like these:

   iframes =  [<iframe frameborder="no" height="160px" scrolling="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308197184%3Fsecret_token%3Ds-VtArH&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false&amp;visual=true" width="100%"></iframe>, <iframe allowtransparency="true" frameborder="0" scrolling="no" src="//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&amp;width=300&amp;height=62&amp;show_faces=false&amp;colorscheme=light&amp;stream=false&amp;show_border=false&amp;header=false" style="border:none; overflow:hidden; width:300px; height:62px;"></iframe>, <iframe allowfullscreen="" frameborder="0" height="169" src="//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1" width="100%"></iframe>]

But when I try to open each source:

for iframe in iframes:
    url = urllib2.urlopen(iframe.attrs['src'])
    print (url)

I get the following error:

   url = urllib2.urlopen(iframe.attrs['src'])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 423, in open
    protocol = req.get_type()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 285, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: //www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false

Why am I getting a URL with no http before the //www?

Is there a workaround for this?

Solution

Why am I getting a URL with no http before the //www?

This is a scheme-relative (often called protocol-relative) URL. It is a common way to tell the user agent to reuse the scheme (http, https, ftp, file, etc.) of the current page when making a subsequent request. So, for example, if the current page was loaded over https, URLs that omit the scheme are fetched over https as well. urllib2 has no "current page" to borrow a scheme from, which is why it raises ValueError.
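To make the scheme inheritance concrete, here is a small sketch using the standard library's urlparse and urljoin. The example.com page URLs are hypothetical stand-ins for whatever page the iframes were scraped from, and the try/except import is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse      # Python 2

src = '//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1'

# The parser finds a host but an empty scheme, which is what urlopen() rejects.
parts = urlparse(src)
print(repr(parts.scheme))
print(parts.netloc)

# Hypothetical base pages: the resolved URL takes its scheme from the base.
secure = urljoin('https://example.com/some/page.html', src)
plain = urljoin('http://example.com/some/page.html', src)
print(secure)
print(plain)
```

If you know the URL of the page you scraped, resolving each src against it with urljoin is probably closer to what a browser does than hard-coding a scheme.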

Is there a workaround for this?

You can use the urlparse module to handle this in Python 2 (since that's your version of Python):

# from urllib.parse import urlparse, urlunparse    # Python 3
# from urllib.request import urlopen               # Python 3 (replaces urllib2.urlopen)
import urllib2
from urlparse import urlparse, urlunparse

for iframe in iframes:
    # Split the src into its six components; scheme is '' for //host/... URLs
    scheme, netloc, path, params, query, fragment = urlparse(iframe.attrs['src'])
    if not scheme:
        scheme = 'http'    # use the scheme the current page was fetched with
    url = urlunparse((scheme, netloc, path, params, query, fragment))
    print('Fetching {}'.format(url))
    f = urllib2.urlopen(url)
    # print(f.read())    # dumps the response content

If you run the above code you should see this output:

Fetching https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308197184%3Fsecret_token%3Ds-VtArH&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&visual=true
Fetching http://www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false
Fetching http://www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1

which shows that the default scheme has been applied to the URLs that lacked one, while the SoundCloud URL, which already had https, is left unchanged.
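If you want to check the URL handling without hitting the network, the same logic can be pulled into a small function. This is only a sketch: with_default_scheme is a name made up for illustration, and the src strings are abbreviated copies of the ones in the question. It only deals with URLs that start with // or already carry a scheme.

```python
try:
    from urllib.parse import urlparse, urlunparse  # Python 3
except ImportError:
    from urlparse import urlparse, urlunparse      # Python 2


def with_default_scheme(url, default_scheme='http'):
    """Return url unchanged if it has a scheme, else add default_scheme."""
    parts = urlparse(url)
    if parts.scheme:
        return url
    # ParseResult is a namedtuple, so _replace swaps in the missing scheme.
    return urlunparse(parts._replace(scheme=default_scheme))


# Abbreviated versions of the src values from the question.
srcs = [
    'https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308197184',
    '//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300',
    '//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1',
]

fixed = [with_default_scheme(s) for s in srcs]
for url in fixed:
    print(url)
```

Each entry in fixed can then be passed to urllib2.urlopen (or urllib.request.urlopen on Python 3) as in the loop above.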