I'm trying to extract the src attribute from <iframe> elements like these:
    iframes = [<iframe frameborder="no" height="160px" scrolling="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308197184%3Fsecret_token%3Ds-VtArH&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&visual=true" width="100%"></iframe>,
               <iframe allowtransparency="true" frameborder="0" scrolling="no" src="//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false" style="border:none; overflow:hidden; width:300px; height:62px;"></iframe>,
               <iframe allowfullscreen="" frameborder="0" height="169" src="//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1" width="100%"></iframe>]
but when I try to extract it:
    for iframe in iframes:
        url = urllib2.urlopen(iframe.attrs['src'])
        print(url)
I get the following error:
        url = urllib2.urlopen(iframe.attrs['src'])
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
        return opener.open(url, data, timeout)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 423, in open
        protocol = req.get_type()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 285, in get_type
        raise ValueError, "unknown url type: %s" % self.__original
    ValueError: unknown url type: //www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false
Why am I getting a URL with no http before the //www?
Is there some workaround for this?
> Why am I getting a URL with no http before the //www?
These are protocol-relative (also called scheme-relative) URLs. Omitting the scheme is a common way to tell the user agent to use the same scheme (http, https, ftp, file, etc.) as the current page when making a subsequent request. So, for example, if you loaded the current page over https, then the URLs that omit the scheme would also be accessed over https.
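To illustrate that resolution rule, here is a minimal sketch using the standard library's urljoin, which resolves a reference against a base URL the way a browser would. The example.com page URLs are hypothetical and stand in for whatever page the iframes were scraped from.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# One of the scheme-less src values from the question.
src = '//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1'

# Hypothetical page URLs (assumptions, not from the question).
over_https = urljoin('https://example.com/post', src)
over_http = urljoin('http://example.com/post', src)

print(over_https)  # inherits https from the page
print(over_http)   # inherits http from the page
```

If you know the URL of the page you scraped, resolving each src against it this way also takes care of ordinary relative paths, not only missing schemes.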
> Is there some workaround for this?
You can use the urlparse module to handle this in Python 2 (since that's your version of Python):
    # from urllib.parse import urlparse, urlunparse  # Python 3
    import urllib2
    from urlparse import urlparse, urlunparse

    for iframe in iframes:
        scheme, netloc, path, params, query, fragment = urlparse(iframe.attrs['src'])
        if not scheme:
            scheme = 'http'  # default scheme you used when getting the current page
        url = urlunparse((scheme, netloc, path, params, query, fragment))
        print('Fetching {}'.format(url))
        f = urllib2.urlopen(url)
        # print(f.read())  # dumps the response content
If you run the above code you should see this output:
    Fetching https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/308197184%3Fsecret_token%3Ds-VtArH&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&visual=true
    Fetching http://www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2FPauseMusicale&width=300&height=62&show_faces=false&colorscheme=light&stream=false&show_border=false&header=false
    Fetching http://www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1
which shows that the default scheme has been applied to the two URLs that lacked one, while the URL that already had a scheme is left unchanged.
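If you want to check the scheme handling without making any network requests, the same idea can be wrapped in a small pure function. This is only a sketch: the name with_default_scheme and its default parameter are made up for illustration, and it uses urlsplit/urlunsplit rather than urlparse/urlunparse, with an import fallback so it runs on Python 2 or 3.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2


def with_default_scheme(src, default='http'):
    """Return src unchanged if it has a scheme, else prepend the default one.

    Hypothetical helper for illustration; intended for inputs that are
    either absolute URLs or scheme-relative ones starting with '//'.
    """
    parts = urlsplit(src)
    if parts.scheme:
        return src
    # SplitResult is a namedtuple, so _replace swaps in the scheme.
    return urlunsplit(parts._replace(scheme=default))


srcs = [
    'https://w.soundcloud.com/player/?visual=true',
    '//www.youtube.com/embed/videoseries?list=PLNKCTdT9YSESoQnj5tPP4P9kaIwBCx7F1',
]
for src in srcs:
    print(with_default_scheme(src))
```

Pass default='https' if the page containing the iframes was itself loaded over https.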