I'm trying to scrape a web page. The data I want to collect is the link to the YouTube video embedded in the page. The problem is that urllib2 doesn't execute JavaScript, so the link doesn't appear in the HTML I get back:
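import urllib2
# OPENER is defined elsewhere in the crawler; assumed here to be a plain urllib2 opener
OPENER = urllib2.build_opener()
response = OPENER.open("https://www.hopenglish.com/how-sugar-affects-the-brain?ref=category")
html_text = response.read()
print html_text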
Is there a way to retrieve this link without using another library to scrape this website? (Almost all of my crawler is already implemented; I just need the YouTube link of the embedded video.)
After going through the entire HTML response, I found that the YouTube video ID appears in inline JavaScript inside a script tag, so you don't need to execute any JavaScript to get it.
Part of the HTML response (which contains the video ID):
<script type="text/javascript" language="javascript">
var vID = "lEXBxijQREo";
var srt_name = "sugaraffectsbrain";
var user_id = 0;
var post_id = 8349;
var share_link = 'https://www.hopenglish.com/how-sugar-affects-the-brain';
var share_img_link = 'https://s3-ap-northeast-1.amazonaws.com/hopenglish/wp/wp-content/uploads/2014/10/how-sugar-affects-the-brain.jpg';
</script>
From the HTML response above, retrieve the vID value using a regular expression as follows:
import urllib2
import re
response = urllib2.urlopen("https://www.hopenglish.com/how-sugar-affects-the-brain?ref=category")
html_text = response.read()
# print html_text
m = re.search(r'vID = "(.*?)"', html_text)
# group(0) would be the whole match (vID = "..."); group(1) is just the captured video ID
print m.group(1)
which yields:
lEXBxijQREo
You can then append the vID value lEXBxijQREo to the YouTube watch URL as follows:
https://www.youtube.com/watch?v=lEXBxijQREo
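If you want to reuse this in the crawler, the two steps above (pull the ID out of the inline script, then build the watch URL) could be wrapped up roughly like this. This is only a sketch: the helper names extract_video_id and youtube_watch_url are made up for illustration, the pattern is slightly looser than the one above (it tolerates extra whitespace and single quotes), and it assumes html_text is a text string, which is what response.read() returns under Python 2.

```python
import re

# Hypothetical pattern: matches `var vID = "..."` with flexible spacing
# and either quote style, capturing only the ID itself.
VID_PATTERN = re.compile(r"""var\s+vID\s*=\s*["']([^"']+)["']""")


def extract_video_id(html_text):
    """Return the value of the inline vID variable, or None if it is absent."""
    match = VID_PATTERN.search(html_text)
    return match.group(1) if match else None


def youtube_watch_url(video_id):
    """Build the YouTube watch URL for a given video ID."""
    return "https://www.youtube.com/watch?v=" + video_id


# Sample taken from the HTML response shown above, so no network access is needed.
sample = '''<script type="text/javascript" language="javascript">
var vID = "lEXBxijQREo";
var srt_name = "sugaraffectsbrain";
</script>'''

video_id = extract_video_id(sample)
if video_id is not None:
    print(youtube_watch_url(video_id))
```

Returning None when the variable is missing lets the crawler skip pages without an embedded video instead of failing on m.group(...).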