python python-2.7 youtube web-crawler urllib2

Scraping the link of an embedded YouTube video

I'm trying to scrape a website (the page I'm trying to crawl). The data I want to collect is the link to the YouTube video embedded in the page. The problem is that urllib2 can't execute the JavaScript, so the link doesn't appear in the HTML I get back:

response = OPENER.open("https://www.hopenglish.com/how-sugar-affects-the-brain?ref=category")
html_text = response.read() 
print html_text

Is there a way to retrieve this link without using another library to scrape this website? (Almost all of my crawler is already implemented; I just need the YouTube link of the embedded video.)

Solution

After going through the entire HTML response, I found a lead: the YouTube video ID is assigned in an inline JavaScript snippet inside a script tag.

Part of the HTML response (the part that contains the video ID):

<script type="text/javascript" language="javascript">
                var vID = "lEXBxijQREo";
                var srt_name = "sugaraffectsbrain";
                var user_id = 0;
                var post_id = 8349;
                var share_link = 'https://www.hopenglish.com/how-sugar-affects-the-brain';
                var share_img_link = 'https://s3-ap-northeast-1.amazonaws.com/hopenglish/wp/wp-content/uploads/2014/10/how-sugar-affects-the-brain.jpg';
            </script>

From the HTML response above, retrieve the vID value with a regular expression as follows:

import urllib2
import re

response = urllib2.urlopen("https://www.hopenglish.com/how-sugar-affects-the-brain?ref=category")
html_text = response.read() 
# print html_text

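# Capture whatever sits between the quotes after `vID = `
m = re.search(r'vID = "(.*?)"', html_text)
print m.group(0)  # the whole match
print m.group(1)  # only the captured video ID

which yields:

vID = "lEXBxijQREo"
lEXBxijQREo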

You can then append the vID value lEXBxijQREo to the YouTube watch URL as follows:

https://www.youtube.com/watch?v=lEXBxijQREo
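
The steps above can be wrapped in one small function. This is only a minimal sketch: the helper name extract_youtube_url and the pattern constant VID_PATTERN are hypothetical, and it assumes the page keeps declaring the ID in an inline script as var vID = "...". It returns None when the pattern is missing instead of failing on m.group, and it is written to run on both Python 2.7 and Python 3.

```python
import re

# Hypothetical helper, not part of the original answer. It assumes the page
# still declares the ID in an inline script as: var vID = "...";
VID_PATTERN = re.compile(r'var\s+vID\s*=\s*"([^"]+)"')


def extract_youtube_url(html_text):
    """Return the YouTube watch URL for the vID found in html_text, or None."""
    match = VID_PATTERN.search(html_text)
    if match is None:
        return None  # pattern not found; the page layout may have changed
    return "https://www.youtube.com/watch?v=" + match.group(1)


# Sample input copied from the HTML response shown above (shortened).
sample = '''<script type="text/javascript" language="javascript">
                var vID = "lEXBxijQREo";
                var srt_name = "sugaraffectsbrain";
            </script>'''
print(extract_youtube_url(sample))
```

In the crawler, the html_text returned by response.read() would be passed in place of sample.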