Tags: python, python-2.7, youtube, web-crawler, urllib2

Scraping the link of an embedded YouTube video


I'm trying to scrape a web page. The data I want to collect is the link of the YouTube video embedded in that page. The problem is that urllib2 cannot execute JavaScript, so the link never appears in the HTML I get back:

response = OPENER.open("https://www.hopenglish.com/how-sugar-affects-the-brain?ref=category")
html_text = response.read() 
print html_text

Is there a way to retrieve this link without using another library to scrape the site? (Almost all of my crawler is already implemented; I just need the YouTube link of the embedded video.)


Solution

  • Going through the full HTML response turns up a lead: the YouTube video ID is assigned to a variable in inline JavaScript inside a script tag, so it is present in the raw HTML and no JavaScript needs to be executed.

    Part of the HTML response (the part that contains the video ID):

    <script type="text/javascript" language="javascript">
                    var vID = "lEXBxijQREo";
                    var srt_name = "sugaraffectsbrain";
                    var user_id = 0;
                    var post_id = 8349;
                    var share_link = 'https://www.hopenglish.com/how-sugar-affects-the-brain';
                    var share_img_link = 'https://s3-ap-northeast-1.amazonaws.com/hopenglish/wp/wp-content/uploads/2014/10/how-sugar-affects-the-brain.jpg';
                </script>
    

    From the HTML response above, retrieve the vID value with a regular expression as follows:

    import urllib2
    import re
    
    response = urllib2.urlopen("https://www.hopenglish.com/how-sugar-affects-the-brain?ref=category")
    html_text = response.read() 
    # print html_text
    
    # group(0) is the whole match; group(1) is only the ID captured between the quotes.
    m = re.search(r'vID = "(.*?)"', html_text)
    print m.group(0)
    

    This prints:

    vID = "lEXBxijQREo"
    

    You can then append the vID value lEXBxijQREo to the YouTube watch URL as follows:

    https://www.youtube.com/watch?v=lEXBxijQREo
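
The steps in this answer (find the inline `vID` assignment, capture only the ID, build the watch URL) can be put together in one small sketch. The names `SAMPLE_HTML`, `VID_PATTERN`, `extract_video_id`, and `build_watch_url` are hypothetical and used only for illustration. The sample string is a shortened copy of the script tag shown above, standing in for the `html_text` you would get from `response.read()`. The pattern is slightly more tolerant than the one in the answer (it allows extra whitespace and single quotes), which is an assumption about how the page might vary rather than something the site guarantees.

```python
import re

# Shortened copy of the inline script from the page; in the crawler this
# would be the html_text returned by response.read().
SAMPLE_HTML = '''
<script type="text/javascript" language="javascript">
    var vID = "lEXBxijQREo";
    var srt_name = "sugaraffectsbrain";
</script>
'''

# Hypothetical pattern: allows flexible whitespace and either quote style.
VID_PATTERN = re.compile(r'''var\s+vID\s*=\s*["']([^"']+)["']''')


def extract_video_id(html_text):
    """Return the vID value from the page source, or None if it is absent."""
    match = VID_PATTERN.search(html_text)
    # group(1) is only the captured ID, without the surrounding 'vID = "..."'.
    return match.group(1) if match else None


def build_watch_url(video_id):
    """Build the YouTube watch URL for a video ID."""
    return "https://www.youtube.com/watch?v=" + video_id


video_id = extract_video_id(SAMPLE_HTML)
if video_id is not None:
    print(build_watch_url(video_id))
```

Checking for `None` before building the URL keeps the crawler from raising an exception on pages that have no embedded video. If the page source is read as bytes (as it is under Python 3), it would need to be decoded to text before being passed to `extract_video_id`.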