Search code examples
pythonweb-scrapingyoutubeurllib2

Python data scraping


I want to download a couple songs off of http://www.youtube-mp3.org/. I'm using urllib2 and BeautifulSoup.

The problem is that when I urllib2 open the site with my video ID plugged in, http://www.youtube-mp3.org/?c#v=lV7r8PiuecQ, I get the site but they are tricky about it and load the info after the initial pageload with some js ajax stuff. So when I try to scrape the url of the download link, literally isn't on the page because it hasn't been loaded.

Anyone know how I can maybe trigger this js loader in my python script, or something?

Here is the relevant empty html BEFORE the content that I want is loaded into it.

<div id="link_box" style="display:none">
   <div id="link_box_title" style="font-weight:bold; text-decoration:underline">
   </div>
   <div class="row">
    <div id="link_box_bb_code_title" style="font-weight:bold">
    </div>
    <input type="text" id="BBCodeLink" onclick="sAll(this)" />
   </div>
   <div class="row">
    <div id="link_box_html_code_title" style="font-weight:bold">
    </div>
    <input type="text" id="HTMLLink" onclick="sAll(this)" />
   </div>
   <div class="row">
    <div id="link_box_direct_code_title" style="font-weight:bold">
    </div>
    <input type="text" id="DirectLink" onclick="sAll(this)" />
   </div>
  </div>
  <div id="v-ads">
  </div>
  <div id="dl_link">
  </div>
  <div id="progress">
  </div>
  <div id="loader">
   <img src="ajax-loader-b.gif" alt="loading.." width="16" height="11" />
  </div>
 </div>
 <div class="clear">
 </div>
</div>

Solution

  • The API is JSON-based, so the contents of the html files won't give you any clue on where to find the files. A good idea when exploring web services like this one, is to open the Network tab in Chrome's developer tools and see what pages it loads when interacting with the page. That exercise showed me that two urls in particular seem interesting:

    1. http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DKMU0tzLwhbE&xy=trve&r=1314700829128
    2. http://www.youtube-mp3.org/api/itemInfo/?video_id=KMU0tzLwhbE&adloc=&r=1314700829314

    The first url appears to be queuing a file for processing, the second to get the status of the processing job.

    The second url takes a video_id GET parameter that is the id for the video on youtube (http://www.youtube.com/watch?v=KMU0tzLwhbE) and returns the status of the decoding job. The second and third seem irrelevant for this purpose which you can verify by test loading the url with and without the extra parameters.

    The content of the page is:

    info = { "title" : "Developers", 
             "image" : "http://i4.ytimg.com/vi/KMU0tzLwhbE/default.jpg", 
             "length" : "3", "status" : "serving", "progress_speed" : "", 
             "progress" : "", "ads" : "", 
             "h" : "a0aa17294103c638fa7f5e0606f839d3" };
    

    Which happens to be JSON data. The interesting bit in this is "a0aa17294103c638fa7f5e0606f839d3" which looks like a hash that the web service use to refer to the decoded mp3 file. Also check out how the download link on the front page looks:

    http://www.youtube-mp3.org/get?video_id=KMU0tzLwhbE&h=a0aa17294103c638fa7f5e0606f839d3

    Now we have all the missing pieces of the puzzle together. First, we take the url of a youtube video (http://www.youtube.com/watch?v=iKP7DZmqdbU) url quote it and feed it to the api using this url:

    http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DiKP7DZmqdbU&xy=trve

    Then, wait a few moments until the decoding job is done:

    http://www.youtube-mp3.org/api/itemInfo/?video_id=iKP7DZmqdbU

    Take the hash found in the info url to construct the download url:

    http://www.youtube-mp3.org/get?video_id=iKP7DZmqdbU&h=2e4b61b6ddc8bf83f5a0e4e4ee0635bb

    Note that it is possible that the web master of the site does not want to be scraped and will take counter measures if people starts to (in the webmasters eyes) abuse the site. For example it seem to use referer protection so clicking the links in this post won't work, you have to copy them and load them in a new browser window.

    Test code:

    from re import findall
    from time import sleep
    from urllib import urlopen, quote
    
    yt_code = 'gijypDkEqUA'
    
    yt_url = 'http://www.youtube.com/watch?v=%s' % yt_code
    push_url_fmt = 'http://www.youtube-mp3.org/api/pushItem/?item=%s&xy=trve'
    info_url_fmt = 'http://www.youtube-mp3.org/api/itemInfo/?video_id=%s'
    download_url_fmt = 'http://www.youtube-mp3.org/get?video_id=%s&h=%s'
    push_url = push_url_fmt % quote(yt_url)
    data = urlopen(push_url).read()
    sleep(10)
    info_url = info_url_fmt % yt_code
    data = urlopen(info_url).read()
    res = findall('"h" : "([^"]*)"', data)
    download_url = download_url_fmt % (yt_code, res[0])
    print 'Download here:', download_url