
Scraping subelements in HTML using jQuery?


I'm currently working on a script to scrape some very basic information from an HTML page. Specifically, I'm trying to get some information about artists from allmusic.com. I'm writing this script in node.js using jQuery to do the actual scraping, and have it working to a certain degree by using the examples from this blog post.

What I'm trying to do is run a search on a popular artist, then store some basic information about the first result, which will almost certainly be the artist I'm looking for. I'm able to extract the table in question using the code below, but I can't figure out how to get the first couple of td elements from the HTML, which is what I really need. My node.js code is as follows:

var request = require('request'),
    jsdom = require('jsdom');

request({ uri:'http://allmusic.com/search/artist/lady+gaga' }, function (error, response, body) {

  jsdom.env({
    html: body,
    scripts: [
      'http://code.jquery.com/jquery-1.5.min.js'
    ]
  }, function (err, window) {
    var $ = window.jQuery;

    // jQuery is now loaded on the jsdom window created from 'body'
    var search = $('.search-results').html();
    if(search != null){
      //gah what can i do here?!?
    }
  });
});

Below is the block of HTML in question so that you don't need to go find it yourself:

<table class="search-results" border="0" cellpadding="0" cellspacing="0" width="100%">
   <tr>
      <th class="relevance">
          <a href="http://www.allmusic.com/search/artist/lady gaga/filter:all/exact:0/order:relevance-asc" title="order by relevance">Relevance</a>
      </th>
      <th width="10px">&nbsp;</th>

      <th>
         <a href="http://www.allmusic.com/search/artist/lady gaga/filter:all/exact:0/order:name-asc" title="order by name">Name</a>
      </th>
      <th width="75px">
          <a href="http://www.allmusic.com/search/artist/lady gaga/filter:all/exact:0/order:genre-asc" title="order by genre">Genre</a>
       </th>
       <th width="200px">Years Active</th>

    </tr>

    <!-- THE ACTUAL RELEVANT STUFF THAT I WANT IS BELOW -->

    <tr>
       <td class="relevance text-center">
           <div class="bar" style="width:100%" title="100%"></div>
       </td>
       <td class="text-center"></td>
       <td><a href="http://www.allmusic.com/artist/lady-gaga-p1055684">Lady Gaga</a></td>

        <td>Pop/Rock</td>   <!-- SPECIFICALLY THIS -->
        <td>00s</td>
    </tr>

There are many more entries in this table, but this is the first result. Is it possible to create an array of the td elements, or something of that sort, and just get the right index? It should be the same index for every artist, assuming I always take the first result.

If this isn't possible, are there any other ways of achieving my goal? Alternatively, are there better ways of doing what I'm trying to do with node.js? I looked at a bunch of different options, and this just seemed to be the simplest.

Best, and Thanks,
Sami


Solution

  • You can use the .siblings() method to traverse the td elements.

    See: http://api.jquery.com/siblings/

    You can also select all the td elements with jQuery, which returns an array-like jQuery object, and use the index as you mentioned.

    The selector should be something like this:

    var tds = $('.search-results tr td');
    

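    This gets every td in the table as one flat list, so to reach a cell in a given row you have to multiply the row number by the number of columns and add the column index. Alternatively, you can select the rows instead: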

    var trs = $('.search-results tr');
    

    Remember that the first row contains the header. Its cells are th elements, so they are not in the tds variable, but the header row is the first entry in trs.

    Hope this helps.
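
    To make the "multiply by the number of columns" step concrete, here is a minimal sketch of the index arithmetic in plain JavaScript, with no jQuery involved. The helper name flatCellIndex, the cellTexts array, and the second row of data are all hypothetical stand-ins for illustration. It assumes every result row has exactly five td cells, as in the HTML from the question.

```javascript
// Hypothetical helper: position of a cell in a flat, row-major list of <td>s.
// rowIndex and colIndex are zero-based; columnCount is the number of <td>s per row.
function flatCellIndex(rowIndex, colIndex, columnCount) {
  return rowIndex * columnCount + colIndex;
}

// Stand-in for the text of each <td> in document order. Header <th> cells are
// not matched by a 'td' selector, so the list starts at the first result row.
// The second row is made-up placeholder data, not real search output.
var cellTexts = [
  '', '', 'Lady Gaga', 'Pop/Rock', '00s',
  '', '', 'Another Artist', 'Some Genre', '90s'
];

var COLUMN_COUNT = 5; // relevance, spacer, name, genre, years active
var GENRE_COLUMN = 3; // zero-based position of the genre cell within a row

var firstGenre = cellTexts[flatCellIndex(0, GENRE_COLUMN, COLUMN_COUNT)];
var secondGenre = cellTexts[flatCellIndex(1, GENRE_COLUMN, COLUMN_COUNT)];

console.log(firstGenre, secondGenre);
```

    With the tds selection from above, the same lookup for the first result would be roughly $('.search-results tr td').eq(flatCellIndex(0, 3, 5)).text(), again assuming the five-column layout holds for every row.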