ruby, csv, web-scraping, nokogiri, watir

When web scraping with Watir, how do I parse results in same class and enter them into separate CSV cells?


I'm using Watir to scrape the search results from a website and write them to a CSV file. When I run a search, each result is split across span elements with different classes, so the HTML looks something like:

<span class="sn_auth_name">foo</span>
<span class="sn_target_lang">English</span>

and my code looks like:

sn_auth_name   = row.xpath('span[@class="sn_auth_name"]/text()').text.strip
sn_target_lang = row.xpath('span[@class="sn_target_lang"]/text()').text.strip

CSV.open("file.csv", "a") do |csv|
  csv << [sn_auth_name, sn_target_lang]
end

The issue is that, for some of the search results, there are multiple items assigned to the same class. That is, sometimes there is only one sn_auth_name, and sometimes there are three! Right now, all of them end up crammed into the same cell in my CSV file.

Is there a way to handle occasionally getting more than one result assigned to the same class, ideally so that the second (or third) result is entered into a separate cell?

Thanks!


Someone has asked for more details, so here's the output I normally get.

<table class="restable"><tr>
<td class="res1">1/1</td>
<td class="res2">
    <span class="sn_auth_name">Imām</span>, 
    <span class="sn_auth_firstname">Abū Bakr</span>:
    <span class="sn_target_title">Al-Kalām rasmāl</span> [
    <span class="sn_target_lang">Arabic</span>]/ 
    <span class="sn_transl_name">Ḥijāzī al-Sayyid</span>, 
    <span class="sn_transl_firstname">Muṣṭafā</span> /
    <span class="sn_pub">
      <span class="place">Al-Qāhirah</span>: 
      <span class="publisher">Al-Majlis al-Alā lil-Thaqāfah</span> [
      <span class="sn_country">Egypt</span>]</span>,
    <span class="sn_year">2000</span>.
    <span class="sn_pagination">588 p.</span>
    <span class="sn_orig_title">Magana jarice</span> [
    <span class="sn_orig_lang">Afrikaans</span>]
</td></tr>
</table>

This is no problem to scrape because there is exactly one span of each class for every piece of text I want to capture. But every so often, I get a result like this:

<tr>
<td class="res1">7/8</td>
<td class="res2">
    <span class="sn_auth_name">Plenge</span>, 
    <span class="sn_auth_firstname">Vagn</span>;
    <span class="sn_auth_name">Wyk</span>, 
    <span class="sn_auth_firstname">Chris van</span>:
    <span class="sn_target_title">Opbrud</span> [
    <span class="sn_target_lang">Danish</span>] / 
    <span class="sn_transl_name">Hansen</span>, 
    <span class="sn_transl_firstname">Finn Holten</span>;
    <span class="sn_transl_name">Madelung</span>, 
    <span class="sn_transl_firstname">Marianne</span>;
    <span class="sn_transl_name">Seiketso</span>, 
    <span class="sn_transl_firstname">Helen Gaohenngwe</span> /
    <span class="sn_pub">
      <span class="place">Frederiksberg</span>: 
      <span class="publisher">AKS</span>,
      <span class="place">Frederiksberg</span>: 
      <span class="publisher">Hjulet</span> [
      <span class="sn_country">Denmark</span>]</span>,
    <span class="sn_year">2000</span>.
    <span class="sn_pagination">247 p.</span> [
    <span class="sn_orig_lang">Afrikaans</span>], [
    <span class="sn_orig_lang">English</span>]
</td></tr>

Here, for example, there are multiple entries for sn_auth_name, and what ends up in my CSV file is a single cell containing PlengeWyk. Ideally, the script would create a sn_auth_name2 value and record it in a separate cell, so that Plenge and Wyk each get their own cell.

Any thoughts?


Solution

  • The #xpath method returns a NodeSet, which is a collection of matching nodes. The NodeSet includes Enumerable, which provides a number of methods for iterating over the collection. Rather than getting the text of the entire node set, you want to iterate over each node and collect its text.

    sn_auth_name = row.xpath('span[@class="sn_auth_name"]').map { |node| node.text.strip }
    #=> ["Plenge", "Wyk"]
    

    As an Array of names, sn_auth_name will still get written to the CSV in a single cell. If you want each name written into its own cell, you will need to flatten the Array. You can either flatten the individual column using a splat:

    csv << [*sn_auth_name, sn_target_lang]
    

    If there are multiple columns to flatten, you can also flatten the whole array:

    csv << [sn_auth_name, sn_target_lang].flatten
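
    To see the difference between the nested and flattened rows without writing a file, here is a small sketch using CSV.generate_line from Ruby's standard library. The sample values come from the second search result in the question; the variable names nested, splatted, and flattened are only illustrative.

    ```ruby
    require 'csv'

    sn_auth_name   = ['Plenge', 'Wyk'] # what the #map call returns
    sn_target_lang = 'Danish'

    # Nested: the inner Array is converted to a String and lands in one cell
    nested = CSV.generate_line([sn_auth_name, sn_target_lang])

    # Splat or flatten: each name becomes its own cell
    splatted  = CSV.generate_line([*sn_auth_name, sn_target_lang])
    flattened = CSV.generate_line([sn_auth_name, sn_target_lang].flatten)

    puts splatted # Plenge,Wyk,Danish
    ```

    The splat is the more targeted choice when only one column can hold several values; flatten is simpler when several columns can.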
    

    With either approach, rows can end up with different numbers of columns. To avoid that, you can pad all of the rows so that they have the same number of columns:

    # Index of the author-name column within each data row
    col_auth_name = 0
    
    # Collect the data from the table into an Array
    data = []
    doc.css('td.res2').each do |row|
      sn_auth_name = row.xpath('span[@class="sn_auth_name"]').map { |node| node.text.strip }
      sn_target_lang = row.xpath('span[@class="sn_target_lang"]/text()').text.strip
      data << [sn_auth_name, sn_target_lang]
    end
    
    # Determine max number of names in a row
    max_auth_name = data.map { |row| row[col_auth_name].length }.max
    
    CSV.open("file.csv", "a") do |csv|
      data.each do |row|
        # Fill the Array of names to meet the max length
        row[col_auth_name].fill('', row[col_auth_name].length..(max_auth_name - 1))
    
        # Write to the CSV file
        csv << row.flatten
      end
    end
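
The padding idea can also be tried on its own, without a browser or parser. The following is a minimal sketch under some assumptions: the rows in data are hard-coded sample values (the single-author row is hypothetical), and the numbered header row (sn_auth_name1, sn_auth_name2, ...) is an addition that the original answer does not include. It pads with nil, which CSV writes as an empty cell, and builds the output in memory with CSV.generate instead of appending to file.csv.

```ruby
require 'csv'

# Sample rows of [author names, target language]; the first row is hypothetical
data = [
  [['Smith'], 'English'],
  [['Plenge', 'Wyk'], 'Danish']
]

# Largest number of author names found in any row
max_auth_name = data.map { |names, _lang| names.length }.max

# Header such as sn_auth_name1, sn_auth_name2, ..., sn_target_lang
header = (1..max_auth_name).map { |i| "sn_auth_name#{i}" } + ['sn_target_lang']

output = CSV.generate do |csv|
  csv << header
  data.each do |names, lang|
    # Pad with nil (written as an empty cell) up to the widest row
    padded = names + [nil] * (max_auth_name - names.length)
    csv << [*padded, lang]
  end
end

puts output
```

Because the padded array is built fresh for each row, the collected data is left unmodified, which may be convenient if the same rows are reused later.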