Search code examples
ruby-on-railsrubyhtml-parsingnokogiri

How to parse a HTML table with Nokogiri?


I'm trying to parse a table but I don't know how to save the data from it. I want to save the data in each row row to look like:

['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]

The sample table is:

html = <<EOT
    <table class="open">
        <tr>
            <th>Table name</th>
            <th>Column name 1</th>
            <th>Column name 2</th>
            <th>Column name 3</th>
            <th>Column name 4</th>
            <th>Column name 5</th>
        </tr>
        <tr>
            <th>Raw name 1</th>
            <td>2,094</td>
            <td>0,017</td>
            <td>0,098</td>
            <td>0,113</td>
            <td>0,452</td>         
        </tr>
        .
        .
        .
        <tr>
            <th>Raw name 5</th>
            <td>2,094</td>
            <td>0,017</td>
            <td>0,098</td>
            <td>0,113</td>
            <td>0,452</td>         
        </tr>
    </table>
EOT

My scraper's code is:

  doc = Nokogiri::HTML(open(html), nil, 'UTF-8')
  tables = doc.css('div.open')

  @tablesArray = []

  tables.each do |table|
    title = table.css('tr[1] > th').text
    cell_data = table.css('tr > td').text
    raw_name = table.css('tr > th').text
    @tablesArray << Table.new(cell_data, raw_name)
  end

  render template: 'scrape_krasecology'
  end
  end

When I try to display the data in the HTML page it looks like all the column names are stored in one array's element and all the data the same way.


Solution

  • The key of the problem is that calling #text on multiple results will return the concatenation of the #text of each individual element.

    Lets examine what each step does:

    # Finds all <table>s with class open
    # I'm assuming you have only one <table> so
    #  you don't actually have to loop through
    #  all tables, instead you can just operate
    #  on the first one. If that is not the case,
    #  you can use a loop the way you did
    tables = doc.css('table.open')
    
    # The text of all <th>s in <tr> one in the table
    title = table.css('tr[1] > th').text
    
    # The text of all <td>s in all <tr>s in the table
    # You obviously wanted just the <td>s in one <tr>
    cell_data = table.css('tr > td').text
    
    # The text of all <th>s in all <tr>s in the table
    # You obviously wanted just the <th>s in one <tr>
    raw_name = table.css('tr > th').text
    

    Now that we know what is wrong, here is a possible solution:

    html = <<EOT
        <table class="open">
            <tr>
                <th>Table name</th>
                <th>Column name 1</th>
                <th>Column name 2</th>
                <th>Column name 3</th>
                <th>Column name 4</th>
                <th>Column name 5</th>
            </tr>
            <tr>
                <th>Raw name 1</th>
                <td>1001</td>
                <td>1002</td>
                <td>1003</td>
                <td>1004</td>
                <td>1005</td>         
            </tr>
            <tr>
                <th>Raw name 2</th>
                <td>2001</td>
                <td>2002</td>
                <td>2003</td>
                <td>2004</td>
                <td>2005</td>         
            </tr>
            <tr>
                <th>Raw name 3</th>
                <td>3001</td>
                <td>3002</td>
                <td>3003</td>
                <td>3004</td>
                <td>3005</td>         
            </tr>
        </table>
    EOT
    

    doc = Nokogiri::HTML(html, nil, 'UTF-8')
    
    # Fetches only the first <table>. If you have
    #  more than one, you can loop the way you
    #  originally did.
    table = doc.css('table.open').first
    
    # Fetches all rows (<tr>s)
    rows = table.css('tr')
    
    # The column names are the first row (shift returns
    #  the first element and removes it from the array).
    # On that row we get the text of each individual <th>
    # This will be Table name, Column name 1, Column name 2...
    column_names = rows.shift.css('th').map(&:text)
    
    # On each of the remaining rows
    text_all_rows = rows.map do |row|
    
      # We get the name (<th>)
      # On the first row this will be Raw name 1
      #  on the second - Raw name 2, etc.
      row_name = row.css('th').text
    
      # We get the text of each individual value (<td>)
      # On the first row this will be 1001, 1002, 1003...
      #  on the second - 2001, 2002, 2003... etc
      row_values = row.css('td').map(&:text)
    
      # We map the name, followed by all the values
      [row_name, *row_values]
    end
    
    p column_names  # => ["Table name", "Column name 1", "Column name 2",
                    #     "Column name 3", "Column name 4", "Column name 5"]
    p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
                    #     ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
                    #     ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]
    
    # If you want to combine them
    text_all_rows.each do |row_as_text|
      p column_names.zip(row_as_text).to_h
    end # =>
        # {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
        # {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
        # {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}