
Parse complicated HTML table to 2D array with Java 8


I need to parse an HTML table. Some TD tags use both rowspan and colspan.

I followed the answer to "Extract data from complex HTML tables to 2d array in Java", but it throws an IndexOutOfBoundsException. The failing line in the linked answer looks like this: row.add(idx, rowspan - 1 == 0 ? (Element) td.removeAttr("rowspan") : td.attr("rowspan", String.valueOf(rowspan - 1)));

Any idea how to parse a table that uses both rowspan and colspan into a 2D array?

<table border="1">
  <tr>
    <td rowspan="3" colspan="2">head1</td>
    <td rowspan="2" colspan="2">head2</td>
    <td colspan="6">head3</td>
  </tr>
  <tr>
    <td colspan="2">head3-1</td>
    <td colspan="2">head3-2</td>
    <td colspan="2">head3-3</td>
  </tr>
  <tr>
    <td>sub_header1</td>
    <td>sub_header2</td>
    <td>sub_header3</td>
    <td>sub_header4</td>
    <td>sub_header5</td>
    <td>sub_header6</td>
    <td>sub_header7</td>
    <td>sub_header8</td>
  </tr>
  <tr>
    <td>content1</td>    
    <td>content2</td>    
    <td>content3</td>    
    <td>content4</td>    
    <td>content5</td>    
    <td>content6</td>    
    <td>content7</td>    
    <td>content8</td>    
    <td>content9</td>    
    <td>content10</td>
  </tr>
</table>


Solution

  • The IndexOutOfBoundsException means the row list can be shorter than the index at which you are trying to insert the modified td element.

    Rows can differ in length (a colspan cell stands in for several cells, and not every row carries rowspan cells), so idx may be larger than the list's current size.

    You can check the size and pad the list first:

    // Pad with null (or a placeholder element) until idx is a valid insertion point.
    while (row.size() < idx) {
      row.add(null);
    }
    // Insert with add(idx, ...) rather than set(idx, ...): set() would overwrite a cell
    // that is already in the row instead of shifting it to the right.
    // The (Element) cast is kept from the question's code.
    row.add(idx, rowspan - 1 == 0 ? (Element) td.removeAttr("rowspan") : td.attr("rowspan", String.valueOf(rowspan - 1)));
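
If padding one row at a time gets fiddly, another option is to fill a grid directly: for each cell, skip the slots already claimed by a rowspan from above, then write the cell text into every slot its rowspan and colspan cover. The sketch below is only an illustration of that idea, not the linked answer's method. It uses no jsoup so that it runs on its own; SpanGrid, Cell, toGrid and plain are hypothetical names, and Cell stands in for the values you would read from each td (its text and its rowspan/colspan attributes, defaulting to 1). It assumes spans are at least 1 and sticks to Java 8 syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpanGrid {

    /** Hypothetical stand-in for one <td>: its text plus its span attributes. */
    static final class Cell {
        final String text;
        final int rowspan;
        final int colspan;

        Cell(String text, int rowspan, int colspan) {
            this.text = text;
            this.rowspan = rowspan;
            this.colspan = colspan;
        }
    }

    /** Expands spanned cells so every grid slot they cover holds the cell text. */
    static String[][] toGrid(List<List<Cell>> rows) {
        List<List<String>> grid = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            grid.add(new ArrayList<>());
        }
        int width = 0;
        for (int r = 0; r < rows.size(); r++) {
            int c = 0;
            for (Cell cell : rows.get(r)) {
                // Skip slots already claimed by a rowspan from an earlier row.
                while (c < grid.get(r).size() && grid.get(r).get(c) != null) {
                    c++;
                }
                // Clamp the rowspan so it never runs past the last row.
                int endRow = Math.min(r + cell.rowspan, rows.size());
                for (int rr = r; rr < endRow; rr++) {
                    List<String> target = grid.get(rr);
                    // Pad before set(): this is the guard against IndexOutOfBoundsException.
                    while (target.size() < c + cell.colspan) {
                        target.add(null);
                    }
                    for (int cc = c; cc < c + cell.colspan; cc++) {
                        target.set(cc, cell.text);
                    }
                }
                c += cell.colspan;
                width = Math.max(width, c);
            }
        }
        // Copy into a rectangular array; slots no cell covers stay null.
        String[][] result = new String[rows.size()][width];
        for (int r = 0; r < rows.size(); r++) {
            for (int c = 0; c < grid.get(r).size(); c++) {
                result[r][c] = grid.get(r).get(c);
            }
        }
        return result;
    }

    /** Builds n unspanned cells named prefix1..prefixN. */
    static List<Cell> plain(String prefix, int n) {
        List<Cell> cells = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            cells.add(new Cell(prefix + i, 1, 1));
        }
        return cells;
    }

    public static void main(String[] args) {
        // Same layout as the table in the question.
        List<List<Cell>> table = Arrays.asList(
            Arrays.asList(new Cell("head1", 3, 2), new Cell("head2", 2, 2), new Cell("head3", 1, 6)),
            Arrays.asList(new Cell("head3-1", 1, 2), new Cell("head3-2", 1, 2), new Cell("head3-3", 1, 2)),
            plain("sub_header", 8),
            plain("content", 10));
        for (String[] line : toGrid(table)) {
            System.out.println(String.join(" | ", line));
        }
    }
}
```

With jsoup you would build each Cell from a td, reading the two span attributes and falling back to 1 when one is missing. Repeating the text in every covered slot is a design choice; if you would rather mark continuation slots, write a placeholder there instead of cell.text.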