
Using Java jsoup to parse an HTML page and store data


I am trying to use the jsoup library to parse an HTML file and get all the data from the table with class="scl_list" shown below, which is only a small part of the HTML page.

<table class="scl_list">
        <tr>
            <th align="center">Id:</th>
            <th align="center">Name:</th>
            <th align="center">Serial:</th>
            <th align="center">Status:</th>
            <th align="center">Ladestrom:</th>
            <th align="center">Z&auml;hleradresse:</th>
            <th align="center">Z&auml;hlerstand:</th>
        </tr>
        <tr>
            <th align="center">7</th>
            <th align="center">7</th>
            <th align="center">c3001c0020333347156a66</th>
            <th align="center">Idle</th>
            <th align="center">16.0</th>
            <th align="center">40100021</th>
            <th align="center">12464.25</th>
        </tr>
        <tr>
            <th align="center">21</th>
            <th align="center">21</th>
            <th align="center">c3002a003c343551086869</th>
            <th align="center">Idle</th>
            <th align="center">16.0</th>
            <th align="center">540100371</th>
            <th align="center">1219.73</th>
        </tr>
    </table>

For every <tr>, I then need to get every <th> and save the data in a table or vector. Unfortunately, I can't find many jsoup examples that do something similar.

So far I have this, where html_string is my HTML page, but I'm not sure how to proceed. Any help is much appreciated:

Document doc = Jsoup.parse(html_string);
Elements els = doc.getElementsContainingText("table class=\"scl_list\"");

Solution

  • jsoup is a simple and intuitive library, and there are many examples online of how to read HTML tables. Look at the jsoup cookbook in the documentation, especially the section on selector syntax. To get back to your question, an easy way would be the following:

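    // Imports for the snippets in this answer (place them at the top of your class file)
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public static void main(String[] args) {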
        String html =   "<table class=\"scl_list\">\n" +
                        "        <tr>\n" +
                        "            <th align=\"center\">Id:</th>\n" +
                        "            <th align=\"center\">Name:</th>\n" +
                        "            <th align=\"center\">Serial:</th>\n" +
                        "            <th align=\"center\">Status:</th>\n" +
                        "            <th align=\"center\">Ladestrom:</th>\n" +
                        "            <th align=\"center\">Z&auml;hleradresse:</th>\n" +
                        "            <th align=\"center\">Z&auml;hlerstand:</th>\n" +
                        "        </tr>\n" +
                        "        <tr>\n" +
                        "            <th align=\"center\">7</th>\n" +
                        "            <th align=\"center\">7</th>\n" +
                        "            <th align=\"center\">c3001c0020333347156a66</th>\n" +
                        "            <th align=\"center\">Idle</th>\n" +
                        "            <th align=\"center\">16.0</th>\n" +
                        "            <th align=\"center\">40100021</th>\n" +
                        "            <th align=\"center\">12464.25</th>\n" +
                        "        </tr>\n" +
                        "        <tr>\n" +
                        "            <th align=\"center\">21</th>\n" +
                        "            <th align=\"center\">21</th>\n" +
                        "            <th align=\"center\">c3002a003c343551086869</th>\n" +
                        "            <th align=\"center\">Idle</th>\n" +
                        "            <th align=\"center\">16.0</th>\n" +
                        "            <th align=\"center\">540100371</th>\n" +
                        "            <th align=\"center\">1219.73</th>\n" +
                        "        </tr>\n" +
                        "    </table>";
        Document doc = Jsoup.parse(html);
        // Select every row of the table with class "scl_list"
        Elements trs = doc.select("table.scl_list tr");
        List<List<String>> data = new ArrayList<>();
        for (Element tr : trs) {
            // Collect the text of each cell; this table uses <th> for its data cells too
            List<String> row = tr.select("th").stream().map(Element::text)
                                    .collect(Collectors.toList());
            data.add(row);
        }
        data.forEach(System.out::println);
    }
    

    The output should be something like:

    [Id:, Name:, Serial:, Status:, Ladestrom:, Zähleradresse:, Zählerstand:]
    [7, 7, c3001c0020333347156a66, Idle, 16.0, 40100021, 12464.25]
    [21, 21, c3002a003c343551086869, Idle, 16.0, 540100371, 1219.73]
    

    Since the first element seems to contain only the table heading, you can skip it by using a simple for loop and starting from the second element.
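
    If the rows have already been collected into a list, another way to drop the heading row is to skip the first entry of that list. A minimal sketch, assuming the rows are held in a List<List<String>> like data above; the names HeaderSkip and withoutHeader are made up for illustration:

```java
import java.util.List;
import java.util.stream.Collectors;

class HeaderSkip {
    // Returns every row except the first, which is assumed to be the heading row.
    // An empty input simply yields an empty result.
    static List<List<String>> withoutHeader(List<List<String>> rows) {
        return rows.stream().skip(1).collect(Collectors.toList());
    }
}
```

    Calling HeaderSkip.withoutHeader(data) would then leave only the rows that hold meter values.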

    Assuming that your data represents electricity meters, I would recommend implementing a small class as a data container, which could look like this:

    class Meter {
        int id;
        String name;
        String serial;
        String status;
        double chargingCurrent;
        String address;
        double meterReading;

        // Expects the seven cell values of one table row, in column order
        public Meter(List<String> data) {
            this.id = Integer.parseInt(data.get(0));
            this.name = data.get(1);
            this.serial = data.get(2);
            this.status = data.get(3);
            this.chargingCurrent = Double.parseDouble(data.get(4));
            this.address = data.get(5);
            this.meterReading = Double.parseDouble(data.get(6));
        }
        // getters & setters
    }
    

    The code above could then be rewritten to something like:

    Document doc = Jsoup.parse(html);
    Elements trs = doc.select("table.scl_list tr");
    List<Meter> meters = new ArrayList<>();
    // Start at index 1 to skip the heading row
    for (int i = 1; i < trs.size(); i++) {
        List<String> row = trs.get(i).select("th").stream().map(Element::text)
                                .collect(Collectors.toList());
        meters.add(new Meter(row));
    }
    meters.forEach(System.out::println);
    

    With a corresponding toString method, the output will look like:

    Meter{id=7, name=7, serial=c3001c0020333347156a66, status=Idle, chargingCurrent=16.0, address=40100021, meterReading=12464.25}
    Meter{id=21, name=21, serial=c3002a003c343551086869, status=Idle, chargingCurrent=16.0, address=540100371, meterReading=1219.73}
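
    The toString method itself is not shown in the answer. A minimal sketch that would produce this format, assuming the Meter fields and constructor from above (the exact string layout is inferred from the sample output, not taken from the original code):

```java
import java.util.List;

class Meter {
    int id;
    String name;
    String serial;
    String status;
    double chargingCurrent;
    String address;
    double meterReading;

    // Expects the seven cell values of one table row, in column order
    Meter(List<String> data) {
        this.id = Integer.parseInt(data.get(0));
        this.name = data.get(1);
        this.serial = data.get(2);
        this.status = data.get(3);
        this.chargingCurrent = Double.parseDouble(data.get(4));
        this.address = data.get(5);
        this.meterReading = Double.parseDouble(data.get(6));
    }

    // Assumed layout: Meter{field=value, ...}, mirroring the sample output
    @Override
    public String toString() {
        return "Meter{id=" + id
                + ", name=" + name
                + ", serial=" + serial
                + ", status=" + status
                + ", chargingCurrent=" + chargingCurrent
                + ", address=" + address
                + ", meterReading=" + meterReading + "}";
    }
}
```

    Note that Integer.parseInt and Double.parseDouble throw a NumberFormatException on cells that are not numeric, so this sketch only works for rows shaped like the ones in the question.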