android web-scraping jsoup

Extracting info correctly from a table with JSoup in Android


I am trying to extract some info from an HTML table and put it into, for example, an arraylist = new ArrayList<HashMap<String, String>>(); so that it is easier to manage inside my app.

I was already able to get the right HTML page saved in my document variable after a POST request. The following is the piece of HTML containing the data I need, but it is not the only table on the page. I do not know how to find the items in this specific table.

What would be the right approach to get the data in this format: DAY - TIME - SUGGESTION?

Thank you very much in advance for any advice!

<table><tbody>
<tr><th class="date">Wed, 14 Sep 2016</th><th></th><th></th></tr>
<tr><td>&nbsp;</td><td class="sub">09:00</td><td class="sugg">Depart and set your watch to the arrival city&#39;s time zone (03:00). Sleep as needed. The following times are in the arrival city&#39;s time zone.</td></tr>
<tr><td>&nbsp;</td><td class="sub">18:30</td><td class="sugg">Arrive</td></tr>
<tr><td>&nbsp;</td><td class="sub">19:00&ndash;22:00</td><td class="sugg">Seek light</td></tr>
<tr><td>&nbsp;</td><td class="sub">22:00&ndash;23:00</td><td class="sugg">Avoid light before bed</td></tr>
<tr><td>&nbsp;</td><td class="sub">23:00&ndash;07:00</td><td class="sugg">Sleep ideal</td></tr>
<tr><th class="date">Thu, 15 Sep 2016</th><th></th><th></th></tr>
<tr><td>&nbsp;</td><td class="sub">20:00&ndash;23:00</td><td class="sugg">Seek light before bed</td></tr>
<tr><td>&nbsp;</td><td class="sub">23:00&ndash;07:00</td><td class="sugg">Sleep ideal</td></tr>
<tr><th class="date">Fri, 16 Sep 2016</th><th></th><th></th></tr>
<tr><td>&nbsp;</td><td class="sub">20:00&ndash;23:00</td><td class="sugg">Seek light before bed</td></tr>
<tr><td>&nbsp;</td><td class="sub">23:00&ndash;07:00</td><td class="sugg">Sleep ideal</td></tr>
</tbody></table>

EDIT

I think a loop is the way I want to implement this, and I am getting closer to the solution. I need a way to detect whether the row I am currently inspecting in the loop has th or td cells:

//find the table, it is the second table in the HTML
Element table = document.select("tbody").get(1);

//get all the rows
Elements rows = table.select("tr");

//loop the rows
for (Element row : rows) {

    //if the row contains th, I get the first cell and save day in a string

    //if the row contains td, I get the second (time) and third (suggestion) cells and put in my map string with day, time, suggestion

}

Solution

  • Well, I figured out a solution. It may not be the most elegantly coded, but it works :) (Engineer: "If it works, it is good")

    I have a moderate knowledge of coding in some languages, but this was the first time I had to deal with parsing, and consequently with JSoup. It is not an immediately intuitive tool, but in my research I noticed it is very powerful. I have put it on my personal to-learn list.

    Note: this approach assumes there is always a th row before any td row.

    This is my solution:

            String day = null;
            String time;
            String sugg;
    
            //crop the page down to the table I need; since it has no specific id, I select it as the second tbody in the page
            Element table = document.select("tbody").get(1);
    
            //this is the list of all the rows in the table
            Elements rows = table.select("tr");
    
            //here I loop over the rows
            for (Element row : rows) {
    
                //if the row contains th elements, I store the first th of the row as day
                if (!row.select("th").isEmpty())
                {
                    day = row.select("th").get(0).text();
                }
    
                //if the row contains at least three td elements, I store the second and third td in strings and put everything in a map
                Elements cells = row.select("td");
                if (cells.size() >= 3)
                {
                    time = cells.get(1).text();
                    sugg = cells.get(2).text();
    
                    Log.d("row: ", day + " " + time + " " + sugg);
    
                    //create and add the map only for data rows, so header rows do not add empty maps to the list
                    HashMap<String, String> map = new HashMap<String, String>();
                    map.put("day", day);
                    map.put("time", time);
                    map.put("sugg", sugg);
                    arraylist.add(map);
                }
            }
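
A possible variation, offered only as a sketch: instead of relying on the table being the second tbody in the page, the class names visible in the HTML above (date, sub, sugg) can be used to find the table and its cells. The class name ScheduleParser and the method parse are hypothetical names chosen for illustration. The sketch assumes that the schedule is the only table in the page containing a th with class "date", and that the JSoup version in use provides selectFirst.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of JSoup.
public class ScheduleParser {

    // Returns one map per data row, each with "day", "time" and "sugg" keys.
    public static List<Map<String, String>> parse(Document document) {
        List<Map<String, String>> result = new ArrayList<>();

        // Assumption: only the schedule table contains a th with class "date".
        Element table = document.selectFirst("table:has(th.date)");
        if (table == null) {
            return result; // table not found, nothing to extract
        }

        String day = null;
        for (Element row : table.select("tr")) {
            // Header row: remember the day for the data rows that follow.
            Element dateCell = row.selectFirst("th.date");
            if (dateCell != null) {
                day = dateCell.text();
                continue;
            }

            // Data row: pick the cells by class instead of by position.
            Element timeCell = row.selectFirst("td.sub");
            Element suggCell = row.selectFirst("td.sugg");
            if (timeCell == null || suggCell == null) {
                continue; // skip rows that do not match the expected layout
            }

            Map<String, String> entry = new HashMap<>();
            entry.put("day", day);
            entry.put("time", timeCell.text());
            entry.put("sugg", suggCell.text());
            result.add(entry);
        }
        return result;
    }
}
```

With the document variable from the question, this would be called as ScheduleParser.parse(document). Selecting by class keeps working if another table is added before this one, but it breaks if the site renames those classes, so neither approach is safe against every change to the page.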