Extracting data from HTML and formatting the output


Introduction

I am currently learning web scraping on my own, as a hobby project and a way to pick up new tricks.

So far I have been able to extract data from a website (after studying its structure a bit) with this code, which I wrote in Java using the Jsoup library.

    // Load the HTML file
    File inputFile = new File("test2.html");
    Document doc = Jsoup.parse(inputFile, "Unicode");

    // Grab the part we are working with (the page structure is known)
    Element content = doc.getElementById("mainContent");
    Elements tds = doc.select("[class=nowrap]");
    System.out.println(tds.text());
   

    (Note that I am working from an HTML file)

So far so good: I got this "desired" output.

 <td align="right" class="nowrap"> <a href="website" onclick="return 
 doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
 onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
 000]</td>
 <td align="right" class="nowrap">10 000</td>
 <td align="right" class="nowrap">20.48</td>
 <td align="right" class="nowrap">0.00</td>
 <td align="right" class="nowrap">$28.65</td>
 <td align="right" class="nowrap">0.00 %</td>
 <td align="right" class="nowrap">$894.69</td>
 <td align="right" class="nowrap">10.11</td>
 <td align="right" class="nowrap">0.21</td>
 <td align="right" class="nowrap"> <a href="website" onclick="return 
  doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
  onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
  000]</td>
  <td align="right" class="nowrap">10 000</td>
  <td align="right" class="nowrap">46.21</td>
  <td align="right" class="nowrap">0.00</td>
  <td align="right" class="nowrap">$53.82</td>
  <td align="right" class="nowrap">0.00 %</td>
  <td align="right" class="nowrap">$1 151.78</td>
  <td align="right" class="nowrap">8.01</td>
  <td align="right" class="nowrap">0.00</td>
  <td align="right" class="nowrap"> <a href="website" onclick="return 
 doWindow(this, 700, 500);" class="popup">0</a> </td>
 <td align="right" class="nowrap"><a href="website" 
  onclick="doWindow(this.href, '1024', '768'); return false;">5 000</a> [5 
  000]</td>
  <td align="right" class="nowrap">5 000</td>
  <td align="right" class="nowrap">22.51</td>
  <td align="right" class="nowrap">0.00</td>
  <td align="right" class="nowrap">$222.53</td>
  <td align="right" class="nowrap">0.00 %</td>
  <td align="right" class="nowrap">$2 399.92</td>
  <td align="right" class="nowrap">5.94</td>
  <td align="right" class="nowrap">0.01</td>

Problem

What I am really interested in is the text inside the cells (the numbers, which come back as Strings), so that I can do some math with them afterwards.

So I continued reading the Jsoup documentation and found that I could use .text() to strip the HTML markup, which gives one long string of the numbers from the HTML file:

0 10 000 [10 000] 10 000 20.48 0.00 $28.65 0.00 % $894.69 10.11 0.21 0 10 
000 [10 000] 10 000 46.21 0.00 $53.82 0.00 % $1 151.78 8.01 0.00 0 5 000 [5 
000] 5 000 22.51 0.00 $222.53 0.00 % $2 399.92 5.94 0.01

How do I separate it into 3 strings and make the numbers usable?

One approach might be a regex, as I saw in other questions, but I still can't get the desired result.
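As a rough illustration of the regex idea, the sketch below turns a single cell string into a BigDecimal so it can be used in arithmetic. The class name CellNumbers and the method parse are made up for this example. It assumes '.' is the decimal separator and that spaces only act as thousands separators, which is what the sample data suggests but is not guaranteed for other pages.

```java
import java.math.BigDecimal;

class CellNumbers {

    // Hypothetical helper: converts one cell string such as "$1 151.78"
    // or "0.00 %" into a BigDecimal.
    static BigDecimal parse(String cell) {
        // Cells like "10 000 [10 000]" repeat the value in brackets;
        // keep only the part before the first '['.
        int bracket = cell.indexOf('[');
        String head = bracket >= 0 ? cell.substring(0, bracket) : cell;

        // Drop everything except digits, '.' and '-'.
        String digits = head.replaceAll("[^0-9.-]", "");
        if (digits.isEmpty()) {
            throw new NumberFormatException("No number in cell: \"" + cell + "\"");
        }
        return new BigDecimal(digits);
    }

    public static void main(String[] args) {
        String[] cells = {"10 000 [10 000]", "20.48", "$28.65", "0.00 %", "$1 151.78"};
        BigDecimal sum = BigDecimal.ZERO;
        for (String cell : cells) {
            BigDecimal value = parse(cell);
            System.out.println(cell + " -> " + value);
            sum = sum.add(value);
        }
        System.out.println("sum = " + sum);
    }
}
```

BigDecimal is used here rather than double because several of the cells are currency amounts.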

EDIT: Some progress made

After some research I found a way to convert to text AND access the data I wanted:

tds.get(key).text(); 

where key is an int giving the position of the cell among the selected elements (the same order in which the values appear in the long string above).
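Once each cell can be read by index, splitting the data into 3 rows is a matter of grouping consecutive cells. The sketch below works on a plain list of cell texts (for instance collected by looping over tds.get(i).text()); RowChunker and chunk are hypothetical names, and the column count of 10 is an assumption read off the sample output, where each table row contributes ten nowrap cells.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowChunker {

    // Hypothetical helper: splits a flat list of cell texts into rows
    // of `columns` cells each. A shorter last row is kept as-is.
    static List<List<String>> chunk(List<String> cells, int columns) {
        if (columns <= 0) {
            throw new IllegalArgumentException("columns must be positive");
        }
        List<List<String>> rows = new ArrayList<>();
        for (int start = 0; start < cells.size(); start += columns) {
            int end = Math.min(start + columns, cells.size());
            rows.add(new ArrayList<>(cells.subList(start, end)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two of the three sample rows, ten cells each.
        List<String> cells = Arrays.asList(
                "0", "10 000 [10 000]", "10 000", "20.48", "0.00",
                "$28.65", "0.00 %", "$894.69", "10.11", "0.21",
                "0", "10 000 [10 000]", "10 000", "46.21", "0.00",
                "$53.82", "0.00 %", "$1 151.78", "8.01", "0.00");
        for (List<String> row : chunk(cells, 10)) {
            System.out.println(row);
        }
    }
}
```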

This solved part of my question, but there is still one attribute in the HTML that I am not able to get:

<td align="center">
        <input type="text" tabindex="2" name="productData[price]       
        [{33013477}]" size="10" value="3000.00">    
</td>

The value that I need is in the attribute value="3000.00"
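With Jsoup itself, an attribute is normally read by selecting the input element and calling attr("value") on it. As a dependency-free illustration of the same goal, the sketch below pulls the value attribute out of the first input tag in a string with a regex. InputValue and firstInputValue are hypothetical names, and the pattern assumes double-quoted attributes and no '>' inside attribute values, so it is not a general HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InputValue {

    // Matches an <input ...> tag and captures what is inside value="...".
    // Assumes double-quoted attributes and no '>' inside attribute values.
    private static final Pattern VALUE = Pattern.compile(
            "<input\\b[^>]*?\\bvalue\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the value attribute of the first
    // <input> tag found, or an empty Optional if there is none.
    static Optional<String> firstInputValue(String html) {
        Matcher m = VALUE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<td align=\"center\">\n"
                + "    <input type=\"text\" tabindex=\"2\" name=\"productData[price]\n"
                + "    [{33013477}]\" size=\"10\" value=\"3000.00\">\n"
                + "</td>";
        System.out.println(firstInputValue(html).orElse("(no value found)"));
    }
}
```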

Thanks for the interest in this question.


Solution

  • To scrape data from HTML source I use a small method I wrote named getBetween(). The data I want always seems to sit between two known strings of some sort:

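    // Imports needed by getBetween():
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**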
     * Retrieves any string data located between the supplied string leftString
     * parameter and the supplied string rightString parameter.
     * <p>
     * This method will return all instances of a substring located between the
     * supplied Left String and the supplied Right String which may be found
     * within the supplied Input String.<br>
     *
     * @param inputString (String) The string to look for substring(s) in.
     *
     * @param leftString  (String) What may be to the Left side of the substring
     *                    we want within the main input string. Sometimes the
     *                    substring you want may be contained at the very
     *                    beginning of a string and therefore there is no
     *                    Left-String available. In this case you would simply
     *                    pass an empty string ("") to this parameter which
     *                    basically informs the method of this fact. Null can
     *                    not be supplied and will ultimately generate a
     *                    NullPointerException.
     *
     * @param rightString (String) What may be to the Right side of the
     *                    substring we want within the main input string.
     *                    Sometimes the substring you want may be contained at
     *                    the very end of a string and therefore there is no
     *                    Right-String available. In this case you would simply
     *                    pass an empty string ("") to this parameter which
     *                    basically informs the method of this fact. Null can
     *                    not be supplied and will ultimately generate a
     *                    NullPointerException.
     *
     * @param options     (Optional - Boolean - 2 Parameters):<pre>
     *
     *      ignoreLetterCase    - Default is false. This option works against the
     *                            string supplied within the leftString parameter
     *                            and the string supplied within the rightString
     *                            parameter. If set to true then letter case is
     *                            ignored when searching for strings supplied in
     *                            these two parameters. If left at default false
     *                            then letter case is not ignored.
     *
     *      trimFound           - Default is true. By default this method will trim
     *                            off leading and trailing white-spaces from found
     *                            sub-string items. General sentences which obviously
     *                            contain spaces will almost always give you a white-
     *                            space within an extracted sub-string. By setting
     *                            this parameter to false, leading and trailing white-
     *                            spaces are not trimmed off before they are placed
     *                            into the returned Array.</pre>
     *
     * @return (1D String Array) Returns a Single Dimensional String Array
     *         containing all the sub-strings found within the supplied Input
     *         String which are between the supplied Left String and supplied
     *         Right String. You can shorten this method up a little by
     *         returning a List&lt;String&gt; ArrayList and removing the 'List
     *         to 1D Array' conversion code at the end of this method. This
     *         method initially stores its findings within a List object
     *         anyways.
     */
    public String[] getBetween(String inputString, String leftString, 
                        String rightString, boolean... options) {
        // Return nothing if nothing was supplied.
        if (inputString.equals("") || (leftString.equals("") && rightString.equals(""))) {
            return null;
        }
    
        // Prepare optional parameters if any supplied.
        // If none supplied then use Defaults...
        boolean ignoreCase = false; // Default.
        boolean trimFound = true;   // Default.
        if (options.length >= 1) {
            ignoreCase = options[0];
        }
        if (options.length >= 2) {
            trimFound = options[1];
        }
    
        // Remove any ASCII control characters from the
        // supplied string (if they exist).
        String modString = inputString.replaceAll("\\p{Cntrl}", "");
    
        // Establish a List String Array Object to hold
        // our found substrings between the supplied Left
        // String and supplied Right String.
        List<String> list = new ArrayList<>();
    
        // Use Pattern Matching to locate our possible
        // substrings within the supplied Input String.
        String regEx = Pattern.quote(leftString)
                + (!rightString.equals("") ? "(.*?)" : "(.*)?")
                + Pattern.quote(rightString);
        if (ignoreCase) {
            regEx = "(?i)" + regEx;
        }
        Pattern pattern = Pattern.compile(regEx);
        Matcher matcher = pattern.matcher(modString);
        while (matcher.find()) {
            // Add the found substrings into the List.
            String found = matcher.group(1);
            if (trimFound) {
                found = found.trim();
            }
            list.add(found);
        }
    
        // Return null if nothing was found, otherwise
        // convert the List to a 1D String Array.
        if (list.isEmpty()) {
            return null;
        }
        return list.toArray(new String[0]);
    }
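The core of the method above is the regex built from the two quoted delimiters with a reluctant group between them. The condensed sketch below isolates just that step so it can be tried on its own. BetweenDemo and between are names invented here, and unlike getBetween() this version returns a List, never returns null, always trims, and requires a non-empty right delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenDemo {

    // Hypothetical condensed variant of getBetween(): collects every
    // substring found between `left` and `right`, trimmed.
    static List<String> between(String input, String left, String right) {
        if (right.isEmpty()) {
            throw new IllegalArgumentException("right delimiter must not be empty");
        }
        // Pattern.quote() makes the delimiters literal; "(.*?)" is
        // reluctant, so each match stops at the nearest right delimiter.
        Pattern pattern = Pattern.compile(
                Pattern.quote(left) + "(.*?)" + Pattern.quote(right));
        Matcher matcher = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String open = "<td align=\"right\" class=\"nowrap\">";
        String row = open + "20.48</td>" + open + "$28.65</td>";
        System.out.println(between(row, open, "</td>")); // prints [20.48, $28.65]
    }
}
```

Returning an empty list instead of null means callers can index or iterate without a null check, at the cost of differing from the behavior of the method above.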
    

    Getting the web-page HTML source is the easy part. To get the needed numerical values from the "desired output" you initially posted (shown below)

    HTML Source:

     <td align="right" class="nowrap"> <a href="website" onclick="return 
     doWindow(this, 700, 500);" class="popup">0</a> </td>
     <td align="right" class="nowrap"><a href="website" 
     onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
     000]</td>
     <td align="right" class="nowrap">10 000</td>
     <td align="right" class="nowrap">20.48</td>
     <td align="right" class="nowrap">0.00</td>
     <td align="right" class="nowrap">$28.65</td>
     <td align="right" class="nowrap">0.00 %</td>
     <td align="right" class="nowrap">$894.69</td>
     <td align="right" class="nowrap">10.11</td>
     <td align="right" class="nowrap">0.21</td>
     <td align="right" class="nowrap"> <a href="website" onclick="return 
      doWindow(this, 700, 500);" class="popup">0</a> </td>
     <td align="right" class="nowrap"><a href="website" 
     onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10 
     000]</td>
     <td align="right" class="nowrap">10 000</td>
     <td align="right" class="nowrap">46.21</td>
     <td align="right" class="nowrap">0.00</td>
     <td align="right" class="nowrap">$53.82</td>
     <td align="right" class="nowrap">0.00 %</td>
     <td align="right" class="nowrap">$1 151.78</td>
     <td align="right" class="nowrap">8.01</td>
     <td align="right" class="nowrap">0.00</td>
     <td align="right" class="nowrap"> <a href="website" onclick="return 
     doWindow(this, 700, 500);" class="popup">0</a> </td>
     <td align="right" class="nowrap"><a href="website" 
     onclick="doWindow(this.href, '1024', '768'); return false;">5 000</a> [5 
     000]</td>
     <td align="right" class="nowrap">5 000</td>
     <td align="right" class="nowrap">22.51</td>
     <td align="right" class="nowrap">0.00</td>
     <td align="right" class="nowrap">$222.53</td>
     <td align="right" class="nowrap">0.00 %</td>
     <td align="right" class="nowrap">$2 399.92</td>
     <td align="right" class="nowrap">5.94</td>
     <td align="right" class="nowrap">0.01</td>
     <td align="right" class="nowrap"> <a href="website" onclick="return
     <td align="center">
         <input type="text" tabindex="2" name="productData[price]       
         [{33013477}]" size="10" value="3000.00">    
     </td> 
    

    I would use the getBetween() method something like this:

    // Let's assume the "desired output" you acquired 
    // is contained within a Text file named "HtmlData.txt".
    
    // Hold our scraped data in a 2D List interface.
    List<List<String>> list = new ArrayList<>();
    
    // Read File using BufferedReader in a Try With Resources block...
    try (BufferedReader reader = new BufferedReader(new FileReader("HtmlData.txt"))) {
        String line;
        List<String> numbers = null;
        while ((line = reader.readLine()) != null) {
            numbers = new ArrayList<>();
            line = line.trim();
            if (line.equals("")) {
                continue;
            }
            if (line.startsWith("onclick=\"doWindow(this.href,")) {
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.endsWith("return")) {
                        list.add(numbers);
                        break;
                    }
                    if (line.equals("")) {
                        continue;
                    }
                    if (line.startsWith("<td align=\"right\" class=\"nowrap\">")) {
                        numbers.add(getBetween(line, "<td align=\"right\" class=\"nowrap\">", "</td>", true, true)[0]);
                    }
                }
            }
            // line is null here if the inner loop above ran to the end of the file.
            if (line != null && line.contains("name=\"productData[price]")) {
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.equals("")) {
                        continue;
                    }
                    if (line.startsWith("[{33013477}]")) {
                        numbers.add("Product Price: " + getBetween(line, "value=\"", "\">", true, true)[0]);
                        list.add(numbers);
                        break;  // DONE
                    }
                }
            }
        }
        if (numbers != null && !numbers.isEmpty()) {
            list.add(numbers);
        }
    }
    catch (IOException ex) {
        ex.printStackTrace();
    }
    
    // Display our findings to the Console Window in a 
    // table style format:
    for (int i = 0; i < list.size(); i++) {
        for (int j = 0; j < list.get(i).size(); j++) {
            System.out.printf("%-10s ", list.get(i).get(j));
        }
        System.out.println("");
    }
    

    In case you didn't notice, the other portion you desire from the lines:

    <td align="center">
        <input type="text" tabindex="2" name="productData[price]       
        [{33013477}]" size="10" value="3000.00">    
    </td>
    

    was also contained within the file data. When the code is run you will see the following displayed within the Console Window:

    10 000     20.48      0.00       $28.65     0.00 %     $894.69    10.11      0.21       
    10 000     46.21      0.00       $53.82     0.00 %     $1 151.78  8.01       0.00       
    5 000      22.51      0.00       $222.53    0.00 %     $2 399.92  5.94       0.01       
    Product Price: 3000.00
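The fixed "%-10s" width works for this sample, but a cell longer than ten characters pushes the following columns out of line. One possible alternative, sketched below with the invented names TablePrinter, format and padLeft, is to measure the widest cell in each column first and right-align the values, which tends to read better for numbers. A long row such as the "Product Price" entry would widen the first column, so it may be better printed separately.

```java
import java.util.Arrays;
import java.util.List;

class TablePrinter {

    // Pads `s` with spaces on the left until it is `width` characters long.
    static String padLeft(String s, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append(' ');
        }
        return sb.append(s).toString();
    }

    // Hypothetical helper: right-aligns every column to its widest cell,
    // with two spaces between columns and '\n' after each row.
    static String format(List<List<String>> rows) {
        int columns = 0;
        for (List<String> row : rows) {
            columns = Math.max(columns, row.size());
        }
        int[] widths = new int[columns];
        for (List<String> row : rows) {
            for (int c = 0; c < row.size(); c++) {
                widths[c] = Math.max(widths[c], row.get(c).length());
            }
        }
        StringBuilder out = new StringBuilder();
        for (List<String> row : rows) {
            for (int c = 0; c < row.size(); c++) {
                if (c > 0) {
                    out.append("  ");
                }
                out.append(padLeft(row.get(c), widths[c]));
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("10 000", "20.48", "$28.65", "$894.69"),
                Arrays.asList("10 000", "46.21", "$53.82", "$1 151.78"),
                Arrays.asList("5 000", "22.51", "$222.53", "$2 399.92"));
        System.out.print(format(rows));
    }
}
```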