Introduction
I am currently teaching myself web scraping as a personal project, both to pick up new skills and purely as a hobby.
So far I have been able to extract data from a website (after studying its structure a bit) using this code, which I wrote in Java with the Jsoup library.
//To input the html file
File inputFile = new File("test2.html");
Document doc = Jsoup.parse(inputFile, "Unicode");
//To grab the part we are working with (knowing the website for sure)
Element content = doc.getElementById("mainContent");
Elements tds = doc.select("[class=nowrap]");
System.out.println(tds);
(Note that I am working from an HTML file.)
So far so good: I got this "desired" output:
<td align="right" class="nowrap"> <a href="website" onclick="return
doWindow(this, 700, 500);" class="popup">0</a> </td>
<td align="right" class="nowrap"><a href="website"
onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10
000]</td>
<td align="right" class="nowrap">10 000</td>
<td align="right" class="nowrap">20.48</td>
<td align="right" class="nowrap">0.00</td>
<td align="right" class="nowrap">$28.65</td>
<td align="right" class="nowrap">0.00 %</td>
<td align="right" class="nowrap">$894.69</td>
<td align="right" class="nowrap">10.11</td>
<td align="right" class="nowrap">0.21</td>
<td align="right" class="nowrap"> <a href="website" onclick="return
doWindow(this, 700, 500);" class="popup">0</a> </td>
<td align="right" class="nowrap"><a href="website"
onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10
000]</td>
<td align="right" class="nowrap">10 000</td>
<td align="right" class="nowrap">46.21</td>
<td align="right" class="nowrap">0.00</td>
<td align="right" class="nowrap">$53.82</td>
<td align="right" class="nowrap">0.00 %</td>
<td align="right" class="nowrap">$1 151.78</td>
<td align="right" class="nowrap">8.01</td>
<td align="right" class="nowrap">0.00</td>
<td align="right" class="nowrap"> <a href="website" onclick="return
doWindow(this, 700, 500);" class="popup">0</a> </td>
<td align="right" class="nowrap"><a href="website"
onclick="doWindow(this.href, '1024', '768'); return false;">5 000</a> [5
000]</td>
<td align="right" class="nowrap">5 000</td>
<td align="right" class="nowrap">22.51</td>
<td align="right" class="nowrap">0.00</td>
<td align="right" class="nowrap">$222.53</td>
<td align="right" class="nowrap">0.00 %</td>
<td align="right" class="nowrap">$2 399.92</td>
<td align="right" class="nowrap">5.94</td>
<td align="right" class="nowrap">0.01</td>
Problem
I am mainly interested in the text these cells contain (the numbers, to be exact, as Strings) so that I can do some math with them afterwards.
So I kept reading the Jsoup documentation and found out that I could use .text()
to strip the HTML, which gives one long string of the numbers from the HTML file:
0 10 000 [10 000] 10 000 20.48 0.00 $28.65 0.00 % $894.69 10.11 0.21 0 10
000 [10 000] 10 000 46.21 0.00 $53.82 0.00 % $1 151.78 8.01 0.00 0 5 000 [5
000] 5 000 22.51 0.00 $222.53 0.00 % $2 399.92 5.94 0.01
How do I separate it into 3 strings so that I can use the numbers?
One approach might be regex, as I saw in other questions, but I still can't get the desired result.
EDIT: Some progress made
After some research I found a way to convert to text AND access the data I wanted:
tds.get(key).text();
where key is an int index giving the position of the value among the selected elements (the same order as in the long string above).
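A minimal sketch of that index arithmetic, assuming every table row contributes exactly ten nowrap cells (as in the sample above). CellRows and toRows are made-up names, and the cell texts are typed in by hand here instead of being read with Jsoup:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CellRows {

    // Hypothetical helper: cuts a flat list of cell texts into rows of a fixed width.
    static List<List<String>> toRows(List<String> cells, int cellsPerRow) {
        List<List<String>> rows = new ArrayList<>();
        for (int start = 0; start < cells.size(); start += cellsPerRow) {
            int end = Math.min(start + cellsPerRow, cells.size());
            rows.add(new ArrayList<>(cells.subList(start, end)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumption: these are the texts of the first twenty cells, in document order.
        List<String> cells = Arrays.asList(
                "0", "10 000 [10 000]", "10 000", "20.48", "0.00",
                "$28.65", "0.00 %", "$894.69", "10.11", "0.21",
                "0", "10 000 [10 000]", "10 000", "46.21", "0.00",
                "$53.82", "0.00 %", "$1 151.78", "8.01", "0.00");

        List<List<String>> rows = toRows(cells, 10);
        System.out.println(rows.size());        // 2
        System.out.println(rows.get(1).get(7)); // $1 151.78
    }
}
```

Grouping by cell index avoids splitting the long string on spaces, which would break values such as "10 000" apart.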
This solved part of my question, but there is still one attribute in the HTML that I am not able to get:
<td align="center">
<input type="text" tabindex="2" name="productData[price]
[{33013477}]" size="10" value="3000.00">
</td>
The value that I need is in the attribute value="3000.00".
Thanks for the interest in this question.
To scrape data from HTML source I use a small method I wrote named getBetween(). It suits me because the data I want always seems to sit between two known strings:
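import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**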
* Retrieves any string data located between the supplied string leftString
* parameter and the supplied string rightString parameter.<br><br>
* <p>
* This method will return all instances of a substring located between the
* supplied Left String and the supplied Right String which may be found
* within the supplied Input String.<br>
*
* @param inputString (String) The string to look for substring(s) in.
*
* @param leftString (String) What may be to the Left side of the substring
* we want within the main input string. Sometimes the
* substring you want may be contained at the very
* beginning of a string and therefore there is no
* Left-String available. In this case you would simply
* pass an empty string ("") to this parameter which
* basically informs the method of this fact. Null can
* not be supplied and will ultimately generate a
* NullPointerException.
*
* @param rightString (String) What may be to the Right side of the
* substring we want within the main input string.
* Sometimes the substring you want may be contained at
* the very end of a string and therefore there is no
* Right-String available. In this case you would simply
* pass an empty string ("") to this parameter which
* basically informs the method of this fact. Null can
* not be supplied and will ultimately generate a
* NullPointerException.
*
* @param options (Optional - Boolean - 2 Parameters):<pre>
*
* ignoreLetterCase - Default is false. This option works against the
* string supplied within the leftString parameter
* and the string supplied within the rightString
* parameter. If set to true then letter case is
* ignored when searching for strings supplied in
* these two parameters. If left at default false
* then letter case is not ignored.
*
* trimFound - Default is true. By default this method will trim
* off leading and trailing white-spaces from found
* sub-string items. General sentences which obviously
* contain spaces will almost always give you a white-
* space within an extracted sub-string. By setting
* this parameter to false, leading and trailing white-
* spaces are not trimmed off before they are placed
* into the returned Array.</pre>
*
* @return (1D String Array) Returns a Single Dimensional String Array
* containing all the sub-strings found within the supplied Input
* String which are between the supplied Left String and supplied
* Right String. You can shorten this method up a little by
* returning a List<String> ArrayList and removing the 'List
* to 1D Array' conversion code at the end of this method. This
* method initially stores its findings within a List object
* anyways.
*/
public String[] getBetween(String inputString, String leftString,
        String rightString, boolean... options) {
    // Return nothing if nothing was supplied.
    if (inputString.isEmpty() || (leftString.isEmpty() && rightString.isEmpty())) {
        return null;
    }

    // Prepare optional parameters if any supplied.
    // If none supplied then use Defaults...
    boolean ignoreCase = false;  // Default.
    boolean trimFound = true;    // Default.
    if (options.length >= 1) {
        ignoreCase = options[0];
    }
    if (options.length >= 2) {
        trimFound = options[1];
    }

    // Remove any ASCII control characters from the
    // supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");

    // Establish a List to hold our found substrings
    // between the supplied Left String and supplied
    // Right String.
    List<String> list = new ArrayList<>();

    // Use Pattern Matching to locate our possible
    // substrings within the supplied Input String.
    // With no Right String, capture everything up to
    // the end of the input.
    String regEx = Pattern.quote(leftString)
            + (!rightString.isEmpty() ? "(.*?)" : "(.*)?")
            + Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }

    Pattern pattern = Pattern.compile(regEx);
    Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substrings into the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        list.add(found);
    }

    // Convert the List to a 1D String Array if it
    // contains something, otherwise return null.
    return list.isEmpty() ? null : list.toArray(new String[0]);
}
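To show the core idea in isolation, here is a cut-down sketch of the same technique (quote both delimiters, capture lazily between them). BetweenDemo and between are hypothetical names, not part of the method above. The sketch assumes both delimiters are non-empty, always trims, and returns an empty List instead of null when nothing is found:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenDemo {

    // Simplified stand-in for getBetween(): every trimmed substring that
    // sits between 'left' and 'right' (both assumed to be non-empty).
    static List<String> between(String input, String left, String right) {
        List<String> found = new ArrayList<>();
        Pattern pattern = Pattern.compile(
                Pattern.quote(left) + "(.*?)" + Pattern.quote(right));
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String cell = "<td align=\"right\" class=\"nowrap\">$1 151.78</td>";
        String input = "[{33013477}]\" size=\"10\" value=\"3000.00\">";

        System.out.println(between(cell, "class=\"nowrap\">", "</td>")); // [$1 151.78]
        System.out.println(between(input, "value=\"", "\">"));           // [3000.00]
    }
}
```

Pattern.quote() is what keeps characters such as [ and ( in the delimiters from being read as regex syntax.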
Getting the web-page HTML source is the easy part. To get the needed numerical values from the "desired output" you initially posted (shown below)
HTML Source:
<td align="right" class="nowrap"> <a href="website" onclick="return
doWindow(this, 700, 500);" class="popup">0</a> </td>
<td align="right" class="nowrap"><a href="website"
onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10
000]</td>
<td align="right" class="nowrap">10 000</td>
<td align="right" class="nowrap">20.48</td>
<td align="right" class="nowrap">0.00</td>
<td align="right" class="nowrap">$28.65</td>
<td align="right" class="nowrap">0.00 %</td>
<td align="right" class="nowrap">$894.69</td>
<td align="right" class="nowrap">10.11</td>
<td align="right" class="nowrap">0.21</td>
<td align="right" class="nowrap"> <a href="website" onclick="return
doWindow(this, 700, 500);" class="popup">0</a> </td>
<td align="right" class="nowrap"><a href="website"
onclick="doWindow(this.href, '1024', '768'); return false;">10 000</a> [10
000]</td>
<td align="right" class="nowrap">10 000</td>
<td align="right" class="nowrap">46.21</td>
<td align="right" class="nowrap">0.00</td>
<td align="right" class="nowrap">$53.82</td>
<td align="right" class="nowrap">0.00 %</td>
<td align="right" class="nowrap">$1 151.78</td>
<td align="right" class="nowrap">8.01</td>
<td align="right" class="nowrap">0.00</td>
<td align="right" class="nowrap"> <a href="website" onclick="return
doWindow(this, 700, 500);" class="popup">0</a> </td>
<td align="right" class="nowrap"><a href="website"
onclick="doWindow(this.href, '1024', '768'); return false;">5 000</a> [5
000]</td>
<td align="right" class="nowrap">5 000</td>
<td align="right" class="nowrap">22.51</td>
<td align="right" class="nowrap">0.00</td>
<td align="right" class="nowrap">$222.53</td>
<td align="right" class="nowrap">0.00 %</td>
<td align="right" class="nowrap">$2 399.92</td>
<td align="right" class="nowrap">5.94</td>
<td align="right" class="nowrap">0.01</td>
<td align="right" class="nowrap"> <a href="website" onclick="return
<td align="center">
<input type="text" tabindex="2" name="productData[price]
[{33013477}]" size="10" value="3000.00">
</td>
I would use the getBetween() method something like this:
// Let's assume the "desired output" you acquired
// is contained within a Text file named "HtmlData.txt".
// Needs: java.io.BufferedReader, java.io.FileReader,
// java.io.IOException, java.util.ArrayList, java.util.List.

// Hold our scraped data in a 2D List interface.
List<List<String>> list = new ArrayList<>();
// Read File using BufferedReader in a Try With Resources block...
try (BufferedReader reader = new BufferedReader(new FileReader("HtmlData.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        List<String> numbers = new ArrayList<>();
        line = line.trim();
        if (line.equals("")) {
            continue;
        }
        // A table row's plain cells follow the second link's onclick line.
        if (line.startsWith("onclick=\"doWindow(this.href,")) {
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                // The next row's first cell ends with "return": row complete.
                if (line.endsWith("return")) {
                    list.add(numbers);
                    break;
                }
                if (line.equals("")) {
                    continue;
                }
                if (line.startsWith("<td align=\"right\" class=\"nowrap\">")) {
                    // getBetween() returns null when nothing is found.
                    String[] found = getBetween(line, "<td align=\"right\" class=\"nowrap\">", "</td>", true, true);
                    if (found != null) {
                        numbers.add(found[0]);
                    }
                }
            }
            // End of file reached inside a row: keep what was collected.
            if (line == null) {
                if (!numbers.isEmpty()) {
                    list.add(numbers);
                }
                break;
            }
        }
        if (line.contains("name=\"productData[price]")) {
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.equals("")) {
                    continue;
                }
                if (line.startsWith("[{33013477}]")) {
                    String[] found = getBetween(line, "value=\"", "\">", true, true);
                    if (found != null) {
                        numbers.add("Product Price: " + found[0]);
                        list.add(numbers);
                    }
                    break; // DONE
                }
            }
        }
    }
}
catch (IOException ex) {
    ex.printStackTrace();
}

// Display our findings to the Console Window in a
// table style format:
for (List<String> row : list) {
    for (String value : row) {
        System.out.printf("%-10s ", value);
    }
    System.out.println("");
}
In case you didn't notice, the other portion you desire from the lines:
<td align="center">
<input type="text" tabindex="2" name="productData[price]
[{33013477}]" size="10" value="3000.00">
</td>
was also contained within the file data. When the code is run you will see the following displayed within the Console Window:
10 000     20.48      0.00       $28.65     0.00 %     $894.69    10.11      0.21
10 000     46.21      0.00       $53.82     0.00 %     $1 151.78  8.01       0.00
5 000      22.51      0.00       $222.53    0.00 %     $2 399.92  5.94       0.01
Product Price: 3000.00
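Since you want to do math with these values afterwards, here is a rough sketch of turning the scraped strings into numbers. ScrapedNumbers and toNumber are hypothetical names. The sketch assumes each string holds a single value with "." as the decimal separator, and that spaces, "$", "%" and labels are only decoration:

```java
import java.math.BigDecimal;

public class ScrapedNumbers {

    // Hypothetical helper: drops everything except digits, '.' and '-',
    // then parses what is left. Throws NumberFormatException when no
    // number remains (for example on an empty string).
    static BigDecimal toNumber(String scraped) {
        String cleaned = scraped.replaceAll("[^0-9.\\-]", "");
        return new BigDecimal(cleaned);
    }

    public static void main(String[] args) {
        BigDecimal quantity = toNumber("10 000");
        BigDecimal price = toNumber("$28.65");

        System.out.println(toNumber("$1 151.78"));              // 1151.78
        System.out.println(toNumber("Product Price: 3000.00")); // 3000.00
        System.out.println(quantity.multiply(price));           // 286500.00
    }
}
```

BigDecimal is used here because several of the values are money amounts. A cell holding two values, such as "10 000 [10 000]", would need to be split before being passed in, since the digits would otherwise run together.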