I'm running into a problem where text is written to a file twice: the first time with incorrect formatting and the second time with correct formatting. The method below takes in this URL after it has been converted properly. It is supposed to print a newline between the text of each child of the dividers nested inside the "ffaq" divider, where all the body text resides. Any help would be appreciated. I'm fairly new to jsoup, so an explanation would be nice as well.
/**
 * Method to deal with HTML 5 Gamefaq entries.
 * @param url The location of the HTML 5 entry to read.
 */
public static void htmlDocReader(URL url) {
    try {
        Document doc = Jsoup.parse(url.openStream(), "UTF-8", url.toString());
        //parse pagination label
        String[] num = doc.select("div.span12").
                select("ul.paginate").
                select("li").
                first().
                text().
                split("\\s+");
        //get the max page number
        final int max_pagenum = Integer.parseInt(num[num.length - 1]);
        //create a new file based on the url path
        File file = urlFile(url);
        PrintWriter outFile = new PrintWriter(file, "UTF-8");
        //Add every page to the text file
        for(int i = 0; i < max_pagenum; i++) {
            //if not the first page then change the url
            if(i != 0) {
                String new_url = url.toString() + "?page=" + i;
                doc = Jsoup.parse(new URL(new_url).openStream(), "UTF-8",
                        new_url.toString());
            }
            Elements walkthroughs = doc.select("div.ffaq");
            for(Element elem : walkthroughs.select("div")) {
                for(Element inner : elem.children()) {
                    outFile.println(inner.text());
                }
            }
        }
        outFile.close();
    } catch(Exception e) {
        e.printStackTrace();
        System.exit(1);
    }
}
For every element on which you call text(), you print all the text of its whole structure, children included.
Assume the example below:
<div>
    text of div
    <span>text of span</span>
</div>
If you call text() on the div element, you will get
text of div text of span
If you then call text() on the span, you will get
text of span
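Because your loop visits both the outer elements and the elements nested inside them, the nested text is written once as part of its parent and then again on its own. What you need in order to avoid the duplicates is ownText(). It returns only the direct text of the element, not the text of its children.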
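To try the difference without jsoup on the classpath, here is a minimal sketch that reproduces it with the JDK's built-in DOM parser. The class name OwnTextDemo and the helpers parseRoot, fullText, and ownText are made up for illustration. getTextContent() stands in for jsoup's text(), and the hand-written ownText helper only approximates jsoup's ownText(): jsoup also normalizes whitespace, while this sketch merely trims.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OwnTextDemo {
    // Parses a small well-formed snippet and returns its root element.
    // Checked parser exceptions are wrapped so callers stay simple.
    static Element parseRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse snippet", e);
        }
    }

    // Recursive text: the element's own text plus the text of all descendants
    // (the behaviour that causes the duplicates, comparable to jsoup's text()).
    static String fullText(Element e) {
        return e.getTextContent().trim();
    }

    // Direct text only: joins the text-node children and skips child elements
    // (comparable to jsoup's ownText()).
    static String ownText(Element e) {
        StringBuilder sb = new StringBuilder();
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Element div = parseRoot("<div>text of div <span>text of span</span></div>");
        Element span = (Element) div.getElementsByTagName("span").item(0);
        System.out.println(fullText(div));   // text of div text of span
        System.out.println(fullText(span));  // text of span
        System.out.println(ownText(div));    // text of div
    }
}
```

Printing fullText for both the div and the span repeats "text of span", which is the duplication seen in the output file; printing only the direct text of each element writes every piece of text once.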
Long story short, change this
for(Element elem : walkthroughs.select("div")) {
    for(Element inner : elem.children()) {
        outFile.println(inner.text());
    }
}
to this
for(Element elem : walkthroughs.select("div")) {
    for(Element inner : elem.children()) {
        String line = inner.ownText().trim();
        if(!line.equals("")) //Skip empty lines
            outFile.println(line);
    }
}