When parsing HTML with JSoup, if there is a newline character in a string of text, it gets treated as if it were not there. Consider this string, which wraps because of a newline character:

This string of text will wrap
here because of a new line character.

But when JSoup parses it, what comes back is:

This string of text will wraphere because of a new line character.

Note that the newline does not even become a space, and a space is all I want back. This is text within a single node. I have seen other solutions on Stack Overflow where people want (or don't want) a line break after a tag; that is not what I am asking. I simply want to know whether I can change the parse step so that it does not ignore newline characters.
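For reference, here is a minimal sketch of what I would expect when a newline actually reaches the parser (the markup and class name are just made up for illustration):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class NewlineCheck {
    public static void main(String[] args) {
        // Illustrative markup with a literal newline inside a text node
        String html = "<p>This string of text will wrap\nhere because of a new line character.</p>";
        Document doc = Jsoup.parse(html);
        // As far as I understand, text() normalizes whitespace, so the
        // newline should come back as a single space
        System.out.println(doc.select("p").first().text());
    }
}

As far as I understand, text() should at least turn that newline into a space, which is not what I am seeing.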
I figured it out: I made a mistake in getting the HTML from the URL. I was using this method:
// imports needed for both versions of this method
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public static String getUrl(String url) {
    URL urlObj = null;
    try {
        urlObj = new URL(url);
    }
    catch (MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }

    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try {
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while ((line = in.readLine()) != null) {
            // readLine() strips the line terminator, so every newline in the
            // page is silently dropped before the text ever reaches JSoup
            outputText += line;
        }
        in.close();
    }
    catch (IOException e) {
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}
I should have been using the following instead:
public static String getUrl(String url) {
    URL urlObj = null;
    try {
        urlObj = new URL(url);
    }
    catch (MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }

    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try {
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while ((line = in.readLine()) != null) {
            // re-append the newline that readLine() removed
            outputText += line + "\n";
        }
        in.close();
    }
    catch (IOException e) {
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}
The problem had nothing to do with JSoup. I thought I would note it here because I copied this code from Instant Web Scraping with Java by Ryan Mitchell, and anyone else following that tutorial might run into the same issue.
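As an aside, jsoup can also fetch the page itself, which avoids the readLine() pitfall entirely because the raw response stream goes straight to the parser. A minimal sketch (the URL and class name are placeholders, not from the book):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchWithJsoup {
    public static void main(String[] args) throws IOException {
        // Placeholder URL; use whatever page you are scraping
        Document doc = Jsoup.connect("http://example.com/").get();
        // No manual readLine() loop, so nothing strips the newlines
        // before the document is parsed
        System.out.println(doc.body().text());
    }
}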