
New line character handling in Jsoup


When parsing HTML with Jsoup, a newline character inside a string of text is treated as if it were not there. Consider this string, which wraps because of a newline character: "This string of text will wrap\nhere because of a new line character." When Jsoup parses it, it returns "This string of text will wraphere because of a new line character." Note that the newline does not even become a space, so the two words run together; I just want it to be returned with a space. This is the text within a node. I have seen other solutions on Stack Overflow where people want (or don't want) a line break after a tag. That is not what I want. I simply want to know whether I can modify the parsing step so that it does not ignore newline characters.


Solution

  • I figured it out. The mistake was in how I fetched the HTML from the URL, not in Jsoup. I was using this method:

    public static String getUrl(String url) {
        URL urlObj = null;
        try{
            urlObj = new URL(url);
        }
        catch(MalformedURLException e) {
            System.out.println("The url was malformed!");
            return "";
        }
        URLConnection urlCon = null;
        BufferedReader in = null;
        String outputText = "";
        try{
            urlCon = urlObj.openConnection();
            in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
            String line = "";
            while((line = in.readLine()) != null){
                outputText += line;
            }
            in.close();
        }
        catch(IOException e){
            System.out.println("There was an error connecting to the URL");
            return "no";
        }
        return outputText;
    }
    

    When I should have been using the following, which re-appends the newline that readLine() strips from each line:

    public static String getUrl(String url) {
        URL urlObj = null;
        try{
            urlObj = new URL(url);
        }
        catch(MalformedURLException e) {
            System.out.println("The url was malformed!");
            return "";
        }
        URLConnection urlCon = null;
        BufferedReader in = null;
        String outputText = "";
        try{
            urlCon = urlObj.openConnection();
            in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
            String line = "";
            while((line = in.readLine()) != null){
                outputText += line + "\n";
            }
            in.close();
        }
        catch(IOException e){
            System.out.println("There was an error connecting to the URL");
            return "no";
        }
        return outputText;
    }
    

    The problem had nothing to do with Jsoup. I thought I would make note of it here since I copied this code from Instant Web Scraping with Java by Ryan Mitchell, and anyone else following that tutorial might run into the same issue.
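As an aside, the manual concatenation loop can be replaced with `BufferedReader.lines()` and a joining collector (Java 8+), which rejoins the lines with `"\n"` without the easy-to-typo string append. This is a minimal sketch, not the book's code; a `StringReader` stands in for the network stream so it runs without a URL:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.stream.Collectors;

public class ReadPreservingNewlines {

    // readLine()/lines() strip the line terminator from each line, so we
    // re-insert "\n" between lines before handing the text to a parser.
    static String readAll(BufferedReader in) {
        return in.lines().collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // StringReader substitutes for new InputStreamReader(urlCon.getInputStream()).
        String html = "<p>This string of text will wrap\nhere because of a new line character.</p>";
        String result = readAll(new BufferedReader(new StringReader(html)));
        System.out.println(result.equals(html)); // the internal newline survives the round trip
    }
}
```

Note that a trailing newline at the very end of the input is still lost with this approach, since `joining` only inserts separators between lines; for this use case (keeping words from running together) that does not matter.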