Clean up HTML input using HtmlCleaner


I am writing a Java project that uses the HtmlCleaner library to clean an HTML page and save the output as an XML file. This is the code I wrote:

URL urlSB = new URL("http://www.groupon.com/browse/chicago?z=skip");
URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "google.com");
urlConnection.connect();
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());

// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
        tagNodeRoot , "cleaned.xml", "utf-8"
);

The problem is that after running the project, the cleaned.xml file is empty.


Solution

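  • The problem is that the page you are trying to access redirects from HTTP to HTTPS. Java's URLConnection does not automatically follow a redirect that switches protocols, so you end up reading the empty redirect response instead of the page, and HtmlCleaner has nothing to clean. If you change the URL to HTTPS, it works fine: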

    URL urlSB = new URL("https://www.groupon.com/browse/chicago?z=skip");
    URLConnection urlConnection = urlSB.openConnection();
    urlConnection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:5.0) Gecko/20100101 Firefox/25.0");
    urlConnection.connect();
    HtmlCleaner cleaner = new HtmlCleaner();
    CleanerProperties props = cleaner.getProperties();
    props.setNamespacesAware(false);
    TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());
    new PrettyXmlSerializer(props).writeToFile(tagNodeRoot, "cleaned.xml", "utf-8");
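
If you cannot hard-code the HTTPS URL, one option is to follow the redirect yourself. The following is only a minimal sketch of that idea, not something from HtmlCleaner: the class name RedirectFollower, its method names, and the limit of 5 hops are all hypothetical choices made for illustration. It assumes the server announces the new address in a standard Location header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

// Hypothetical helper (not part of HtmlCleaner): follows redirects by hand,
// including the HTTP -> HTTPS switch that is not followed automatically.
final class RedirectFollower {

    // Assumed cap on redirect hops, chosen arbitrarily for this sketch.
    private static final int MAX_REDIRECTS = 5;

    /** Returns true for the status codes that point to another URL. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves a Location header (absolute or relative) against the current address. */
    static String resolveLocation(String current, String location) {
        return URI.create(current).resolve(location).toString();
    }

    /** Opens the address and returns the body stream of the first non-redirect response. */
    static InputStream open(String address, String userAgent) throws IOException {
        String current = address;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            // Handle every redirect in this loop instead of relying on the built-in logic.
            conn.setInstanceFollowRedirects(false);
            conn.setRequestProperty("User-Agent", userAgent);

            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn.getInputStream();
            }

            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            current = resolveLocation(current, location);
        }
        throw new IOException("Too many redirects starting at " + address);
    }
}
```

With this helper, the stream returned by RedirectFollower.open("http://www.groupon.com/browse/chicago?z=skip", userAgent) could be passed to cleaner.clean(...) in place of urlConnection.getInputStream(), and the rest of the code stays the same.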