java android htmlcleaner

How do I get a cleaned html file from HtmlCleaner?


My application downloads a certain website as an HTML file the first time it is started. The HTML file is very messy, of course, so I want to clean it with HtmlCleaner so that I can then parse it with Jsoup. But how do I get a new, cleaned HTML file once the cleaning is done?

I did some research and this is all I could find:

HtmlCleaner htmlCleaner = new HtmlCleaner();

TagNode root = htmlCleaner.clean(url);

HtmlCleaner.getInnerHtml(root);

String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";

But I can't see where this code writes to a new file. If it doesn't, how do I implement it so that the old file is deleted and the new cleaned HTML file is created?


Solution

  • You can do something like the following:

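    // HtmlCleaner classes are in org.htmlcleaner; URL is java.net.URL
    HtmlCleaner cleaner = new HtmlCleaner();
    CleanerProperties props = cleaner.getProperties();
    final String siteUrl = "http://www.themoscowtimes.com/";

    // download the page and clean it into a tag tree
    TagNode node = cleaner.clean(new URL(siteUrl));

    // serialize to xml file
    new PrettyXmlSerializer(props).writeToFile(
        node, "cleaned.xml", "utf-8"
    );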
    

    or

    // serialize to html file
    SimpleHtmlSerializer serializer = new SimpleHtmlSerializer(cleaner.getProperties());
    serializer.writeToFile(node, "c:/temp/cleaned.html");
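
As for deleting the old file and creating the new one: the serializers above simply write to whatever path you give them, so one option is to write the cleaned output over the original. If you already have the cleaned markup as a String, a minimal sketch of the replace step could look like this. It uses only `java.nio.file` (on Android that means API 26+, as far as I know); `CleanedFileStore`, `replaceWithCleaned` and `readHtml` are hypothetical names made up for this example, not part of HtmlCleaner or Jsoup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of HtmlCleaner or Jsoup.
class CleanedFileStore {

    // Replaces `target` with `cleanedHtml`. The content is first written to a
    // temp file in the same directory, then moved over the target, so the old
    // file is only replaced after the new content has been fully written.
    static void replaceWithCleaned(Path target, String cleanedHtml) {
        try {
            Path dir = target.toAbsolutePath().getParent();
            Path tmp = Files.createTempFile(dir, "cleaned", ".tmp");
            Files.write(tmp, cleanedHtml.getBytes(StandardCharsets.UTF_8));
            // REPLACE_EXISTING removes the old file as part of the move.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back as a UTF-8 string, e.g. to hand it to a parser.
    static String readHtml(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would call `CleanedFileStore.replaceWithCleaned(path, html)` with the path of the downloaded file and the cleaned markup, and later pass the result of `readHtml(path)` to your parser. Wrapping `IOException` in `UncheckedIOException` is just a choice to keep the sketch short; rethrowing the checked exception works equally well.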