java android parsing htmlcleaner

How to Properly Get HTML Asset


I've been following a tutorial on parsing HTML with HtmlCleaner, specifically this one: http://xjaphx.wordpress.com/2012/02/04/android-xml-adventure-parsing-html-using-htmlcleaner/

One part of the code takes a URL, cleans the HTML of that page, and evaluates an XPath expression against it:

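import java.net.URL;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

// Configure the cleaner
HtmlCleaner htmlCleaner = new HtmlCleaner();
CleanerProperties props = htmlCleaner.getProperties();
props.setAllowHtmlInsideAttributes(false);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);

// Fetch and clean the page, then run the XPath query against the result
URL url = new URL(incommingURL);
TagNode root = htmlCleaner.clean(url);
Object[] statsNode = root.evaluateXPath(incommingXPath);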

How can I save a web page, bundle it with my app as an asset, and run the same parsing against that local copy?

Thanks


Solution

  • Here's one possible approach. Sorry, I can't post any production code, but the good news is that this idea has been used successfully.

    If "web page" means a single file, just copy it into the assets folder of your project. If it consists of multiple files, zip them together and add the archive instead.

    Note that some posts suggest "magic" paths that let you address the assets folder on the device directly. To the best of my knowledge, those are not documented and work only by coincidence, so I would refrain from using them.

    Instead, use AssetManager (Context.getAssets().open(...)) to get an input stream. Then copy the file, or unzip the archive (wrap the stream in a ZipInputStream and iterate over its ZipEntry elements), to either internal storage (Context.getFilesDir()) or external storage such as the SD card (Context.getExternalFilesDir(...)).

    Then pass the URL (file://...) of the copied web page file as incommingURL.
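
The copy and unzip steps above could be sketched roughly as follows. This is only a minimal sketch in plain Java, not the production code mentioned earlier: the class name AssetCopier and its methods copyPage and unzipPage are hypothetical, made up for illustration. It accepts any InputStream so it can be tried off-device; on Android the stream would presumably come from Context.getAssets().open(...) and the target directory from Context.getFilesDir().

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper; the class and method names are illustrative only. */
public final class AssetCopier {

    private AssetCopier() {}

    /**
     * Copies a single-file page from the stream into targetDir and returns
     * a file: URL for the copy. The caller remains responsible for closing the stream.
     */
    public static URL copyPage(InputStream in, File targetDir, String fileName)
            throws IOException {
        File target = new File(targetDir, fileName);
        writeStream(in, target);
        return target.toURI().toURL();
    }

    /** Unpacks a zipped page (HTML plus its resources) from the stream into targetDir. */
    public static void unzipPage(InputStream in, File targetDir) throws IOException {
        String rootPath = targetDir.getCanonicalPath() + File.separator;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                File target = new File(targetDir, entry.getName());
                // Reject entries that would land outside targetDir ("zip slip").
                if (!target.getCanonicalPath().startsWith(rootPath)) {
                    throw new IOException("Bad zip entry: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    if (!target.isDirectory() && !target.mkdirs()) {
                        throw new IOException("Cannot create directory " + target);
                    }
                } else {
                    // Reading from the zip stream stops at the end of the current entry.
                    writeStream(zip, target);
                }
            }
        }
    }

    /** Writes everything readable from the stream to target, creating parent directories. */
    private static void writeStream(InputStream in, File target) throws IOException {
        File parent = target.getParentFile();
        if (parent != null && !parent.isDirectory() && !parent.mkdirs()) {
            throw new IOException("Cannot create directory " + parent);
        }
        try (OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}
```

On Android, a call might look like AssetCopier.copyPage(context.getAssets().open("page.html"), context.getFilesDir(), "page.html"), where "page.html" is an assumed asset name. The string form of the returned URL could then be used as incommingURL.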