android, html-parsing, htmlcleaner

Android, Proper HTMLCleaner Usage


I know we should basically try to do our own stuff here, and this is not a place to make requests, but I really hate having to read stuff from HTML; I truly don't understand its ways.

So, I will be awarding a bounty of 150 points (not that I'm cheap, I just can't do more :( ) if I can get some good help, or at least be pointed in the right direction with some sample code.

What am I trying to accomplish?

  • I am trying to get the Latest News from the following NASA page.
  • I plan to display this news in a ListView. Of course, the ListView has very little content to begin with, only the data available through the page above; here's a quick mock-up.

That's it. When the user clicks a link, they will be taken to a different fragment that shows the full article; I'll figure out how to get that later, once I can get this done.
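
Just so the goal is concrete, here is roughly the kind of item I picture backing each row of that ListView. This is only a sketch of my own; the class and field names below are placeholders, not from any existing code:

//Plain data holder for one news entry (sketch only; all names are placeholders).
public class NewsItem {
    private final String title;       //article title
    private final String preview;     //short preview/description text
    private final String imageUrl;    //link to the article image
    private final String articleUrl;  //link to the full article

    public NewsItem(String title, String preview, String imageUrl, String articleUrl) {
        this.title = title;
        this.preview = preview;
        this.imageUrl = imageUrl;
        this.articleUrl = articleUrl;
    }

    public String getTitle() { return title; }
    public String getPreview() { return preview; }
    public String getImageUrl() { return imageUrl; }
    public String getArticleUrl() { return articleUrl; }
}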

So, I tried using HtmlCleaner with the following bit:

private class CleanUrlTask extends AsyncTask<Void, Void, Void> {

    @Override
    protected Void doInBackground(Void... params) {
        try {
            //try cleaning the nasa page. 
            mNode = mCleaner.clean(mUrl);
        } catch (Exception e) {
            Constants.logMessage("Error cleaning file" + e.toString());
        }
        return null;
    }

    @Override
    protected void onPostExecute(Void result) {         
        try {
            //For now I am just writing to an xml file to sort of read through
            //God is HTML code ugly. 
            new PrettyXmlSerializer(mProps).writeToFile(
                    mNode, FILE_NAME, "utf-8"
                );
        } catch (Exception e) {
            Constants.logMessage("Error writing to file: " + e.toString());
        }
    }       
}
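
In case it helps, here is roughly how the fields used above (mCleaner, mProps, mNode, mUrl) are set up. The wrapper class and method names below are just for illustration; only HtmlCleaner, CleanerProperties, and TagNode are real classes from the htmlcleaner library, and the URL string is a placeholder for the NASA page linked above:

import java.net.MalformedURLException;
import java.net.URL;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

//Sketch of the host class's fields and their setup.
public class NewsLoader {
    private HtmlCleaner mCleaner;
    private CleanerProperties mProps;
    private TagNode mNode;
    private URL mUrl;

    public void setUp(String pageUrl) throws MalformedURLException {
        mCleaner = new HtmlCleaner();
        mProps = mCleaner.getProperties(); //reuse the cleaner's own properties for the serializer
        mUrl = new URL(pageUrl);           //e.g. the NASA page linked above
    }
}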

But from there, I am pretty much lost. Here's the XML output, by the way. I did, however, notice that there is some repetition in the tag hierarchy for each article's content. It seems to go like this: the left side holds the image and the article link, and the right side holds the article title and preview content.

(Image: class name hierarchy)

So, if anybody is up for helping me figure out how to obtain the content somehow, I'd greatly appreciate it.

Just as a side note, this project is for educational purposes as part of the 2013 NASA International Space Apps Challenge, more info here.

As a bonus question: the same link contains information for Current, Future, and Past Expeditions, including the current members, and for each member of the expedition there is a link to their bio page.

The tags for those don't seem to be repetitive; however, the names seem to be preset and constant: you have "tab1", "tab2", "tab3", and so on.

What I'd like to obtain from that is:

  • Expedition Number and Dates.
  • Expedition Crew Members.
  • A link to each member's bio.

Again, thanks for any support; I really appreciate it.


Solution

  • So apparently all I needed to do was figure out how to use XPath to get the data from the XML output.

    Basically, the idea with XPath is that you can get any node within the XML, and in my case, as you can see in the image above, I wanted to get very specific information.

    Here's the XPATH for the Article Link:

    public static final String XPATH_ARTICLE_LINKS = 
    "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='fpss-img_holder_div_landing']/div[@id='fpss-img-div_466']/a/@href";
    

    Where //div[@class='landing-slide'] means that I am looking for any div whose class name is landing-slide, regardless of where it may be located in the document (that's what the '//' declares). From there, I just go further into the hierarchy of the item to finally obtain the value of the href attribute (attributes are referenced via the '@' character).

    Now that we have the XPath, we just need to pass this value to the HTML cleaner. I am doing this via an AsyncTask; please keep in mind that this isn't the final code, but it certainly gets the info I want.

    First, the XPaths used:

    private class News {
        static final String XPATH_ARTICLE_LINKS = 
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='fpss-img_holder_div_landing']/div[@id='fpss-img-div_466']/a/@href";
        static final String XPATH_ARTICLE_IMAGES = 
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='fpss-img_holder_div_landing']/div[@id='fpss-img-div_466']/a/img/@src";
        static final String XPATH_ARTICLE_HEADERS = 
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='landing-fpss-introtext']/div[@class='landing-slidetext']/h1/a";
        static final String XPATH_ARTICLE_DESCRIPTIONS = 
                "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='landing-fpss-introtext']/div[@class='landing-slidetext']/p";                       
    }
    

    Now for the AsyncTask:

    private class CleanUrlTask extends AsyncTask<Void, Void, Void> {
    
        @Override
        protected Void doInBackground(Void... params) {
            try {
                //try cleaning the nasa page. (Root Node) 
                mNode = mCleaner.clean(mUrl);
    
                //Get all of the article links
                Object[] mArticles = mNode.evaluateXPath(News.XPATH_ARTICLE_LINKS);
                //Get all of the image links
                Object[] mImages = mNode.evaluateXPath(News.XPATH_ARTICLE_IMAGES);
                //Get all of the Article Titles
                Object[] mTitles = mNode.evaluateXPath(News.XPATH_ARTICLE_HEADERS);
                //Get all of the Article Descriptions
                Object[] mDescriptions = mNode.evaluateXPath(News.XPATH_ARTICLE_DESCRIPTIONS);
    
                Constants.logMessage("Found : " + mArticles.length + " articles");
                //Value containers
                String link, image, title, description;
    
                for (int i = 0; i < mArticles.length; i++) {
                    //The NASA page returns links that are often not fully qualified URLs, so I need to prepend the prefix if needed. 
                    link = mArticles[i].toString().startsWith(FULL_HTML_PREFIX)? mArticles[i].toString() : NASA_PREFIX + mArticles[i].toString();
                    image = mImages[i].toString().startsWith(FULL_HTML_PREFIX)? mImages[i].toString() : NASA_PREFIX + mImages[i].toString();
                    //On the previous two items we were getting the attribute value
                    //Here, we actually need the text inside the actual element, and so we want to cast the object to a TagNode
                    //The TagNode allows to extract the Text for the supplied element. 
                    title = ((TagNode)mTitles[i]).getText().toString();
                    description = ((TagNode)mDescriptions[i]).getText().toString();
                    //Only log the values for now. 
                    Constants.logMessage("Link to article is " + link);
                    Constants.logMessage("Image from article is " + image);
                    Constants.logMessage("Title of article is " + title);
                    Constants.logMessage("Description of article is " + description);
    
                }
            } catch (Exception e) {
                Constants.logMessage("Error cleaning file" + e.toString());
            }
            return null;
        }
    }

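    The obvious next step, which isn't shown above, is to push these values into the ListView instead of just logging them. Here is a rough sketch of how I plan to do that, assuming the task lives inside an Activity that holds a ListView; mListView, mTitlesList, and NewsActivity are placeholder names of mine, not part of the code above (it needs java.util.ArrayList, java.util.List, and android.widget.ArrayAdapter):

    //Collected on the background thread, bound to the list on the UI thread.
    private final List<String> mTitlesList = new ArrayList<String>();

    //...inside the for-loop of doInBackground(), once 'title' has been built:
    //    mTitlesList.add(title);

    //Added to CleanUrlTask:
    @Override
    protected void onPostExecute(Void result) {
        ArrayAdapter<String> adapter = new ArrayAdapter<String>(
                NewsActivity.this,                    //the host Activity (placeholder)
                android.R.layout.simple_list_item_1,  //stock single-line row layout
                mTitlesList);
        mListView.setAdapter(adapter);
    }

    Later I can swap the plain strings for a richer item that also carries the image and link, and hook up the click listener that opens the full-article fragment.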
    In case anyone was just as lost as I was, I hope this can shed some light upon your way.