How to get the code of a contentPlaceHolder programmatically



I want to extract the HTML of a contentPlaceHolder using any HTML parser. The problem is that the parser needs a URL, but because the content lives inside a master page, it has no URL of its own.

The page contains a select tag. When you pick an option, it loads a contentPlaceHolder, and that loaded content is the HTML I want to extract.

NOTE: I didn't build the website.

Here are some pictures to explain it better:

This is the master page: [image]

This is the content (when you press the red sign): [image]

I hope it is clear enough to understand... Thanks!


Solution

  • This requires the jsoup library.

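    // Imports this snippet needs: org.jsoup.Jsoup, org.jsoup.nodes.Document,
    // org.jsoup.nodes.Element, org.jsoup.select.Elements, android.text.TextUtils,
    // android.util.Log, java.io.IOException, java.util.Collections,
    // java.util.regex.Matcher, java.util.regex.Pattern.
    // On Android, run this off the main thread (it does network I/O).
    try {
        // Regex that captures the first single-quoted string in a link's href
        Pattern p = Pattern.compile("'([^']*)'");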
    
        // First, let's find the IFRAME from the main page
        Document doc = Jsoup.connect("http://blich.co.il/timetable-shahaf").get();
        Elements iframe = doc.select("iframe");
        if (!iframe.isEmpty()) {
            String src = iframe.get(0).absUrl("src");
            if (!TextUtils.isEmpty(src)) {
                // Now we need to fetch the contents of the IFRAME
                doc = Jsoup.connect(src).get();
    
                // This is where we manipulate the <select ..> statement. There's only
                // one on this page, so this will be done quick and dirty
                Elements selects = doc.select("select.HeaderClasses");
                Elements options = selects.select("option");
                if (!options.isEmpty()) {
                    // There are a lot of options here and I don't know what they mean, so
                    // let's just pick a random one and go with that. Your code should
                    // probably let the user choose from a dialog or something.
                    Collections.shuffle(options);
                    Element option = options.get(0);
    
                    String name = selects.get(0).attr("name");
                    if (!TextUtils.isEmpty(name)) {
                        doc = Jsoup.connect(src)
                                .data("__EVENTTARGET", name)
                                .data("__EVENTARGUMENT", "")
                                .data(name, option.attr("value")) // Add random option value
                                .data("__VIEWSTATE", 
                                        doc.select("input#__VIEWSTATE").attr("value"))
                                .data("__LASTFOCUS", "")
                                .post();
                    }
                }
                // All the relevant links are stored in a td with the class "HeaderCell"
                Elements links = doc.select("td.HeaderCell a");
                for (Element link : links) {                    
                    // These are all links to a JavaScript function, __doPostBack(..):
                    // function __doPostBack(eventTarget, eventArgument) {
                    //   if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
                    //      theForm.__EVENTTARGET.value = eventTarget;
                    //      theForm.__EVENTARGUMENT.value = eventArgument;
                    //      theForm.submit();
                    //   }
                    // }
                    // The important bits appear to be eventTarget and eventArgument at least,
                    // but none of the links define an eventArgument in any case - so we just
                    // need "eventTarget".
    
                    // Naïve splitting, take the first quoted string
                    Matcher m = p.matcher(link.attr("href"));
                    if (m.find()) {
                        String eventTarget = m.group(1);
                        // The eventTarget you're looking for ends with 'ChangesTable'
                        if (eventTarget != null && eventTarget.endsWith("ChangesTable")) {
                            // Now we need to do a POST. The page requires us to retain
                            // __VIEWSTATE, so we need to post that too.
                            doc = Jsoup.connect(src)
                                    .data("__EVENTTARGET", eventTarget)
                                    .data("__EVENTARGUMENT", "")
                                    .data("__VIEWSTATE", 
                                            doc.select("input#__VIEWSTATE").attr("value"))
                                    .data("__LASTFOCUS", "")
                                    .post();
    
    
                            // All the lesson information is stored in a div with the class 
                            // TTLesson, so let's select those
                            Elements lessons = doc.select("div.TTLesson");
                            if (lessons.isEmpty()) {
                                Log.w(TAG, "Unable to list any lessons");
                            } else {
                                for (Element lesson : lessons) {
                                    // This is where knowledge of Hebrew would come in handy -
                                    // but this will list all lessons. You should be able
                                    // to figure out how to find the one you want.
                                    System.out.println(lesson);
                                }
                            }
                        }
                    }
                }
            } else {
                Log.w(TAG, "Unable to find iframe src");
            }
        } else {
            Log.w(TAG, "Unable to find iframe");
        }
    } catch (IOException e) {
        Log.w(TAG, "Error reading timetable", e);
    }
    

    This will list all the lessons on the page you wanted. I'll leave finding the right lesson up to you since I don't know enough Hebrew to discern what the hell any of the cells contain.
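
The link-parsing step in the code above can be tried on its own, without jsoup or a network connection. The following is a minimal sketch using only the Java standard library. The class name `PostBackLinks`, the helper `eventTarget`, and the sample href (including the control name `TimeTableView1$btnChangesTable`) are made up for illustration and are not taken from the real page.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PostBackLinks {
    // Same idea as the pattern in the answer: capture the first single-quoted string.
    private static final Pattern FIRST_QUOTED = Pattern.compile("'([^']*)'");

    /** Returns the first quoted argument found in the href, or empty if there is none. */
    static Optional<String> eventTarget(String href) {
        if (href == null) {
            return Optional.empty();
        }
        Matcher m = FIRST_QUOTED.matcher(href);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical href; the real control names on the page will differ.
        String href = "javascript:__doPostBack('TimeTableView1$btnChangesTable','')";

        Optional<String> target = eventTarget(href);
        // Keep only the link whose target ends with "ChangesTable", as the answer does.
        if (target.isPresent() && target.get().endsWith("ChangesTable")) {
            System.out.println("Post back to: " + target.get());
        } else {
            System.out.println("Not the link we are looking for");
        }
    }
}
```

Returning an `Optional` instead of `null` is just one way to make the "no quoted string found" case explicit; the original null check works equally well.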

    Edit: Now the example will randomly select an option in the <select> and refresh the page.
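
Both POST requests in the code above send the same four hidden fields, so it may help to build them in one place. Below is a minimal sketch, again using only the standard library, that collects the fields and shows what the URL-encoded form body would look like. The class `PostBackForm`, its methods, and the sample values are hypothetical; a real `__VIEWSTATE` value is much longer and must be read from the page you just fetched.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class PostBackForm {
    /** Collects the hidden fields that the answer's code posts, in a fixed order. */
    static Map<String, String> fields(String eventTarget, String viewState) {
        Map<String, String> f = new LinkedHashMap<>();
        f.put("__EVENTTARGET", eventTarget);
        f.put("__EVENTARGUMENT", "");
        f.put("__VIEWSTATE", viewState);
        f.put("__LASTFOCUS", "");
        return f;
    }

    /** Encodes the fields as an application/x-www-form-urlencoded body. */
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(enc(e.getKey()) + "=" + enc(e.getValue()));
        }
        return body.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        Map<String, String> f = fields("ctl$ChangesTable", "/wEP=");
        System.out.println(encode(f));
    }
}
```

With jsoup, a map like this can be passed to `Connection.data(Map<String, String>)` in place of the repeated `.data(key, value)` calls, and jsoup takes care of the encoding itself. If the site also emits other hidden inputs, they would presumably need to be added in the same way.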