java, regex, htmlunit

Regex expression in java htmlunit


I am trying to advance my knowledge of Java by automating webpage scraping and form input. I have experimented with jsoup and now HtmlUnit. I found an HtmlUnit example that I am trying to run.

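import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GoogleHtmlUnitTest {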
    static final WebClient browser;

    static {
        browser = new WebClient();
        browser.getOptions().setJavaScriptEnabled(false);
//        browser.setJavaScriptEnabled(false);
    }

    public static void main(String[] arguments) {
        boolean result;
        try {
            result = searchTest();
        } catch (Exception e) {
            e.printStackTrace();
            result = false;
        }

        System.out.println("Test " + (result? "passed." : "failed."));
        if (!result) {
            System.exit(1);
        }
    }

    private static boolean searchTest() {
        HtmlPage currentPage;

        try {
            currentPage = (HtmlPage) browser.getPage("http://www.google.com");
        } catch (Exception e) {
            System.out.println("Could not open browser window");
            e.printStackTrace();
            return false;
        }
        System.out.println("Simulated browser opened.");

        try {
            ((HtmlTextInput) currentPage.getElementByName("q")).setValueAttribute("qa automation");
            currentPage = currentPage.getElementByName("btnG").click();
            System.out.println("contents: " + currentPage.asText());
            return containsPattern(currentPage.asText(), "About .* results");
        } catch (Exception e) {
            System.out.println("Could not search");
            e.printStackTrace();
            return false;
        }
    }

    public static boolean containsPattern(String string, String regex) {
        Pattern pattern = Pattern.compile(regex);

        // Check for the existence of the pattern
        Matcher matcher = pattern.matcher(string);
        return matcher.find();
    }
}

It runs, but with some HtmlUnit warnings that answers on Stack Overflow say to ignore. The program produces the correct result, so I am taking that advice and ignoring them.

Jul 31, 2016 7:29:03 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'https://www.google.com/search?q=qa+automation&sa=G&gbv=1&sei=_eCdV63VGMjSmwHa85kg' [1:1467] Error in declaration. '*' is not allowed as first char of a property.

My problem at the moment is the regular expression used for the search. If I understand this correctly, “qa automation” is googled and the retrieved page is then searched by:

return containsPattern(currentPage.asText(), "About .* results");

What is throwing me is “About .* results”. This is the regex, but I don't get how it is being interpreted. What is being searched for on the retrieved page?


Solution

  • .* means "zero or more of any character"; in other words, a complete wildcard. The expression can match any of these:

    About 28 results
    About 2864 results
    About 2,864 results
    About ERROR results
    About  results
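
The list above can be checked directly in Java. This is a minimal sketch that reuses the `containsPattern` helper from the question; the class name `WildcardDemo` and the sample strings are only illustrative assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; containsPattern mirrors the helper in the question.
class WildcardDemo {
    static boolean containsPattern(String text, String regex) {
        // find() reports whether the pattern occurs anywhere in the text
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String regex = "About .* results";
        String[] samples = {
            "About 28 results",
            "About 2,864 results",
            "About ERROR results",
            "About  results",  // two spaces: .* matches the empty string
            "About results"    // one space: the second literal space is missing
        };
        for (String sample : samples) {
            System.out.println("\"" + sample + "\" -> " + containsPattern(sample, regex));
        }
    }
}
```

The first four samples print `true`. The last one prints `false`, because the two literal spaces in the expression must both be present even when `.*` matches nothing.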
    

    (Response to comments.)

    To be honest, you should find a quick regular expressions tutorial. You're missing some very basic things and instead relying on your own intuitive sense of how "searching" should work, which is leading to confusion.

    I like teaching though, so here's a little more :-)

    Go to this RegExr link. I already set it up with this expression:

    /^About .* results$/gm
    

    Ignore the /^ and the $/gm. (If you really want to know, the two slashes are just the conventional notation for regular expressions. The ^ and $ are "anchors" that force a "full match"; that's why it seemed like "About" had to be in position 0. Whatever regex engine you're using, it seems to force anchors. The g is a flag that just means "highlight every match," and the m is a flag that means "treat every line as a separate entry.") Anyway, back to the main expression:

    About .* results
    

    And its matches:

    (Screenshot of the RegExr matches, not reproduced here.)

    See how, if you put a character on either side, it's no longer a match? Again, that's because of anchoring. The expression expects "A" as the first character, so "x" fails. The expression also expects the last character to be "s", so "x" would fail there too. But why did "About results" fail? It's because there's a space on each side of the .*. The .* wildcard is allowed to match nothing, but the spaces have to match just like letters and numbers. So a single space won't cut it; you need at least two.
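
Whether anchoring applies in Java depends on which `Matcher` method is called: `find()` looks for the pattern anywhere in the input, while `matches()` requires the whole input to match. The sketch below contrasts the two; the class name `AnchorDemo`, its helper names, and the padded sample string are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting unanchored and anchored matching.
class AnchorDemo {
    // True if the pattern occurs anywhere in the text.
    static boolean foundAnywhere(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // True only if the pattern matches the entire text.
    static boolean matchesWhole(String text, String regex) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        String padded = "xAbout 28 resultsx";

        // Unanchored search: the extra "x" on each side does not matter.
        System.out.println(foundAnywhere(padded, "About .* results"));   // true

        // Whole-input match: the extra characters make it fail.
        System.out.println(matchesWhole(padded, "About .* results"));    // false

        // Explicit anchors make find() fail on the padded string as well.
        System.out.println(foundAnywhere(padded, "^About .* results$")); // false
    }
}
```

Because the question's `containsPattern` uses `find()`, it behaves like the unanchored case here, so "About" does not need to be at position 0 of the page text.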

    You wrote that you tried 230 .* results. See, you're not understanding that regex works character by character, with certain "special" characters you can use. Your expression means, "A string that begins with 230, a space, then anything, a space, 'results', and nothing after."

    [...] how would I code regex to find the "230" in any position followed by "results", ie "foobar 230 foobar2 results"?

    In other words, you want to find a string that starts with anything, has 230 somewhere, has more of anything, a space, "results", and nothing more:

    .*230.* results
    

    Do you want the exact number, 230?

    .* 230 results
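
To compare the two expressions above, here is a short Java sketch. It uses `Pattern.matches`, which requires the whole input to match and so behaves much like the anchored setup described earlier. The class name `NumberSearchDemo`, the helper `matchesWhole`, and the sample strings other than the one quoted from the comment are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two "230" expressions.
class NumberSearchDemo {
    // Whole-input match, similar to wrapping the regex in ^ and $.
    static boolean matchesWhole(String text, String regex) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String anywhere = ".*230.* results"; // 230 somewhere before " results"
        String exact = ".* 230 results";     // " 230" directly before " results"

        String[] samples = {
            "foobar 230 foobar2 results",
            "foobar 230 results",
            "About 1230 results"
        };
        for (String sample : samples) {
            System.out.println("\"" + sample + "\""
                + " anywhere=" + matchesWhole(sample, anywhere)
                + " exact=" + matchesWhole(sample, exact));
        }
    }
}
```

The first expression accepts all three samples, including "About 1230 results", because `.*` can absorb the leading "1". The second accepts only "foobar 230 results", since it needs a space immediately before 230 and " results" immediately after it.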