
Selendroid as a web scraper


I intend to create an Android application that performs a headless login to a website and then scrapes content from the subsequent page while maintaining the logged-in session.

I first used HtmlUnit in a plain Java project and it worked fine, but I later found that HtmlUnit is not compatible with Android.

Then I tried the Jsoup library, sending an HTTP POST request to the login form. However, the resulting page does not load completely, because Jsoup does not execute JavaScript.

I was then advised to look at Selendroid, which is actually an Android test automation framework. What I need, though, is an HTML parser that supports JavaScript and runs on Android. I find Selendroid quite difficult to understand; I can't even figure out which of these dependencies to use:

  • selendroid-client
  • selendroid-standalone
  • selendroid-server

With Selenium WebDriver, the code would be as simple as the following. Can somebody show me a similar code example for Selendroid?

    WebDriver driver = new FirefoxDriver();
    driver.get("https://mail.google.com/");

    driver.findElement(By.id("email")).sendKeys(myEmail);
    driver.findElement(By.id("pass")).sendKeys(pass);

    // Click on 'Sign In' button
    driver.findElement(By.id("signIn")).click();

Also:

  1. Which dependencies should I add to my build.gradle file?
  2. Which Selendroid libraries should I import?

Solution

  • Unfortunately, I didn't get Selendroid to work, but I found a workaround: scraping the dynamic content using just Android's built-in WebView with JavaScript enabled.

    // "context" is assumed to be the Activity or Application context.
    mWebView = new WebView(context);
    mWebView.getSettings().setJavaScriptEnabled(true);
    // Expose HtmlHandler to page scripts under the name "HtmlHandler".
    mWebView.addJavascriptInterface(new HtmlHandler(), "HtmlHandler");

    mWebView.setWebViewClient(new WebViewClient() {
        @Override
        public void onPageFinished(WebView view, String url) {
            super.onPageFinished(view, url);

            // Only handle the page we asked for.
            if (url.equals(urlToLoad)) {
                // Pass the HTML source to the HtmlHandler.
                view.loadUrl("javascript:HtmlHandler.handleHtml(document.documentElement.outerHTML);");
            }
        }
    });

    mWebView.loadUrl(urlToLoad);
    

    The JavaScript expression document.documentElement.outerHTML returns the full HTML of the loaded page. The retrieved HTML string is then passed to the handleHtml method of the HtmlHandler class.

    class HtmlHandler {
        // Called from the page's JavaScript on a background thread, not the UI thread.
        @JavascriptInterface
        @SuppressWarnings("unused")
        public void handleHtml(String html) {
            // Scrape the content here.
        }
    }
    

    You can then use a library such as Jsoup to extract the content you need from the HTML string.
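The snippets above only read the page once it has loaded. For the login step described in the question, one possible approach with the same WebView is to inject a small script that fills in the form and clicks the button. The following is a minimal sketch, not a tested recipe: `LoginScript`, `jsString` and `build` are hypothetical names, and the element IDs are assumed to match the ones in the question's Selenium snippet ("email", "pass", "signIn"); a real page will likely differ.

```java
// Hypothetical helper (not part of any library): builds a JavaScript snippet
// that fills in a login form and clicks the submit button.
final class LoginScript {

    private LoginScript() {}

    /** Wraps a value in single quotes, escaping characters that would break a JS string literal. */
    static String jsString(String value) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default:   sb.append(c);
            }
        }
        return sb.append("'").toString();
    }

    /** Returns a self-invoking script; the element IDs are assumptions about the target page. */
    static String build(String emailId, String passId, String buttonId,
                        String email, String password) {
        return "(function(){"
                + "document.getElementById(" + jsString(emailId) + ").value=" + jsString(email) + ";"
                + "document.getElementById(" + jsString(passId) + ").value=" + jsString(password) + ";"
                + "document.getElementById(" + jsString(buttonId) + ").click();"
                + "})()";
    }
}
```

On API level 19 and above, the resulting string could be run from onPageFinished with `view.evaluateJavascript(LoginScript.build("email", "pass", "signIn", myEmail, pass), null)` once the login page has loaded. The WebView keeps its cookies between page loads, so the page that follows the login can then be read with the HtmlHandler approach shown earlier.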