java spring selenium automation java-service-wrapper

Exposing a web site through web services


I know this is a somewhat unusual question. There is a web application whose source code we cannot access, and we want to expose a few of its features as web services.

I was thinking of using something like Selenium WebDriver to simulate clicks on the application in response to each web service request.

I want to know whether there is a better solution or pattern for this.

I should mention that the application is written in Java with Spring MVC (it is not an SPA) and Spring Security, and that a CAS server provides SSO.


Solution

  • We do something similar to access web banking on behalf of a user, scrape their account data and obtain a credit score. In most cases, we have managed to reverse-engineer the mobile apps and sniff their traffic in order to use undocumented APIs. In the other cases, we have to fall back to web scraping.

    There are two types of applications you might have to scrape:

    • Data is essentially the same for every user, such as product listings on Amazon.
    • Data is specific to each user, as in a banking app.

    In the first case, you could keep your scraper running to populate a local database and serve the web service from that local data. In the latter case, you cannot do that, and you need to scrape the site on each user's request.

    I understand from your explanation that you are in the latter case.
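
    The difference between the two cases can be sketched roughly as follows. This is only an illustration using the JDK: the class names, the `Supplier`/`Function` scrapers and the in-memory map standing in for a local database are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

class ScrapeStrategies {

    /** Case 1: data shared by all users. A background job fills a local store; requests only read it. */
    static class SharedDataService {
        // Stand-in for a local database (assumption: an in-memory map is enough for the sketch).
        private final Map<String, String> localStore = new ConcurrentHashMap<>();
        // Hypothetical scraper that returns the whole listing as key -> value pairs.
        private final Supplier<Map<String, String>> siteScraper;

        SharedDataService(Supplier<Map<String, String>> siteScraper) {
            this.siteScraper = siteScraper;
        }

        /** Meant to be called periodically, e.g. from a ScheduledExecutorService. */
        void refresh() {
            localStore.putAll(siteScraper.get());
        }

        /** Serves a web service request without touching the remote site; null if unknown. */
        String lookup(String key) {
            return localStore.get(key);
        }
    }

    /** Case 2: data specific to each user. Nothing can be pre-fetched, so every request scrapes. */
    static class PerUserService {
        // Hypothetical scraper that logs in as the given user and returns that user's data.
        private final Function<String, String> userScraper;

        PerUserService(Function<String, String> userScraper) {
            this.userScraper = userScraper;
        }

        String fetch(String userId) {
            return userScraper.apply(userId);
        }
    }
}
```

    In the shared case the cost of scraping is paid by `refresh()`, not by callers of `lookup()`. In the per-user case every `fetch()` pays it, which is why response time becomes a concern.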

    When web scraping, you can run into web apps that are really difficult to handle:

    • Some may require you to send data from previous requests to the next
    • Others render most data on the client with JavaScript

    If either of these applies to you, Selenium will make your implementation easier, though at the cost of performance.

    Implementing the first case without Selenium requires a lot of trial and error to get things working, because you are simulating the requests yourself and need to know what data the server expects from the client. With Selenium, you perform the same interactions you would in the browser, and hence send the expected data.

    Implementing the second case requires your scraper to support JavaScript. AFAIK the best support is provided by Selenium. HtmlUnit claims to provide fair support, and I think jsoup provides no support for JavaScript.
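
    To give an idea of what the first case involves without Selenium, here is a minimal sketch that uses only the JDK's `java.net.http` client. It loads a login page, pulls a hidden form field out of the HTML and sends it back with the next request. The form field names `username` and `password`, the helper names, and the assumption that `name` precedes `value` inside the `<input>` tag are all hypothetical; a real site (a CAS login form, for example) has to be inspected to find out what it actually expects, and a proper HTML parser would be more robust than a regular expression.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormReplay {

    /** Finds value="..." of an <input> whose name attribute equals fieldName (name must come first). */
    static Optional<String> hiddenField(String html, String fieldName) {
        Pattern p = Pattern.compile(
                "<input[^>]*name=\"" + Pattern.quote(fieldName) + "\"[^>]*value=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** A client that keeps cookies between requests, so the session survives across calls. */
    static HttpClient sessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    /** GETs the login page, then POSTs the credentials together with the hidden token it contained. */
    static HttpResponse<String> submitLogin(HttpClient client, URI loginPage, String tokenField,
                                            String user, String password)
            throws IOException, InterruptedException {
        HttpResponse<String> page = client.send(
                HttpRequest.newBuilder(loginPage).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // Data from the previous response that the server expects to see again.
        String token = hiddenField(page.body(), tokenField)
                .orElseThrow(() -> new IllegalStateException("hidden field not found: " + tokenField));

        // "username" and "password" are assumed field names; check the real form.
        String form = "username=" + enc(user)
                + "&password=" + enc(password)
                + "&" + enc(tokenField) + "=" + enc(token);

        HttpRequest post = HttpRequest.newBuilder(loginPage)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
        return client.send(post, HttpResponse.BodyHandlers.ofString());
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

    Every extra hidden field, cookie or header the server checks has to be discovered and replayed by hand like this, which is where the trial and error comes from.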

    Finally, if your solution takes too long to respond, you can mitigate the problem by giving your web service a notification mechanism, similar to Webhooks or REST Hooks:

    1. A client of your web service makes a request for data, providing a URI at which they would like to be notified when the results are ready.
    2. Your service responds immediately with an id for the request and starts scraping the necessary info in the background.
    3. If you use the skinny payload model, when the scraping is done you store the response in your data store under an id identifying the original request. This response is exposed as a resource.
    4. You execute an HTTP POST on the URI provided by the client. In the body of the request you include the URI of the response resource.
    5. The client can now GET the response resource, and because the request and the response share the same id, the client can correlate the two.
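
    The five steps could be wired together along these lines. This is a sketch under several assumptions rather than a finished design: `ScrapeJobService`, the `/results/{id}` path, the in-memory map standing in for the data store and the pluggable `scraper` and `notifier` are all hypothetical names, and failed scrapes or failed callbacks are simply dropped here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

class ScrapeJobService {
    // Stand-in for the data store holding finished responses, keyed by request id.
    private final Map<String, String> results = new ConcurrentHashMap<>();
    private final Function<String, String> scraper;   // hypothetical: query -> scraped data
    private final BiConsumer<URI, String> notifier;   // hypothetical: (callback URI, body) -> notify

    ScrapeJobService(Function<String, String> scraper, BiConsumer<URI, String> notifier) {
        this.scraper = scraper;
        this.notifier = notifier;
    }

    /** Steps 1 and 2: accept the request, return an id at once, scrape in the background. */
    String submit(String query, URI callbackUri) {
        String id = UUID.randomUUID().toString();
        CompletableFuture
                .supplyAsync(() -> scraper.apply(query))
                .thenAccept(data -> {
                    results.put(id, data);                          // step 3: store under the request id
                    notifier.accept(callbackUri, "/results/" + id); // step 4: skinny payload, URI only
                });
        return id;
    }

    /** Step 5: what a GET on /results/{id} would return; empty while the scrape is still running. */
    Optional<String> result(String id) {
        return Optional.ofNullable(results.get(id));
    }

    /** A notifier that performs the HTTP POST of step 4 with the JDK client, ignoring the reply. */
    static BiConsumer<URI, String> httpNotifier(HttpClient client) {
        return (callbackUri, body) -> {
            HttpRequest request = HttpRequest.newBuilder(callbackUri)
                    .header("Content-Type", "text/plain")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            client.sendAsync(request, HttpResponse.BodyHandlers.discarding());
        };
    }
}
```

    Because the result is stored before the notifier runs, a client that reacts to the callback straight away will already find the resource. In a real service you would probably also want retries for the callback and an expiry for stored results.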