Tags: jsoup, wikipedia, stax, mediawiki-api

StAX vs jsoup: which is the better way to parse a webpage if XML is available through an API?


I am trying to scrape some data from roughly 100 Wikipedia pages, all of which share the same format. Wikipedia has made its API available, which returns the content in XML format, or I can get the data directly from each page using jsoup.

Which method should I use to scrape the data?


Solution

  • Since an API is available, you should use that approach. The content is well formed, and the representation is not going to change without you noticing, which may well be the case with the web page. HTML scraping is error prone: a minor change in the markup or styling can break your selectors and render your scraper useless.

    Since Wikipedia serves XML, it most probably uses a SOAP web service (though not necessarily). If that's the case, there should be a WSDL available, which you can use with the CXF framework to generate a web service client in no time. If you are not familiar with SOAP services, take a look at http://cxf.apache.org/docs/a-simple-jax-ws-service.html.

    CXF comes with some great POJO generator tools; check out wsdl2java. You give it a target WSDL and it generates all the classes you need to consume that web service.
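    As a rough sketch of what that looks like, wsdl2java can also be invoked from Java through the class that the command-line script wraps (org.apache.cxf.tools.wsdlto.WSDLToJava, with the cxf-tools-wsdlto modules on the classpath). The WSDL URL, output directory, and package name below are placeholders, not a real service:

    ```java
    import org.apache.cxf.tools.wsdlto.WSDLToJava;

    public class GenerateClient {
        public static void main(String[] args) {
            // Equivalent to running the wsdl2java script: -d sets the output
            // directory for generated sources, -p the target Java package.
            WSDLToJava.main(new String[] {
                "-d", "src/main/java",
                "-p", "com.example.wikiclient",       // placeholder package
                "http://example.org/service?wsdl"     // placeholder WSDL location
            });
        }
    }
    ```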

    Update

    Wikipedia actually exposes REST-style services; XML is just one of the content types it offers (JSON is another). The response is fairly simple. One could request JSON, deserialize the response with Gson, and then parse the attribute of interest, which is HTML content, with jsoup. A sketch of that pipeline follows.
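    A minimal sketch, assuming jsoup and Gson 2.8.6+ on the classpath. The page title, the response shape (with the default formatversion=1, the rendered HTML sits under parse -> text -> "*"), and the CSS selector are assumptions you would adapt to your own pages:

    ```java
    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class WikiApiScraper {
        public static void main(String[] args) throws Exception {
            // action=parse returns the rendered page HTML inside a JSON envelope.
            String api = "https://en.wikipedia.org/w/api.php"
                    + "?action=parse&prop=text&format=json"
                    + "&page=Java_(programming_language)";   // placeholder title

            // jsoup can fetch the raw JSON body if told not to insist on HTML.
            String json = Jsoup.connect(api).ignoreContentType(true).execute().body();

            // Gson navigates the envelope: parse -> text -> "*" holds the HTML.
            JsonObject root = JsonParser.parseString(json).getAsJsonObject();
            String html = root.getAsJsonObject("parse")
                              .getAsJsonObject("text")
                              .get("*").getAsString();

            // jsoup then parses that HTML fragment; the selector is only an
            // illustration, adjust it to the elements you actually need.
            Document doc = Jsoup.parse(html);
            doc.select("table.infobox tr")
               .forEach(row -> System.out.println(row.text()));
        }
    }
    ```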

    Update

    1. Create a Maven project, as shown here: https://www.youtube.com/watch?v=uv9tXFrTLtI
    2. Add the StAX dependency to your pom: http://mvnrepository.com/artifact/stax/stax/1.2.0 (on Java 6 and later the StAX API is already bundled with the JDK)
    3. Get coding, beginning with an example such as http://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html or the sketch below
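    For reference, here is a minimal StAX sketch using the javax.xml.stream API from the JDK. The inline XML and the element name it looks for are placeholders, not the real MediaWiki schema:

    ```java
    import java.io.StringReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxExample {
        public static void main(String[] args) throws Exception {
            // Tiny inline document standing in for an XML API response;
            // element names here are placeholders, not the MediaWiki schema.
            String xml = "<api><parse title=\"Example\">"
                    + "<text>rendered page content</text></parse></api>";

            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));

            // StAX is a pull parser: advance event by event and react only to
            // the elements of interest, without building a full DOM in memory.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "text".equals(reader.getLocalName())) {
                    System.out.println("text element: " + reader.getElementText());
                }
            }
            reader.close();
        }
    }
    ```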