I have a huge wiki dump (~50 GB after extracting the tar.bz file) from which I want to extract the individual articles. I am using the wikixmlj library to extract the contents, and it does give the title, text, the categories mentioned at the end, and a few other attributes. But I am more interested in the external links/references associated with each article, for which this library doesn't provide any API.
Is there any elegant and efficient way to extract them, other than parsing the wikitext returned by the getWikiText() API?
Or is there any other Java library for this dump file that gives me the title, content, categories and the references/external links?
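For reference, this is roughly what my code looks like (simplified; the class names are the ones from wikixmlj's edu.jhu.nlp.wikipedia package, and the dump file name is just a placeholder):

    import edu.jhu.nlp.wikipedia.PageCallbackHandler;
    import edu.jhu.nlp.wikipedia.WikiPage;
    import edu.jhu.nlp.wikipedia.WikiXMLParser;
    import edu.jhu.nlp.wikipedia.WikiXMLParserFactory;

    public class DumpReader {
        public static void main(String[] args) throws Exception {
            // Placeholder file name for the pages-articles dump.
            WikiXMLParser parser = WikiXMLParserFactory.getSAXParser("enwiki-pages-articles.xml.bz2");

            parser.setPageCallback(new PageCallbackHandler() {
                public void process(WikiPage page) {
                    String title = page.getTitle();
                    String text = page.getWikiText();   // full wiki markup of the article
                    // page.getCategories() returns the categories listed at the end,
                    // but there is no comparable call for external links/references.
                }
            });

            parser.parse();
        }
    }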
The XML dump contains exactly what the library is offering you: the page text along with some basic metadata. It doesn't contain any metadata about categories or external links.
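Categories and external links only show up as markup inside that text (e.g. [[Category:Foo]] at the bottom of an article, or [http://example.com Some label] in an External links section), so anything that extracts them ultimately has to look at the wikitext. As a rough illustration of what that means, not a real wikitext parser (the regex below only catches the bracketed link syntax, not bare URLs or citation templates):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ExternalLinkSketch {

        // Matches bracketed external links such as [http://example.com Some label].
        private static final Pattern EXTERNAL_LINK =
                Pattern.compile("\\[(https?://[^\\s\\]]+)(?:\\s+[^\\]]*)?\\]");

        // Pulls the URLs of bracketed external links out of a page's wikitext.
        public static List<String> extractExternalLinks(String wikiText) {
            List<String> urls = new ArrayList<String>();
            Matcher m = EXTERNAL_LINK.matcher(wikiText);
            while (m.find()) {
                urls.add(m.group(1));
            }
            return urls;
        }
    }

You could call extractExternalLinks(page.getWikiText()) from the wikixmlj page callback, but that is exactly the "parse the wikitext yourself" approach you were hoping to avoid.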
The way I see it, you have three options: