
How can I use the Wikipedia API to extract/parse the link I am looking for?


On Wikipedia, following the first link of about 95% of articles eventually leads to the Philosophy page. I am trying to write a program in Java that takes any Wikipedia link and follows the first link on the page (skipping citation, sound, and other extraneous links, and also ignoring parenthesized links).

For example, if you start with the URL http://en.wikipedia.org/wiki/Dutch_people, it should follow Ethnic group (http://en.wikipedia.org/wiki/Ethnic_group) and so on until it reaches Philosophy.

See the Wikipedia page Getting_to_Philosophy for background, and try http://xefer.com/wikipedia (type any word) to see how it works.

I have already written the back end, which stores the data in a database with 3 columns (Unique_URL_Id, URL_Link, Next_URL_Id) so that printing the whole path later on will be easier.

The back end works fine (if I just give it a list of links to follow). However, extracting and finding the first link is not working as it should.

Here is sample code I wrote just for extracting links from a URL using the jsoup API:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void extractWikiPage(String title) throws IOException {
    // Fetch the article for the given title, e.g. "Europe"
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/" + title).get();

    // Get the first paragraph, where the main body content usually starts
    Element body = doc.getElementsByTag("p").first();
    if (body == null) {
        return; // the page has no <p> element at all
    }
    System.out.println(body);

    // Print every link found in that paragraph
    Elements href = body.getElementsByTag("a");
    for (Element h : href) {
        System.out.println(h);
    }
}

I am just finding the first occurrence of '<p>', since that's where 95% of the links to the next page start. In that paragraph I am trying to get all the links, but I need the first one that satisfies the conditions I described above.

How can I use the Wikipedia API to extract the data I am looking for? I appreciate your help.


Solution

  • /w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&rawcontinue=&titles=Dutch_people is the query that returns the wikitext for that page.

    You'll have to parse that result to get the data you want. You'll be looking for the first thing inside [[double square brackets]] that doesn't start with "something:", which eliminates all interwiki links, categories, and file/media pages. You will probably want to search only after the infobox (using /\{\{Infobox(.*?)\}\}/i or something like that) to exclude links in the infobox and in any maintenance tags that might be on the page.
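
The parsing step described above can be sketched in Java roughly as follows. This is only a minimal sketch under some assumptions: the wikitext has already been pulled out of the JSON response (with whatever JSON library you prefer), and the names FirstWikiLink, apiUrl, closeOf and firstLink are made up for illustration. Instead of the regex, it swaps in a small depth-counting scan, because templates such as the infobox can contain nested {{...}}. It also skips links inside parentheses, as the question asks. It does not handle <ref> tags, tables, or italics, so treat it as a starting point only.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FirstWikiLink {

    /** Builds the revisions query from the answer for one page title. */
    static String apiUrl(String title) {
        return "https://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json"
                + "&rvprop=content&rvlimit=1&rawcontinue=&titles="
                + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    /** Returns the index of the "]]" that closes the "[[" at open, or -1 if unbalanced. */
    static int closeOf(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length() - 1; i++) {
            if (s.startsWith("[[", i)) {
                depth++;
                i++;
            } else if (s.startsWith("]]", i)) {
                depth--;
                if (depth == 0) {
                    return i;
                }
                i++;
            }
        }
        return -1;
    }

    /** Returns the target of the first qualifying link, or null if there is none. */
    static String firstLink(String wikitext) {
        int braces = 0; // nesting depth of {{templates}} (infobox, maintenance tags)
        int parens = 0; // nesting depth of (parenthesized text) outside links
        for (int i = 0; i < wikitext.length(); i++) {
            if (wikitext.startsWith("{{", i)) {
                braces++;
                i++;
            } else if (wikitext.startsWith("}}", i)) {
                if (braces > 0) braces--;
                i++;
            } else if (braces > 0) {
                continue; // ignore everything inside a template
            } else if (wikitext.charAt(i) == '(') {
                parens++;
            } else if (wikitext.charAt(i) == ')') {
                if (parens > 0) parens--;
            } else if (wikitext.startsWith("[[", i)) {
                int end = closeOf(wikitext, i);
                if (end < 0) {
                    return null; // unbalanced brackets, give up
                }
                // Keep only the page title: drop "|label" and "#section"
                String target = wikitext.substring(i + 2, end).split("[|#]", 2)[0].trim();
                // "something:" prefixes are files, categories, interwiki links
                if (parens == 0 && !target.isEmpty() && !target.contains(":")) {
                    return target;
                }
                i = end + 1; // jump past the closing "]]"
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample text, not the real article
        String sample = "{{Infobox ethnic group|related=[[Frisians]] {{small|x}}}}\n"
                + "[[File:Map.png|thumb|A map of [[Europe]] and [[Asia]]]]\n"
                + "The '''Dutch''' ([[Dutch language|Dutch]]: ''Nederlanders'') are an "
                + "[[ethnic group]] native to the [[Netherlands]].";
        System.out.println(apiUrl("Dutch_people"));
        System.out.println(firstLink(sample)); // prints "ethnic group"
    }
}
```

To follow the chain, you would request apiUrl for the returned title, extract the wikitext from the response, and call firstLink again until you reach Philosophy or hit a page you have already visited.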