
Closest Wikipedia page for a given text


Say, for example, a person enters the query "d dark knight rses". I want to find the nearest Wikipedia page, which is http://en.wikipedia.org/wiki/The_Dark_Knight_Rises

What are possible ways to do that?

One simple way I can think of is to search Google for the query with the term "wikipedia" appended, then look for the first Wikipedia page in the results. If there is no Wikipedia page even in the top 5 pages, return "Sorry".

But is there another convenient method or API call that avoids using Google?

Edit: CLOSEST. For example, "d dark night" might result in "The Dark Night" or "The Dark Knight". Both of these are valid answers. Although the former is closer to the query, I guess the latter is the better answer, because that is what the user most likely meant.


Solution

  • You could use the official Wikipedia API. Here is an example of an opensearch call with the query "dark night":

    $ curl "https://en.wikipedia.org/w/api.php?action=opensearch&search=dark%20night"
    

    This returns:

    [
        "dark night", 
        [
            "Dark Night", 
            "Dark Night of the Soul", 
            "Dark Night of the Soul (album)", 
            "Dark Night of the Scarecrow", 
            "Dark Night (song)", 
            "Dark Night (film)", 
            "Dark night rises", 
            "Dark night (roller coaster)", 
            "Dark night sky paradox"
        ]
    ]
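
The same call can be made from a script. The following is a minimal sketch, assuming Python 3 and only the standard library. The helper names (build_opensearch_url, parse_titles, closest_titles) and the User-Agent string are made up for illustration, and the parser assumes the response shape shown above, with the title list as the second element.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_opensearch_url(query, limit=5):
    """Build the opensearch URL for a query (hypothetical helper)."""
    params = {
        "action": "opensearch",
        "search": query,
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_titles(payload):
    """Extract the title list from a decoded opensearch response.

    Assumes the shape shown above: [query, [title, title, ...], ...].
    """
    if not isinstance(payload, list) or len(payload) < 2:
        return []
    return list(payload[1])


def closest_titles(query, limit=5):
    """Call the API over the network and return the suggested titles."""
    request = Request(
        build_opensearch_url(query, limit),
        # Placeholder User-Agent; replace with one identifying your tool.
        headers={"User-Agent": "closest-page-sketch/0.1 (example)"},
    )
    with urlopen(request, timeout=10) as response:
        return parse_titles(json.load(response))
```

Calling closest_titles("dark night") needs network access. Taking the first returned title as the "closest" page is only one possible policy, and an empty list can be mapped to the "Sorry" case from the question.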
    

    UPDATE: Another approach is to download a Wikipedia data dump and search it locally.
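
For the local approach, one possible sketch is approximate string matching over page titles. This assumes the titles have already been extracted from the dump into a Python list (that step is not shown). It uses difflib from the standard library as a stand-in similarity measure, not anything Wikipedia itself provides. The function name closest_local_titles and the cutoff value are illustrative choices.

```python
import difflib


def closest_local_titles(query, titles, n=3, cutoff=0.5):
    """Return up to n titles most similar to the query, best match first.

    Comparison is case-insensitive; the original spelling of each title
    is returned. `titles` is assumed to be a list loaded from the dump.
    """
    # Map lowercased titles back to their original form
    # (the first occurrence wins if two titles differ only in case).
    by_lower = {}
    for title in titles:
        by_lower.setdefault(title.lower(), title)

    matches = difflib.get_close_matches(
        query.lower(), list(by_lower), n=n, cutoff=cutoff
    )
    return [by_lower[m] for m in matches]


if __name__ == "__main__":
    sample_titles = [
        "The Dark Knight Rises",
        "Dark Night of the Soul",
        "Dark night sky paradox",
    ]
    print(closest_local_titles("d dark knight rses", sample_titles))
```

This compares the query against every title, so it only scales to a small title list. A full dump would likely need an index (for example, character n-grams) to narrow the candidates first. Pure string similarity also ignores popularity, so it cannot by itself prefer "The Dark Knight" over "The Dark Night" in the way the question asks.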