Search code examples
xpathdomdocumentwikipediawikipedia-apidomxpath

PHP + Wikipedia: Get content from the first paragraph in a Wikipedia article?


I’m trying to use Wikipedia’s API (api.php) to get the content of a Wikipedia article provided by a link (like: http://en.wikipedia.org/wiki/Stackoverflow). And what I want is to get the first paragraph (which in the example of the Stackoverflow wiki article is: Stack Overflow is a website part of the Stack Exchange network[2][3] featuring questions and answers on a wide range of topics in computer programming.[4][5][6]).

I’m going to do some data manipulation with it.

I’ve tried with the API url: http://en.wikipedia.org/w/api.php?action=parse&page=Stackoverflow&format=xml but it gives me some kind of error. It outputs:

<api>
<parse displaytitle="Stackoverflow" revid="289948401">
<text xml:space="preserve">
<ol> <li>REDIRECT <a href="/wiki/Stack_Overflow" title="Stack Overflow">Stack Overflow</a></li> </ol> <!-- NewPP limit report Preprocessor node count: 1/1000000 Post-expand include size: 0/2048000 bytes Template argument size: 0/2048000 bytes Expensive parser function count: 0/500 --> <!-- Saved in parser cache with key enwiki:pcache:idhash:21772484-0!*!0!!*!* and timestamp 20110525165333 -->
</text>
<langlinks/>
<categories/>
<links>
<pl ns="0" exists="" xml:space="preserve">Stack Overflow</pl>
</links>
<templates/>
<images/>
<externallinks/>
<sections/>
</parse>
</api>

I found this snippet of code that I’ve tried

$doc = new DOMDocument();
$doc->loadHTML($wikiPage);
$xpath = new DOMXpath($doc);
$nlPNodes = $xpath->query('//div[@id="bodyContent"]/p');
$nFirstP = $nlPNodes->item(0);
$sFirstP = $doc->saveXML($nFirstP);
echo $sFirstP; 

but I can’t get the HTML content in the variable $wikiPage.

I do not know if this is the best or most optimal way to do it so please feel free to comment on that and otherwise any suggestion or solutions would be very appreciated.

Thank you
- Mestika


Solution

  • You're getting the contents of a redirect page. Replace 'Stackoverflow' with 'Stack_Overflow' and it should work.

    The API does have support for an &redirects option, which will resolve redirects for you.