php parsing simplexml domdocument mediawiki-api

How to break down and parse specific Wikipedia text


I have the following working example, which retrieves a specific Wikipedia page and returns a SimpleXMLElement object:

ini_set('user_agent', 'michael@example.com');
$doc = new DOMDocument();
$doc->load('http://en.wikipedia.org/w/api.php?action=parse&page=Main%20Page&format=xml');

$xml = simplexml_import_dom($doc);

print '<pre>';
print_r($xml);
print '</pre>';

Which returns:

SimpleXMLElement Object
(
    [parse] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [title] => Main Page
                    [revid] => 472210092
                    [displaytitle] => Main Page
                )

            [text] => <body><table id="mp-topbanner" style="width: 100%;"...

Silly question (mind blank). What I'm trying to do is capture the $xml->parse->text element and, in turn, parse that. Ultimately I want the following object returned; how do I achieve this?

SimpleXMLElement Object
(
    [body] => SimpleXMLElement Object
        (
            [table] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [id] => mp-topbanner
                            [style] => width:100% ...

Solution

  • After grabbing a fresh tea and eating a banana, here's the solution I've come up with:

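    // Wikimedia asks API clients to send an identifying User-Agent.
    ini_set('user_agent', 'michael@example.com');

    // Fetch the API response and read the <text> element, which holds the page HTML as a string.
    $doc = new DOMDocument();
    $doc->load('http://en.wikipedia.org/w/api.php?action=parse&page=Main%20Page&format=xml');
    $nodes = $doc->getElementsByTagName('text');
    $str = $nodes->item(0)->nodeValue;

    // Parse that HTML string into its own DOM document.
    libxml_use_internal_errors(true); // collect HTML parser warnings instead of printing them
    $html = new DOMDocument();
    $html->loadHTML($str);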

    This then allows me to get an element's value, which is what I was after. For example:

    echo "Some value: ";
    echo $html->getElementById('someid')->nodeValue;
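For anyone who wants the SimpleXMLElement object described in the question rather than a DOM lookup, here is a minimal sketch of one way to get there. It assumes the parsed HTML can be handed to simplexml_import_dom(), and it uses a made-up inline string in place of the live API response so that it runs without network access. The sample markup and the variable names $apiXml, $sxml, $table and $node are illustrative only.

```php
<?php
// Made-up stand-in for the API response (assumption: the real response
// carries the rendered page HTML as escaped text inside <text>).
$apiXml = <<<'XML'
<api>
  <parse title="Main Page">
    <text xml:space="preserve">&lt;table id="mp-topbanner" style="width: 100%;"&gt;&lt;tr&gt;&lt;td&gt;Welcome&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</text>
  </parse>
</api>
XML;

$doc = new DOMDocument();
$doc->loadXML($apiXml);

// Same step as in the answer: pull the HTML string out of the <text> element.
$str = $doc->getElementsByTagName('text')->item(0)->nodeValue;

// Collect HTML parser warnings instead of printing them.
libxml_use_internal_errors(true);
$html = new DOMDocument();
$html->loadHTML($str);
libxml_clear_errors();

// loadHTML() wraps a fragment in <html><body>, so the imported root is <html>
// and the table sits under its <body> child.
$sxml = simplexml_import_dom($html);
$table = $sxml->body->table;

echo $table['id'], "\n";
echo $table['style'], "\n";

// getElementById() returns null when nothing matches, so check before use.
$node = $html->getElementById('mp-topbanner');
echo $node !== null ? trim($node->textContent) : 'not found', "\n";
```

Calling print_r($sxml) on the result should show a [body] entry containing the table and its attributes, which is the shape asked for in the question. On a real Wikipedia page the markup is far larger, so the exact path to an element will differ from this sample.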