
Remove Unwanted Text When Using (getElementsByTagName)


Is there a way to remove unwanted text when using getElementsByTagName? For example, this code gets the published date of a movie for my site:

$spans = $dom->getElementsByTagName('span');
for ($i = 0; $i < $spans->length; $i++) {
    $itemprop = $spans->item($i)->getAttribute("itemprop");
    if ($itemprop == "datePublished"){
        if ($spans->item($i)->textContent!='-'){
            $res['published'] = trim($spans->item($i)->textContent);
        }
    }
}

But instead of getting this:

12 July 2011

it gets this:

12 July 2011 10:47 PM, UTC

Is there any code I could add to remove this part?

10:47 PM, UTC

Solution

  • You could use a regular expression to pull out the value:

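    // Trim first so the ^ anchor is not defeated by leading whitespace
    preg_match('/^\d+ \w+ \d+/', trim($spans->item($i)->textContent), $matches);
    // The pattern has no capture group, so the matched date is at index 0
    $published_date = $matches[0];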
    

    As long as the format of the date doesn't change, this will work. A more robust option is to parse the string with DateTime::createFromFormat, which returns false when the text does not match the format. This format string should match the sample value:

    $published_date = DateTime::createFromFormat("d M Y h:i A, e", trim($spans->item($i)->textContent));
    

    Edit: Updated original code from question with recommended changes:

    $spans = $dom->getElementsByTagName('span');
    for($i=0; $i < $spans->length; $i++){
        $itemprop = $spans->item($i)->getAttribute("itemprop");
        if ($itemprop == "datePublished"){
            if ($spans->item($i)->textContent!='-'){
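                $text_content = trim($spans->item($i)->textContent);
                $published_date = DateTime::createFromFormat("d M Y h:i A, e", $text_content);
                // createFromFormat returns false when the text does not match the format
                if ($published_date !== false){
                    // "j F Y" gives e.g. "12 July 2011"; "d M Y" would give "12 Jul 2011"
                    $res['published'] = $published_date->format("j F Y");
                }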
            }
        }
    }
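
To try the approach above outside the real page, here is a minimal self-contained sketch. The sample markup and the helper name extractPublishedDate are assumptions made for illustration and are not part of the original code. It also assumes the date text always has the shape "12 July 2011 10:47 PM, UTC".

```php
<?php
// Hypothetical helper (not part of the original answer): returns only the
// date portion of a string such as "12 July 2011 10:47 PM, UTC", or null
// when the text does not match the assumed format.
function extractPublishedDate(string $text): ?string
{
    $date = DateTime::createFromFormat('j F Y g:i A, e', trim($text));
    return $date === false ? null : $date->format('j F Y');
}

// Assumed sample markup, modelled on the value shown in the question.
$html = '<html><body>'
      . '<span itemprop="datePublished">12 July 2011 10:47 PM, UTC</span>'
      . '<span itemprop="datePublished">-</span>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$res = [];
foreach ($dom->getElementsByTagName('span') as $span) {
    // Skip spans that do not carry the published date.
    if ($span->getAttribute('itemprop') !== 'datePublished') {
        continue;
    }
    // Placeholder values such as "-" fail to parse and are skipped.
    $published = extractPublishedDate($span->textContent);
    if ($published !== null) {
        $res['published'] = $published;
    }
}

echo $res['published'], "\n"; // prints "12 July 2011"
```

Because unparseable text yields null rather than an exception, the separate check for '-' is no longer needed in this version. If the site ever changes its date format, $res['published'] is simply left unset.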