php, web-scraping, domparser

Unable to print only the search result links after scraping a search results page


I am new to web scraping and I am using Simple HTML DOM to scrape data from booking.com. I am having trouble printing only the search result URLs. My code is below:

<?php

    include 'simple_html_dom.php';

    $searchText = "Venice";
    $searchText = str_replace(" ", "+", $searchText);

    $url = "https://www.booking.com/searchresults.en-gb.html?aid=1781605&lang=en-gb&sid=3bb432f656e368125330f71ea0e74e36&sb=1&src=index&src_elem=sb&error_url=https://www.booking.com/index.en-gb.html?aid=1781605;sid=3bb432f656e368125330f71ea0e74e36;sb_price_type=total;srpvid=dc2798d544dd007f&;&ss=".$searchText."&is_ski_area=0&ssne=".$searchText."&ssne_untouched=".$searchText."&dest_id=-132007&dest_type=city&checkin_year=2019&checkin_month=5&checkin_monthday=19&checkout_year=2019&checkout_month=5&checkout_monthday=20&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1";


    print $url."<br>";


    $html = file_get_html($url);

    $i = 0;

    $linkObjs = $html->find('a');

    foreach ($linkObjs as $linkObj) {
        
        $link  = trim($linkObj->href);

        /*if (!preg_match('/^https?/', $link) && preg_match('/^hotel/', $link, $matches) && preg_match('/^https?/', $matches[1])) {
            $link = matches[1];
        } else if (!preg_match('/^https?/', $link)) {
            continue;
        }*/

        if (!preg_match('/^https?/', $link)) {
            continue;
        }

        $i++;

        echo "Link: ". $link . "<br/><hr/>";

    }
?>

The problem is that I want to print only the search result links, which have the /hotel/ path in the URL, for example https://www.booking.com/hotel/it/nh-collection-venezia-palazzo-barocci.en-gb.html. I don't understand how to set up the regular expression to print only the search result URLs, along with their titles.


Solution

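  • The ^ in an expression asserts the start of the string, so the commented-out condition below can never match a URL like https://www.booking.com/hotel/...: the first clause requires that $link does not start with http, the second requires that it starts with hotel, and the third reads $matches[1] even though the pattern has no capture group: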

    if (!preg_match('/^https?/', $link) && preg_match('/^hotel/', $link, $matches) && preg_match('/^https?/', $matches[1])) {
    

    If you want to use preg_match, a single expression can check that the string starts with http (with an optional s) and contains /hotel/ further along:

    ^https?://.*?/hotel/
    
    • ^ Start of string
    • https?:// Match http, optional s, ://
    • .*? Match any character except a newline, as few times as possible (non-greedy)
    • /hotel/ Match literally


    For example:

    if (!preg_match('~^https?://.*?/hotel/~', $link)) {
        continue;
    }
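To see the filter in isolation, here is a minimal sketch that applies the same pattern to a small hard-coded list instead of a live page. The helper name isHotelLink and every sample link except the booking.com hotel URL from the question are made up for illustration.

```php
<?php
// Hypothetical helper: true when $link is an absolute http(s) URL
// that contains the /hotel/ path segment somewhere after the scheme.
function isHotelLink(string $link): bool
{
    return preg_match('~^https?://.*?/hotel/~', $link) === 1;
}

// Sample hrefs; only the first one comes from the question.
$links = [
    'https://www.booking.com/hotel/it/nh-collection-venezia-palazzo-barocci.en-gb.html',
    'https://www.booking.com/index.en-gb.html',
    '/hotel/it/relative-example.html',
    'javascript:void(0)',
];

// Keep only the links that pass the check and reindex the array.
$hotelLinks = array_values(array_filter($links, 'isHotelLink'));

foreach ($hotelLinks as $link) {
    echo "Link: " . $link . "\n";
}
```

Note that the relative /hotel/... link is skipped because the pattern requires the http or https scheme at the start of the string.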
    

    Without using a regex, you could also use a combination of substr and strpos:

    if (!(substr($link, 0, 4 ) === "http" && strpos($link, '/hotel/') !== false)) {
        continue;
    }
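The question also asks for the title of each result. As a rough sketch, the same substr/strpos check can be combined with the anchor text. To keep the example self-contained it swaps in PHP's built-in DOMDocument and a made-up HTML snippet rather than Simple HTML DOM and the live page; the function name extractHotelLinks and the sample markup are assumptions. In the original Simple HTML DOM loop, the link text should be available as $linkObj->plaintext.

```php
<?php
// Made-up markup standing in for a fetched results page.
$sampleHtml = <<<HTML
<html><body>
<a href="https://www.booking.com/hotel/it/nh-collection-venezia-palazzo-barocci.en-gb.html"> Example hotel name </a>
<a href="https://www.booking.com/index.en-gb.html">Home</a>
<a href="/hotel/it/relative-example.html">Relative link</a>
</body></html>
HTML;

// Hypothetical helper: returns url/title pairs for absolute links containing /hotel/.
function extractHotelLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $results = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $link = trim($anchor->getAttribute('href'));
        // Same check as above: starts with "http" and contains "/hotel/".
        if (!(substr($link, 0, 4) === 'http' && strpos($link, '/hotel/') !== false)) {
            continue;
        }
        $results[] = ['url' => $link, 'title' => trim($anchor->textContent)];
    }
    return $results;
}

foreach (extractHotelLinks($sampleHtml) as $hotel) {
    echo "Title: " . $hotel['title'] . "\n";
    echo "Link: " . $hotel['url'] . "\n";
}
```

On a real results page the anchor text may be empty or may wrap nested elements, so the title may need to be read from a more specific element than the link itself.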
    
