
Simple data scraping using a PHP loop/foreach


I have some code that extracts the string between two other strings (a "sandwich"). It works, but I need to loop through several different "sandwich" strings.

//needle in haystack
$result = 'sandwich: Today is a nice day.
    sandwich: Today is a cloudy day.
    sandwich: Today is a rainy day.
    sandwich type 2: Yesterday I had an awesome time. 
    sandwich type 2: Yesterday I had an great time.';

$beginString = 'Today is a';   // strpos() is case-sensitive
$endString = 'day';

function extract_unit($haystack, $keyword1, $keyword2) {
    $return = array();
    $a = 0;                                      // search offset

    while (($a = strpos($haystack, $keyword1, $a)) !== false) {   // loop until $keyword1 is no longer found
        $a += strlen($keyword1);                 // move the offset to just after $keyword1

        if (($b = strpos($haystack, $keyword2, $a)) !== false) {  // if $keyword2 is found after that offset
            $return[] = trim(substr($haystack, $a, $b - $a)); // store the text in between
        }
    }
    return $return;  
}

$text = $result;
$unit = extract_unit($text, $beginString, $endString);
print_r($unit);

// $unit contains: nice, cloudy, rainy

I need to loop through different types of sentences/sandwiches and capture all the adjectives (nice, cloudy, rainy, awesome, great):

//needle in haystack
$result = 'sandwich: Today is a nice day.
    sandwich: Today is a cloudy day.
    sandwich: Today is a rainy day.
    sandwich type 2: Yesterday I had an awesome time. 
    sandwich type 2: Yesterday I had an great time.';

$beginString1 = 'Today is a';
$endString1 = 'day';
$beginString2 = 'Yesterday I had an';
$endString2 = 'time';

[scraping code with loop...]
print_r($unit);

The goal is to end up with this array:

Array ( [0] => nice [1] => cloudy [2] => rainy [3] => awesome [4] => great ) 

Any ideas? Much appreciated.


Solution

  • You could use a regular expression to extract the strings. If you don't mind using arrays instead of separate variables, here is some sample code:

    $starts = array('Today is a', 'Yesterday I had an');
    $ends = array('day', 'time');
    
    $haystack = array(
        'Today is a nice day.',
        'Today is a cloudy day.',
        'Today is a rainy day.',
        'Yesterday I had an awesome time.',
        'Yesterday I had an great time.'
    );
    
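    function extract_unit($haystack, $starts, $ends){
    
        $return = array();
    
        // Escape the delimiters so they are matched literally
        $starts = array_map(function($s){ return preg_quote($s, '/'); }, $starts);
        $ends = array_map(function($s){ return preg_quote($s, '/'); }, $ends);
    
        // Capture the text between any start string and any end string
        $reg = '/(?:' . implode('|', $starts) . ')\s*(.*?)\s*(?:' . implode('|', $ends) . ')/';
    
        foreach($haystack as $str){
    
            if(preg_match($reg, $str, $match)) $return[] = $match[1];
    
        }
    
        return $return;
    
    }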
    
    print_r (extract_unit($haystack, $starts, $ends));
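
    When start and end strings are spliced into a pattern like this, any regex metacharacters they contain are interpreted as part of the pattern. The following is a minimal sketch of the difference preg_quote() makes; the strings $text, $start and $end are made up for illustration and are not part of the question's data.

```php
<?php
// Hypothetical data: the start string contains ".", which is a regex wildcard.
$text  = 'axb= wrong; a.b= right;';
$start = 'a.b=';
$end   = ';';

// Unescaped: "." matches any character, so "axb=" is accepted as a start.
preg_match('/' . $start . '\s*(.*?)\s*' . $end . '/', $text, $loose);

// Escaped: preg_quote() makes every character of the delimiters literal.
$pattern = '/' . preg_quote($start, '/') . '\s*(.*?)\s*' . preg_quote($end, '/') . '/';
preg_match($pattern, $text, $strict);

echo $loose[1], "\n";   // wrong
echo $strict[1], "\n";  // right
```

    Passing '/' as the second argument also escapes the pattern delimiter, which matters if a start or end string contains a slash.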
    

    EDIT

    Following @ven's comments, I've made some changes to the code; it is now more precise:

    //---Array with all sandwiches
    $between = array(
        array('hay1=', 'hay=Gold'),
        array('hay2=', 'hay=Silver')
    );
    
    $haystack = 'Data set 1: hay2= this is a bunch of hay  hay1= Gold_Needle hay=Gold
                 Data Set 2: hay2=Silver_Needle hay=Silver';
    
    function extract_unit($haystack, $between){
    
        $return = array();
    
        foreach($between as $item){
    
            // Escape the delimiters so they are matched literally
            $reg = '/' . preg_quote($item[0], '/') . '\s*(.*?)\s*' . preg_quote($item[1], '/') . '/';
    
            preg_match_all($reg, $haystack, $found);
    
            $return = array_merge($return, $found[1]);
    
        }
    
        return $return;
    
    }
    
    print_r (extract_unit($haystack, $between));
    

    The result will be:

    Array
    (
        [0] => Gold_Needle
        [1] => Silver_Needle
    )
    

    A runnable sample of this code is available on Ideone.
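
If you would rather keep the strpos() approach from the question, the same idea can be applied by looping over an array of start/end pairs. This is only a sketch under a few assumptions: the function name extract_between_pairs and the $pairs array are hypothetical, and the start strings are written with the same capitalisation as the text, since strpos() is case-sensitive.

```php
<?php
// Hypothetical helper: the question's strpos technique, run once per (start, end) pair.
function extract_between_pairs(string $haystack, array $pairs): array
{
    $found = array();

    foreach ($pairs as $pair) {
        list($start, $end) = $pair;
        $offset = 0;

        // Strict comparison, so a match at position 0 is not mistaken for "not found".
        while (($a = strpos($haystack, $start, $offset)) !== false) {
            $a += strlen($start);              // first character after the start string
            $b = strpos($haystack, $end, $a);  // next end string after that point

            if ($b === false) {
                break;                         // no closing string left for this pair
            }

            $found[] = trim(substr($haystack, $a, $b - $a));
            $offset = $b + strlen($end);       // continue searching after the end string
        }
    }

    return $found;
}

$result = 'sandwich: Today is a nice day.
    sandwich: Today is a cloudy day.
    sandwich: Today is a rainy day.
    sandwich type 2: Yesterday I had an awesome time.
    sandwich type 2: Yesterday I had an great time.';

$pairs = array(
    array('Today is a', 'day'),
    array('Yesterday I had an', 'time'),
);

$unit = extract_between_pairs($result, $pairs);
print_r($unit);
```

With the sample text from the question this should collect nice, cloudy, rainy, awesome and great, in that order. Results are grouped by pair rather than by position in the text, so the order of $pairs determines the order of the output.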