Tags: php, arrays, string, preg-match-all, text-extraction

Get substrings from text that start after one of an array of keywords and stop before the next keyword


I want to write a function that accepts two parameters, $text and $keys, where $keys is an array of keywords.

The output should be an array whose keys are the keywords passed to the function (only those found in the text) and whose values are the text that follows each keyword, up to the next keyword or the end of the text. If a keyword is repeated in the text, only its last value should be kept.

For example:

Visualized Text: Lorem Ipsum is simply one dummy text of the printing and two typesetting industry. Lorem Ipsum has been the industry's one standard dummy text ever since the three 1500s.

$text = 'Lorem Ipsum is simply one dummy text of the printing and  two typesetting industry. Lorem Ipsum has been the industry\'s one standard dummy text ever since the three 1500s.';

$keys = ['one', 'two', 'three'];

Desired Output:

[
    'one' => 'standard dummy text ever since the',
    'two' => 'typesetting industry. Lorem Ipsum has been the industry\'s',
    'three' => '1500s.'
]

I tried writing a regular expression to handle this task, but without success.

Last attempt:

function getKeyedSections($text, $keys) {
    $keysArray = explode(',', $keys);
    $pattern = '/(?:' . implode('|', array_map('preg_quote', $keysArray)) . '):\s*(.*?)(?=\s*(?:' . implode('|', array_map('preg_quote', $keysArray)) . '):\s*|\z)/s';
    preg_match_all($pattern, $text, $matches);

    $keyedSections = [];
    foreach ($keysArray as $key) {
        foreach ($matches[1] as $index => $value) {
            if (stripos($matches[0][$index], $key) !== false) {
                $keyedSections[trim($key)] = trim($value);
                break;
            }
        }
    }

    return $keyedSections;
}

Solution

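  • Here is an approach using preg_match_all() that extracts every segment starting after any key and ending before the next key (or the end of the text). \K resets the start of the match, so the key itself is excluded from the full match while still being captured in group 1. With PREG_SET_ORDER, array_column($m, 0, 1) then builds the associative result keyed by group 1; because later entries overwrite earlier ones with the same key, only the last occurrence of each key is kept.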

    $text = "Lorem Ipsum is simply one dummy text of the printing and  two typesetting industry. Lorem Ipsum has been the industry's one standard dummy text ever since the three 1500s.";
    
    $keys = ['one', 'two', 'three'];
    
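    // Escape regex metacharacters in each key, then join the keys into one alternation.
    $escaped = implode('|', array_map('preg_quote', $keys));
    
    // Group 1 captures the key, \K drops it from the full match, and the lazy .*?
    // stops before the next key (or the end of the text), ignoring surrounding whitespace.
    preg_match_all('#\b(' . $escaped . ')\b\s*\K.*?(?=\s*(?:$|\b(?:' . $escaped . ')\b))#', $text, $m, PREG_SET_ORDER);
    
    // Index each full match (column 0) by its key (column 1); later duplicates overwrite earlier ones.
    var_export(array_column($m, 0, 1));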
    

    Output:

    array (
      'one' => 'standard dummy text ever since the',
      'two' => 'typesetting industry. Lorem Ipsum has been the industry\'s',
      'three' => '1500s.',
    )
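
Since the question asks for a function taking $text and $keys, here is a minimal sketch of how the snippet above could be wrapped that way. The name getKeyedSections() is borrowed from the question's attempt. The type declarations, the guard for an empty key list, the explicit '#' delimiter passed to preg_quote(), and the s modifier (so that a section may span line breaks) are assumptions added for illustration and are not part of the original answer. The arrow function assumes PHP 7.4 or later.

```php
<?php

/**
 * Hypothetical wrapper around the preg_match_all() approach shown above.
 * Returns [key => text following the last occurrence of that key].
 */
function getKeyedSections(string $text, array $keys): array
{
    // Assumption: with no keys there is nothing to look for. An empty
    // alternation would otherwise match the empty string everywhere.
    if ($keys === []) {
        return [];
    }

    // Escape each key, including the "#" delimiter used by the pattern.
    $escaped = implode('|', array_map(
        static fn(string $key): string => preg_quote($key, '#'),
        $keys
    ));

    // Same pattern as above, plus the "s" modifier (an addition here)
    // so that "." can also match line breaks inside a section.
    $pattern = '#\b(' . $escaped . ')\b\s*\K.*?(?=\s*(?:$|\b(?:' . $escaped . ')\b))#s';

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    // Each entry is [0 => section text, 1 => key]; a later entry with
    // the same key overwrites the earlier one.
    return array_column($matches, 0, 1);
}

$text = "Lorem Ipsum is simply one dummy text of the printing and  two typesetting industry. Lorem Ipsum has been the industry's one standard dummy text ever since the three 1500s.";

var_export(getKeyedSections($text, ['one', 'two', 'three']));
```

For the sample text this should produce the same array as the output shown above. Keys that never occur in the text are simply absent from the result, and matching stays case-sensitive unless the i modifier is added to the pattern.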