Tags: php, regex, validation, text-extraction, alternation

Match numeric substring which is preceded or followed by one of two specific substrings


I have a program that extracts an amount from a string when the amount is followed by Kč or CZK. How do I edit the expression (pattern) so that it also matches when Kč or CZK is in front of the number? See $string1 and $string2:

$string='Rohlík 4,99 Kč 51235';
//$string1='Rohlík CZK 4,99 51235';
//$string2='Rohlík Kč4,99 51235';

$replace = [' ', '.'];

$string = str_replace($replace,"",$string);

$string = str_replace(',',".",$string);


/*Change?*/

$pattern = '/[0-9]*[.]?[0-9]*[Kč,CZK]/';
preg_match($pattern, $string, $matches); // => 4.99 Kč
$string = $matches;

$pattern = '/[0-9]*[.]?[0-9]*/';
preg_match($pattern, $string[0], $matches);

$price = $matches[0];
print_r($price); // => 4.99

Solution

  • Use alternation inside a non-capturing group so that the pattern matches the label whether it comes before or after the targeted number. Replacing the comma with a dot can be done after this step.

    Code:

    $strings = [
        'Rohlík 4,99 Kč 51235',
        'Rohlík CZK 4,99 51235',
        'Rohlík Kč4,99 51235',
        'Rohlík foo4,99 51235'
    ];
    
    foreach ($strings as $string) {
        var_export(
            preg_match('/\b(?:(?:Kč|CZK) ?\K\d+(?:,\d+)?|\d+(?:,\d+)?(?= ?(?:Kč|CZK)))\b/u', $string, $m)
            ? $m[0]
            : 'not found'
        );
        echo "\n";
    }
    

    Output:

    '4,99'
    '4,99'
    '4,99'
    'not found'
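
    The comma-to-dot replacement mentioned above can then be applied to the matched substring only. The following is a minimal sketch, assuming the goal is a float price; the helper name `extractPrice` is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Returns the labelled price as a float, or null when no Kč/CZK-labelled number is found.
function extractPrice(string $text): ?float
{
    // Same pattern as in the answer: label before the number, or label after it.
    $pattern = '/\b(?:(?:Kč|CZK) ?\K\d+(?:,\d+)?|\d+(?:,\d+)?(?= ?(?:Kč|CZK)))\b/u';

    if (preg_match($pattern, $text, $m) !== 1) {
        return null;
    }

    // Normalise the decimal comma only after the match has been isolated.
    return (float) str_replace(',', '.', $m[0]);
}

var_export(extractPrice('Rohlík Kč4,99 51235'));
echo "\n";
```

    Doing the replacement after matching means the input string no longer needs to be stripped of spaces and dots beforehand, so the trailing number (51235) cannot run into the price.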
    

    Pattern Breakdown:

    /                     #starting pattern delimiter
      \b                  #word boundary so the match cannot start in the middle of a word
      (?:                 #start non-capturing group 1
        (?:Kč|CZK) ?      #non-capturing group 2 requiring one of the two labels, optionally followed by a space
        \K                #reset the start of the reported match, discarding the label and space
        \d+(?:,\d+)?      #match the targeted integer/float value with a comma as the decimal separator
        |                 #OR
        \d+(?:,\d+)?      #match the targeted integer/float value with a comma as the decimal separator
        (?= ?(?:Kč|CZK))  #lookahead for an optional space followed by one of the two labels
      )                   #close non-capturing group 1
      \b                  #word boundary so the match cannot end in the middle of a word
    /                     #ending pattern delimiter
    u                     #unicode flag: treat pattern and subject as UTF-8
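
    The breakdown can also be kept inside the code itself by adding the x (extended) modifier, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. The following is a sketch of the same pattern in that form; the function name `matchPriceVerbose` is hypothetical. Because whitespace is ignored in x mode, the optional literal space is written as `[ ]?`, and the comments avoid the `/` delimiter and single quotes, which would end the pattern or the PHP string early.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Returns the matched number as a string (comma kept), or null when nothing matches.
function matchPriceVerbose(string $text): ?string
{
    $pattern = '/
        \b                          # match must start on a word boundary
        (?:
            (?:Kč|CZK) [ ]?         # label first, optionally followed by a space
            \K                      # drop the label from the reported match
            \d+ (?:,\d+)?           # integer or comma-decimal number
          |
            \d+ (?:,\d+)?           # integer or comma-decimal number
            (?= [ ]? (?:Kč|CZK) )   # label must follow, but is not consumed
        )
        \b                          # match must end on a word boundary
    /xu';

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

var_export(matchPriceVerbose('Rohlík CZK 4,99 51235'));
echo "\n";
```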