Tags: php, regex, preg-match, text-extraction

Extract data with regex (with optional string before the match)


I have an array of strings. I am trying to extract the data in the parentheses, ( and ), from each string. The problem is that the middle value of the first element is not extracted when there is nothing else in front of it.

This is the code snippet with an indication of the needed/captured values:

<?php

$data = [
    'aaa|45.85[u]52.22 - 43.75 - 36.5[d]25.75',
// #1^^^       #2^^^^^ #3^^^^^        #4^^^^^
    'bbb|238.4[u]345.45 - 24.1[d]13.85 - 56.4[d]56'
// #1^^^       #2^^^^^^        #3^^^^^        #4^^
];

$new = [];

foreach ($data as $element)
{
    preg_match("#^(.*?)\|[\w\[\.]+\]?(.*?) - [\w\[\.]+\]?(.*?) - [\w\[\.]+\]?(.*?)$#", $element, $match);
    
    $string = $match[1];
    $num1 = $match[2];
    $num2 = $match[3];
    $num3 = $match[4];

    $new[$string] = [
        'num1' => $num1,
        'num2' => $num2,
        'num3' => $num3,
    ];
}

print_r($new);

?>

The code above should give me this result:

$new = [
    'aaa' => [
        'num1' => '52.22',
        'num2' => '43.75',
        'num3' => '25.75',
    ],

    'bbb' => [
        'num1' => '345.45',
        'num2' => '13.85',
        'num3' => '56',
    ]
];

But it gives me this:

$new = [
    'aaa' => [
        'num1' => '52.22',
        'num2' => '',
        'num3' => '25.75',
    ],

    'bbb' => [
        'num1' => '345.45',
        'num2' => '13.85',
        'num3' => '56',
    ]
];

Solution

  • See this demonstration of how your second [\w\[\.]+ character class is over-matching because dots and digits are greedily matched AND your capture group allows a zero-width match. https://regex101.com/r/zq6czS/1
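A minimal sketch of that failure, isolating a single segment of the question's pattern (the anchoring, sample segments, and variable names are illustrative assumptions, not part of the original code). When a segment has no bracketed prefix, the greedy class swallows the whole number and the lazy group is left with nothing:

```php
<?php
// One segment of the question's pattern, anchored here only for illustration.
$segment = '#^[\w\[\.]+\]?(.*?)$#';

// With a "float[letter]" prefix, the class stops before "]" and the group gets the value.
preg_match($segment, '24.1[d]13.85', $withPrefix);

// Without a prefix, the class consumes "43.75" entirely, so the lazy group matches zero characters.
preg_match($segment, '43.75', $withoutPrefix);

var_dump($withPrefix[1]);    // string(5) "13.85"
var_dump($withoutPrefix[1]); // string(0) ""
```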

    With only two sample strings, it is very hard to confidently suggest a truly optimized pattern, but I recommend seeking ways to use greedy quantifiers for improved performance.

    1. Before the first pipe, collect all characters that are not a pipe -- ([^|]+).
    2. To capture the non-whitespace substring after an optionally occurring "float then square-bracketed letter" prefix, again use a negated character class -- (?:[^\]]+\])?(\S+)

    The advice in #2 just repeats three times; delimited by "space hyphen space", of course.
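As a sketch, the two building blocks can also be assembled programmatically, which makes the "repeats three times" structure explicit. The variable names are illustrative, and the assembled string should be the same pattern as one written out by hand:

```php
<?php
// Optional "anything up to and including ]" prefix, then the captured non-whitespace value.
$segment = '(?:[^\]]+\])?(\S+)';

// Key before the pipe, then three segments delimited by "space hyphen space".
$pattern = '#^([^|]+)\|' . implode(' - ', array_fill(0, 3, $segment)) . '$#';

preg_match($pattern, 'aaa|45.85[u]52.22 - 43.75 - 36.5[d]25.75', $matches);

// $matches[1] through $matches[4] hold 'aaa', '52.22', '43.75', '25.75'.
var_export($matches);
```

Note that [^\]]+ can also match spaces and hyphens, so on the middle segment of the first string the engine first tries to reach a later "]" and then backtracks to treat the prefix as absent.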

    Code:

    $data = [
        'aaa|45.85[u]52.22 - 43.75 - 36.5[d]25.75',
        'bbb|238.4[u]345.45 - 24.1[d]13.85 - 56.4[d]56'
    ];
    
    $result = [];
    foreach ($data as $element) {
        if (preg_match("#^([^|]+)\|(?:[^\]]+\])?(\S+) - (?:[^\]]+\])?(\S+) - (?:[^\]]+\])?(\S+)$#", $element, $matches)) {
            unset($matches[0]);
            $result[array_shift($matches)] = array_combine(['num1', 'num2', 'num3'], $matches);
        }
    }
    var_export($result);
    

    Once you have your 5-element matches array, remove the full-string match ($matches[0]), then shift off the new first element and use it as the first-level key; the remaining three elements become the values of the subarray.
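For comparison, here is a sketch of the same assembly without array functions, using array destructuring instead. It assumes PHP 7.1 or later for the short list syntax, and the variable names $key, $num1, $num2, $num3 are illustrative:

```php
<?php
$data = [
    'aaa|45.85[u]52.22 - 43.75 - 36.5[d]25.75',
    'bbb|238.4[u]345.45 - 24.1[d]13.85 - 56.4[d]56',
];

$pattern = '#^([^|]+)\|(?:[^\]]+\])?(\S+) - (?:[^\]]+\])?(\S+) - (?:[^\]]+\])?(\S+)$#';

$result = [];
foreach ($data as $element) {
    if (preg_match($pattern, $element, $matches)) {
        // Skip the full-string match at index 0 and name the four captures.
        [, $key, $num1, $num2, $num3] = $matches;
        $result[$key] = ['num1' => $num1, 'num2' => $num2, 'num3' => $num3];
    }
}
var_export($result);
```

Strings that do not match the pattern are skipped silently in this sketch; whether that is the right behavior depends on the real data.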