Different regex preg_match_all results in live test and my script

I have the following string:

{ Author = {Smith, John and James, Paul and Hanks, Tom}, Title = {{Some title}}, Journal = {{Journal name text}}, Year = {{2022}}, Volume = {{10}}, Number = {{11}}, Month = {{DEC}}, Abstract = {{Abstract text abstract text, abstract. Abstract text - abstract text? Abstract text! Abstract text abstract text abstract text abstract text abstract text abstract text abstract text abstract text, abstract text. Abstract text abstract text abstract text abstract text abstract text.}}, DOI = {{10.3390/ijms19113496}}, Article-Number = {{1234}}, ISSN = {{1234-5678}}, ORCID-Numbers = {{}}, Unique-ID = {{ISI:1234567890}}, }

My goal is to get these values into an associative array. I'm trying this regex:

/([a-zA-Z0-9\-\_]+)\s*=\s*(\{(.*)\}|\d{4})/

using preg_match_all without additional arguments (just the regex, input and output). While it works correctly on online testers like this, it does not return all the values in my .php script, just some of them. In particular, the abstract and author are never matched. I tried changing modifiers (currently using U, which makes matching non-greedy by default), but that does not solve my problem. Any help is very much appreciated.

Solution

Change your pattern from this:

/([a-zA-Z0-9\-\_]+)\s*=\s*(\{(.*)\}|\d{4})/

To this:

/([a-zA-Z0-9\-\_]+)\s*=\s*(\{[^}]+\}|\d{4})/

Or in code:

$s = '{Author = {Smith, John and James, Paul and Hanks, Tom}, Title = {{Some title}}, Journal = {{Journal name text}}, Year = {{2022}}, Volume = {{10}}, Number = {{11}}, Month = {{DEC}}, Abstract = {{Abstract text abstract text, abstract. Abstract text - abstract text? Abstract text! Abstract text abstract text abstract text abstract text abstract text abstract text abstract text abstract text, abstract text. Abstract text abstract text abstract text abstract text abstract text.}}, DOI = {{10.3390/ijms19113496}}, Article-Number = {{1234}}, ISSN = {{1234-5678}}, ORCID-Numbers = {{}}, Unique-ID = {{ISI:1234567890}}, }';
$p = '/(\b[-\w]+)\s*=\s*(\{([^}]+)\}|\d{4})/';

preg_match_all($p, $s, $m);
print_r($m);

Sandbox

This will get you closer, but it needs a bit more refinement. What was happening is that you were matching the first { with the last }, because .* is greedy: it consumes as many characters as it can while still allowing the overall pattern to match.

You can get a similar result to the \{[^}]+\} above by simply making the original non-greedy, that is \{(.*?)\} instead of \{(.*)\}, but I don't think it reads as well.

Output

 ...
[1] => Array
    (
        [0] => Author
        [1] => Title
        [2] => Journal
 ...

[2] => Array
    (
        [0] => {Smith, John and James, Paul and Hanks, Tom}
        [1] => {{Some title} //<--- lost }
        [2] => {{Journal name text} //<--- lost }

The simplest thing to do here is to add an optional \{? and \}? to the pattern; then at least you can collect the full tags:

  //note the \{\{? and \}?\}
  $p = '/(\b[-\w]+)\s*=\s*(\{\{?([^}]+)\}?\}|\d{4})/';

This changes the 2 index to this:

[2] => Array
    (
        [0] => {Smith, John and James, Paul and Hanks, Tom}
        [1] => {{Some title}}
        [2] => {{Journal name text}}

But as there is no example of the desired results, that's as far as I can go.
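If the goal is a plain key => value array, the two capture groups can be combined. This is a minimal sketch, assuming a shortened sample string of the same shape as the one above and assuming that the text between the braces (group 3) is the value you want:

```php
<?php
// Shortened sample input (an assumption; same shape as the original string).
$s = '{Author = {Smith, John and James, Paul}, Title = {{Some title}}, Year = {{2022}}, }';
$p = '/(\b[-\w]+)\s*=\s*(\{\{?([^}]+)\}?\}|\d{4})/';

preg_match_all($p, $s, $m);

// $m[1] holds the keys, $m[3] the text between the braces.
$result = array_combine($m[1], $m[3]);

print_r($result);
```

Use $m[2] instead of $m[3] if you would rather keep the surrounding braces in the values.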

As an aside:

Another (non-regex) way to do this would be to trim the outer {}, explode the string on the "}," separator, then loop over the pieces and split each one on the =, fiddling a bit with the format.

Something like this:

$s = '{Author = {Smith, John and James, Paul and Hanks, Tom}, Title = {{Some title}}, Journal = {{Journal name text}}, Year = {{2022}}, Volume = {{10}}, Number = {{11}}, Month = {{DEC}}, Abstract = {{Abstract text abstract text, abstract. Abstract text - abstract text? Abstract text! Abstract text abstract text abstract text abstract text abstract text abstract text abstract text abstract text, abstract text. Abstract text abstract text abstract text abstract text abstract text.}}, DOI = {{10.3390/ijms19113496}}, Article-Number = {{1234}}, ISSN = {{1234-5678}}, ORCID-Numbers = {{}}, Unique-ID = {{ISI:1234567890}}, }';

function f($s,$o=[]){$e=array_map(function($v)use(&$o){if(strlen($v))$o[]=preg_split("/\s*=\s*/",$v."}");},explode('},',trim($s,'}{')));return$o;}

print_r(f($s));

Output

Array
(
    [0] => Array
        (
            [0] => Author
            [1] => {Smith, John and James, Paul and Hanks, Tom}
        )

    [1] => Array
        (
            [0] =>  Title
            [1] => {{Some title}}
        )

    [2] => Array
        (
            [0] =>  Journal
            [1] => {{Journal name text}}
        )
   ...

Sandbox

Uncompressed version:

/* uncompressed */
function f($s, $o=[]){
    $e = array_map(
        function($v) use (&$o){
            if(strlen($v)) $o[] = preg_split("/\s*=\s*/", $v."}");
        },
        //could use preg_split for more flexibility, e.g. '/\s*\}\s*,\s*/'
        explode(
            '},',
            trim($s, '}{')
        )
    );
    return $o;
}

It's not as "robust" a solution, but if the format is always like the example it may be sufficient. It looks cool anyway. The output format is a bit better, but you could do array_combine($m[1],$m[2]) to fix the Regex version.

You can also feed it an array and it will append to it, for example:

print_r(f($s,[["foo","{bar}"]]));

Output:

Array
(
[0] => Array
    (
        [0] => foo
        [1] => {bar}
    )

[1] => Array
    (
        [0] => Author
        [1] => {Smith, John and James, Paul and Hanks, Tom}
    )

Then if you want other formats:

//$a holds the array returned above
$a = f($s,[["foo","{bar}"]]);

//get an array of keys  ['foo', 'Author']
print_r(array_column($a,0));

//get an array of values ['{bar}', '{Smith, John ...}']
print_r(array_column($a,1));

//get an array with keys=>values ['foo'=>'{bar}', 'Author'=>'{Smith, John ...}']
print_r(array_column($a,1,0));

Which of course you could bake right into the function return.
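As a rough sketch of what that could look like: the function name fieldsToMap is made up for illustration, the sample string is shortened, and the extra trim() and the split limit of 2 are my own additions so that empty chunks are skipped and a value containing = is not split again.

```php
<?php
// Hypothetical variant of f() that returns key => value directly.
function fieldsToMap(string $s): array
{
    $pairs = [];
    foreach (explode('},', trim($s, '}{')) as $chunk) {
        $chunk = trim($chunk);
        if ($chunk === '') {
            continue; // skip the empty remainder after the last "},"
        }
        // Put back the } that explode() removed, then split on the first "=" only.
        $pairs[] = preg_split('/\s*=\s*/', $chunk . '}', 2);
    }
    // Column 1 becomes the value, column 0 the key.
    return array_column($pairs, 1, 0);
}

$s = '{Author = {Smith, John}, Title = {{Some title}}, }';
print_r(fieldsToMap($s));
```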

Anyway it was fun, enjoy.

UPDATE

The regex (\{[^}]+\}|\d{4}) means this:

(...) capture group, captures all matches enclosed in ( and )
\{ match { literally
[^}]+ match anything not a } one or more times
\} match } literally
| or
\d{4} match a digit (0-9) exactly 4 times.

Basically, the problem with \{(.*)\} compared to \{[^}]+\} is that .* also matches } and {, and because it is greedy (there is no trailing ?, as in \{(.*?)\}) it matches as much as it can. Given fname={foo}, lname={bar}, it matches everything from the first { to the last }, that is {foo}, lname={bar}. The version with the negated class only matches up to the first }, because [^}]+ cannot consume the } in foo}; that character is matched by \} instead, which completes the pattern.
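To see the difference side by side, here is a small sketch that runs the three variants against the fname/lname string (the variable names are only for illustration):

```php
<?php
$s = 'fname={foo}, lname={bar}';

preg_match_all('/\{(.*)\}/', $s, $greedy);      // greedy .*
preg_match_all('/\{(.*?)\}/', $s, $lazy);       // lazy .*?
preg_match_all('/\{([^}]+)\}/', $s, $negated);  // negated character class

print_r($greedy[1]);   // one match: "foo}, lname={bar"
print_r($lazy[1]);     // two matches: "foo" and "bar"
print_r($negated[1]);  // two matches: "foo" and "bar"
```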

A word on Lexing

Nesting can be really difficult for regex. As I said in the comments, a lexer is better. What that involves is, instead of matching one large pattern like /([a-zA-Z0-9\-\_]+)\s*=\s*(\{[^}]+\}|\d{4})/, you match smaller patterns like this:

[
  '(?P<T_WORD>\w+)',          //matches a-zA-Z0-9_
  '(?P<T_OPEN_BRACKET>\{)',   //matches {
  '(?P<T_CLOSE_BRACKET>\})',  //matches }
  '(?P<T_EQUAL>=)',           //matches =
  '(?P<T_WHITESPACE>\s+)',    //matches whitespace (space, \t, \r, \n)
  '(?P<T_EOF>\z)',            //matches end of string
];

You can put these together with alternation (|):

  "(?P<T_WORD>\w+)|(?P<T_OPEN_BRACKET>\{)|(?P<T_CLOSE_BRACKET>\})|(?P<T_EQUAL>=)|(?P<T_WHITESPACE>\s+)|(?P<T_EOF>\z)",

The (?P<name>..) is a named capture group, just makes things easier. Instead of just matches like:

[
   1 => [ 0 => 'Title', 1 => ''],
]

You will also have this:

[
   1 => [ 0 => 'Title', 1 => ''],
   'T_WORD' => [ 0 => 'Title', 1 => '']
]

It makes it easier to assign the token name back to the match.

Anyway, the goal at this stage would be to get an array (eventually) of "tokens", keyed by the match name, something like this for e.g. Title = {{Some title}}

  //token stream
 [
    'T_WORD' => 'Title',   //keyword
    'T_WHITESPACE' => ' ', //ignore
    'T_EQUAL' => '=',      //instruction to end key,
    'T_WHITESPACE' => ' ', //ignore
    'T_OPEN_BRACKET' => '{', //inc a counter for open brackets
    'T_OPEN_BRACKET' => '{', //inc a counter for open brackets
    'T_WORD' => 'Some',      //capture as value
    'T_WHITESPACE' => ' ',   //capture as value
    'T_WORD' => 'title',     //capture as value
    'T_CLOSE_BRACKET' => '}', //dec a counter for open brackets
    'T_CLOSE_BRACKET' => '}', //dec a counter for open brackets
   ]
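A token stream like this can be produced with preg_match_all and PREG_SET_ORDER. The following is only a sketch: the function name tokenize is made up, each token is stored as a [name, text] pair (a PHP array cannot repeat the same string key), T_EOF is left out, and T_OTHER is a catch-all I added for characters such as commas.

```php
<?php
// Hypothetical helper; the token names follow the list above.
function tokenize(string $input): array
{
    $pattern = '/(?P<T_WORD>\w+)'
        . '|(?P<T_OPEN_BRACKET>\{)'
        . '|(?P<T_CLOSE_BRACKET>\})'
        . '|(?P<T_EQUAL>=)'
        . '|(?P<T_WHITESPACE>\s+)'
        . '|(?P<T_OTHER>.)/s'; // added catch-all, not part of the list above

    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $match) {
        foreach ($match as $name => $text) {
            // Named groups appear under string keys; the one that matched is non-empty.
            if (is_string($name) && $text !== '') {
                $tokens[] = [$name, $text];
                break;
            }
        }
    }
    return $tokens;
}

foreach (tokenize('Title = {{Some title}}') as [$name, $text]) {
    echo $name, ' => "', $text, '"', PHP_EOL;
}
```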

This should be pretty straightforward, but the key difference is that in pure regex you can't count the { and }, so you have no way to verify the syntax of the string; it either matches or it doesn't.

With the lexer version, you can count these things and act appropriately. This is because you can iterate through the token matches and "test" the string. For example, we can say these things:

A word followed by an = is an attribute name. Anything opened with one or two { must be closed with the same number of }, and anything between the { and } other than the braces themselves is the "information" we need. Ignore any whitespace outside of our {} pairs, and so on. It gives us the granularity we need to validate this type of data.
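Those rules could be turned into code roughly as follows. This is a sketch under several assumptions, not a full BibTeX parser: parseFields and the token names are hypothetical, the outer wrapping braces of the input are assumed to be trimmed already, braces only move the depth counter (they are not copied into the value), and invalid input raises an InvalidArgumentException.

```php
<?php
// Hypothetical sketch of a token-driven parser for "Key = {value}" fields.
function parseFields(string $input): array
{
    $pattern = '/(?P<T_WORD>[-\w]+)|(?P<T_OPEN>\{)|(?P<T_CLOSE>\})|(?P<T_EQUAL>=)|(?P<T_OTHER>.)/s';
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

    $fields = [];
    $depth = 0;        // number of currently open braces
    $lastWord = null;  // most recent word seen outside braces
    $key = null;       // attribute name, set when "=" follows a word
    $value = '';

    foreach ($matches as $match) {
        // Work out which named group matched this token.
        $type = null;
        foreach ($match as $name => $groupText) {
            if (is_string($name) && $groupText !== '') {
                $type = $name;
                break;
            }
        }
        $text = $match[0];

        if ($depth > 0) {
            // Inside braces: braces move the counter, everything else is value text.
            if ($type === 'T_OPEN') {
                $depth++;
            } elseif ($type === 'T_CLOSE') {
                if (--$depth === 0) {
                    $fields[$key] = $value;
                    $key = $lastWord = null;
                    $value = '';
                }
            } else {
                $value .= $text;
            }
            continue;
        }

        // Outside braces: words and "=" build the key, the rest is ignored.
        if ($type === 'T_WORD') {
            $lastWord = $text;
        } elseif ($type === 'T_EQUAL') {
            if ($lastWord === null) {
                throw new InvalidArgumentException('"=" without an attribute name');
            }
            $key = $lastWord;
        } elseif ($type === 'T_OPEN') {
            if ($key === null) {
                throw new InvalidArgumentException('"{" without an attribute name');
            }
            $depth = 1;
        } elseif ($type === 'T_CLOSE') {
            throw new InvalidArgumentException('Unexpected "}"');
        }
    }

    if ($depth !== 0) {
        throw new InvalidArgumentException("Missing $depth closing brace(s)");
    }
    return $fields;
}

print_r(parseFields('Author = {Smith, John}, Title = {{Some title}}'));
```

Whether nested braces should be kept in the value or dropped is a design choice; here they are dropped to keep the sketch short.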

I mention this because even the example I gave you, /(\b[-\w]+)\s*=\s*(\{\{?([^}]+)\}?\}|\d{4})/, will fail on strings like this:

 Author = {Smith, John and James, {Paul and Hanks}, Tom}

In which it would return matches for

 Author 
{Smith, John and James, {Paul and Hanks}

Another example is an input that is invalid but does not cause a failure:

Title = {{Some title}, Journal = {{Journal name text}}

Which will give matches like this:

Title 
Some title
//and
Journal 
Journal name text

This looks correct, but it is not, because {{Some title} is missing a }. What you do about invalid syntax in your string is up to you, but in the regex version we have no control over that. I should mention that even a recursive regex ("match pairs of brackets") will fail here, returning something like:

{{Some title}, Journal = {{Journal name text}

But in a lexer version we can increment a counter { +1 { +1 then the word Some title then } -1 and we are left with a 1 instead of a 0. So in our code we know that we are missing a } where one should be.
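The counting itself is only a few lines. A minimal sketch, assuming a made-up helper name openBraceCount and treating every { and } in the input as a token:

```php
<?php
// Hypothetical helper: returns how many braces are still open at the end of
// the input, or -1 as soon as a } appears with nothing open.
function openBraceCount(string $input): int
{
    preg_match_all('/[{}]/', $input, $matches);

    $depth = 0;
    foreach ($matches[0] as $brace) {
        $depth += ($brace === '{') ? 1 : -1;
        if ($depth < 0) {
            return -1;
        }
    }
    return $depth;
}

var_dump(openBraceCount('Title = {{Some title}}'));                                 // int(0)
var_dump(openBraceCount('Title = {{Some title}, Journal = {{Journal name text}}')); // int(1)
```

A result other than 0 tells you the string is malformed; a fuller lexer would also record the position where the count went wrong.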

Below are some examples of lexers I have written (there is even an empty one in there)

https://github.com/ArtisticPhoenix/MISC/tree/master/Lexers

It's much harder to implement a lexer (even a basic one) than a pure regex solution, but it will be easier to work with and maintain in the future. I hope that explains the difference between matching and lexical analysis.

Essentially, with a big complex pattern, all that complexity is baked into the pattern, making it difficult to change. With smaller patterns, the complexity emerges from the way the tokens are parsed (your code instructions), making it much easier to adjust for edge cases, etc.

Good Luck!