Tags: php, regex, preg-match-all, continue

Validate a string and return a dynamic number of isolated words


I would like to validate my input string and extract an unpredictable number of substrings from it -- with one regex pattern.

An example string:

location in [chambre, cuisine, salle-de-bain, jardin]

In a single step, I want to verify that the string has the shape word in [word, word, word...] and capture each word. (I want a single step for performance: my current code works in three steps, but it takes too long.)

My current regular expression is:

/([a-zA-Z]+)\s+in\s+\[\s*([a-zA-Z-]+)\s*(?:,\s*([a-zA-Z-]+)\s*)*\s*\]/

This captures location, chambre and jardin, but not cuisine or salle-de-bain.

$condition = 'location in [chambre, cuisine, salle-de-bain, jardin]';
preg_match('/([a-zA-Z]+)\s+in\s+\[\s*([a-zA-Z-]+)\s*(?:,\s*([a-zA-Z-]+)\s*)*\s*\]/', $condition, $matches);
var_dump($matches);
array:4 [
  0 => "location in [chambre, cuisine, salle-de-bain, jardin]"
  1 => "location"
  2 => "chambre"
  3 => "jardin"
]

I can't see what is wrong in my regular expression that makes it miss the two middle words. I only get the first and the last list items in the array.


Solution

  • In PHP, a repeated capturing group keeps only the last substring it captured, which is why the third group in the pattern above ends up holding jardin alone.
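
    The behaviour can be reproduced with a much smaller pattern. This is only a minimal sketch: the pattern and the comma-separated input are made up for illustration and are not taken from the question.

```php
<?php
// Illustrative input (an assumption, not the asker's string).
$input = 'chambre,cuisine,jardin';

// Group 1 sits inside a repeated non-capturing group, so each
// iteration overwrites the value captured by the previous one.
preg_match('/^(?:([a-z]+),?)+$/', $input, $m);

$whole = $m[0]; // the full match
$last  = $m[1]; // only the word captured by the final iteration

var_dump($whole, $last);
```

    Only the final iteration survives in group 1, so collecting every item needs either several matches (preg_match_all) or a separate splitting step.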

    You can use preg_match_all with a regex like

    [a-zA-Z]+(?=\s+in\s+\[\s*[a-zA-Z-]+(?:\s*,\s*[a-zA-Z-]+)*\s*])|(?:\G(?!^)\s*,\s*|(?<=[a-zA-Z])\s+in\s+\[\s*)\K[a-zA-Z-]+(?=(?:\s*,\s*[a-zA-Z-]+)*\s*])
    

    Details:

    • [a-zA-Z]+(?=\s+in\s+\[\s*[a-zA-Z-]+(?:\s*,\s*[a-zA-Z-]+)*\s*]) - one or more ASCII letters that are immediately followed by one or more whitespace chars, in, one or more whitespace chars, then [, zero or more whitespaces, one or more ASCII letters or hyphens, then zero or more repetitions of a comma enclosed with zero or more whitespaces followed by one or more ASCII letters or hyphens, then zero or more whitespaces and a ] char
    • | - or
    • (?:\G(?!^)\s*,\s*|(?<=[a-zA-Z])\s+in\s+\[\s*)\K[a-zA-Z-]+(?=(?:\s*,\s*[a-zA-Z-]+)*\s*]):
      • (?:\G(?!^)\s*,\s*|(?<=[a-zA-Z])\s+in\s+\[\s*) - either the end of the previous match followed by a comma enclosed with zero or more whitespaces, or a location immediately preceded by an ASCII letter, followed by one or more whitespaces, in, one or more whitespaces, [ and zero or more whitespaces
      • \K - discard the text matched so far, so that only what follows is returned as the match
      • [a-zA-Z-]+ - one or more ASCII letters or hyphens
      • (?=(?:\s*,\s*[a-zA-Z-]+)*\s*]) - a positive lookahead that requires zero or more repetitions of a comma enclosed with zero or more whitespaces and then one or more ASCII letters or hyphens, then zero or more whitespaces and a ] char.
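
    Put together, the pattern can be run against the string from the question as sketched below. The ~ delimiter and the variable names ($pattern, $words, $invalid) are choices made for this example. Note that the pattern is not anchored, so it checks the shape of a substring rather than of the whole input.

```php
<?php
// Pattern from the answer, split over two lines for readability.
// Using ~ as the delimiter avoids any extra escaping.
$pattern = '~[a-zA-Z]+(?=\s+in\s+\[\s*[a-zA-Z-]+(?:\s*,\s*[a-zA-Z-]+)*\s*])'
         . '|(?:\G(?!^)\s*,\s*|(?<=[a-zA-Z])\s+in\s+\[\s*)\K[a-zA-Z-]+(?=(?:\s*,\s*[a-zA-Z-]+)*\s*])~';

$condition = 'location in [chambre, cuisine, salle-de-bain, jardin]';

// Each successful match is one word; $matches[0] holds the whole list.
$count = preg_match_all($pattern, $condition, $matches);
$words = $matches[0];
print_r($words);

// A string without the closing ] fails every lookahead,
// so no word is returned at all.
$invalid = 'location in [chambre, cuisine';
$invalidCount = preg_match_all($pattern, $invalid, $ignored);
var_dump($invalidCount);
```

    Here $words contains location, chambre, cuisine, salle-de-bain and jardin, in that order. A match count of zero can be treated as a failed validation.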