Tags: php, regex, preg-match, pcre

RegEx for capturing groups between repeated words


The keywords are "*OR" and "*AND".

Suppose I have the string below:

This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.

I want the following groups:

group1 "This is a t3xt with special characters like !#."  
group2 "*AND"  
group3 "and this is another text with special characters"  
group4 "*AND"  
group5 "this repeats"  
group6 "*OR"  
group7 "do not repeat"  
group8 "*OR"  
group9 "have more strings"  
group10 "*AND"  
group11 "finish with this string."  

I have tried like this:

(.+?)(\*AND\*OR)

but it only captures the first string, so I would have to keep repeating the pattern to collect the others. The problem is that some strings have only one *AND, or only one *OR, or dozens of them; the count is pretty random. The regex below does not work either:

((.+?)(\*AND\*OR))+

For example, a string may contain just one keyword:

This is a t3xt with special characters like !#. *AND and this is another text with special characters


Solution

  • PHP has a preg_split function for this sort of thing. preg_split splits a string on a delimiter that you define as a regex pattern. In addition, its flags argument accepts PREG_SPLIT_DELIM_CAPTURE, which includes any parenthesized (captured) part of the delimiter in the split results.

    So, instead of writing a regex that matches the full text, you write a regex for the delimiter only.

    Example:

    $string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";
    // Split on *AND / *OR; the capture group plus PREG_SPLIT_DELIM_CAPTURE keeps each delimiter in the result.
    $parts = preg_split('~(\*(?:AND|OR))~', $string, 0, PREG_SPLIT_DELIM_CAPTURE);
    print_r($parts);
    

    Output:

    Array
    (
        [0] => This is a t3xt with special characters like !#. 
        [1] => *AND
        [2] =>  and this is another text with special characters 
        [3] => *AND
        [4] =>  this repeats 
        [5] => *OR
        [6] =>  do not repeat 
        [7] => *OR
        [8] =>  have more strings 
        [9] => *AND
        [10] =>  finish with this string.
    )
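
    The pieces above keep the spaces around each keyword, while the groups requested in the question have none. One possible variation, offered as a sketch rather than a definitive fix, lets the delimiter pattern swallow the surrounding whitespace with \s* (outside the capture group) and adds PREG_SPLIT_NO_EMPTY to drop any empty pieces. The variable name $parts is only illustrative.

    ```php
    <?php
    $string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";

    // \s* on both sides is matched as part of the delimiter but not captured,
    // so only "*AND" / "*OR" themselves are kept by PREG_SPLIT_DELIM_CAPTURE.
    $parts = preg_split(
        '~\s*(\*(?:AND|OR))\s*~',
        $string,
        -1, // no limit on the number of pieces
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    print_r($parts);
    ```

    With the sample string this should give the eleven pieces listed in the question, without leading or trailing spaces.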
    

    But if you really want to stick with the preg_match family, you need preg_match_all rather than preg_match (the function you tagged in your question). It works the same way, except that it performs global (repeated) matching.

    Example:

    $string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";
    // First alternative: a run of characters, none of which starts a delimiter.
    // Second alternative: the delimiter itself (*AND or *OR).
    preg_match_all('~(?:(?:(?!\*(?:AND|OR)).)+)|(?:\*(?:AND|OR))~', $string, $matches);
    print_r($matches);
    

    Output:

    Array
    (
        [0] => Array
            (
                [0] => This is a t3xt with special characters like !#. 
                [1] => *AND
                [2] =>  and this is another text with special characters 
                [3] => *AND
                [4] =>  this repeats 
                [5] => *OR
                [6] =>  do not repeat 
                [7] => *OR
                [8] =>  have more strings 
                [9] => *AND
                [10] =>  finish with this string.
            )
    
    )
    

    First, note that unlike preg_split, which returns a flat array, preg_match_all fills $matches with a multi-dimensional array; the full matches are in $matches[0]. Second, the pattern I used could technically be simplified a bit, but at the cost of having to reference multiple arrays in $matches (one for the matched text and another for the matched delimiters), which you would then have to loop through, referencing them alternately. In other words, there would be additional cleanup to get a final single array with both match sets, as above.
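
    The cleanup described in the previous paragraph might look roughly like the following. This is only a sketch of one possible simplified pattern, not necessarily the one the answer had in mind: group 1 lazily captures the text, and group 2 captures the keyword that ends it (or nothing at the very end of the string, via \z). The pattern, the trim() call, and the $parts name are assumptions for illustration, and the sketch assumes the string does not begin with a keyword.

    ```php
    <?php
    $string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";

    // Group 1: text up to the next keyword.
    // Group 2: the keyword, or an empty string at the end of the input.
    preg_match_all('~(.+?)(\*(?:AND|OR)|\z)~s', $string, $matches);

    // $matches[1] holds the texts and $matches[2] the keywords, index for index,
    // so they have to be interleaved by hand to get one flat list.
    $parts = [];
    foreach ($matches[1] as $i => $text) {
        $text = trim($text);
        if ($text !== '') {
            $parts[] = $text;
        }
        if ($matches[2][$i] !== '') {
            $parts[] = $matches[2][$i];
        }
    }

    print_r($parts);
    ```

    The extra loop is the overhead in question: the pattern is shorter, but the flat list has to be rebuilt manually.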

    I only show this method because you technically asked for it in your question. I recommend preg_split instead: it takes away a lot of this overhead, which is why it exists in the first place (to better solve scenarios like this).
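
    Since the question worries about strings with one keyword, dozens, or possibly none, here is a small check of how the preg_split approach behaves in the short cases. The helper name splitOnKeywords and the second test string are hypothetical, chosen only for this sketch.

    ```php
    <?php
    // Hypothetical helper wrapping the preg_split call; returns [] if preg_split fails.
    function splitOnKeywords(string $text): array
    {
        $parts = preg_split('~(\*(?:AND|OR))~', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
        return $parts === false ? [] : $parts;
    }

    // One keyword: text, delimiter, text.
    $one = splitOnKeywords("This is a t3xt with special characters like !#. *AND and this is another text with special characters");
    print_r($one);

    // No keyword at all: the whole string comes back as a single element.
    $none = splitOnKeywords("no keywords here");
    print_r($none);
    ```

    The same call covers every case, so there is no need to repeat the pattern once per expected keyword.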