php regex pcre

Match token and its context with possible overlapping


I'm processing some text files and want to find certain tokens, along with some of the text surrounding them to get some context. My problem is that I miss a token whenever it is close enough to a preceding token to be swallowed by that token's context.

As an example and simplification, let's say I want to find every 5-digit number in some text, and also 20 characters before and after it to get some context.

First I tried something like:

<?php
$text = "Lorem ipsum 11111 dolor sit 22222 amet, consectetur 33333 adipiscing elit, sed do eiusmod tempor 1111 incididunt ut 11111 labore et dolore magna aliqua.";
$nmbrs_tmp = array();
preg_match_all("@.{0,19}[^\d](\d{5})[^\d].{0,19}@s", $text, $nmbrs_tmp);
print_r($nmbrs_tmp);

But it doesn't capture 22222, because that number has already been consumed by the first match (11111 and its context):

//output
Array
(
    [0] => Array
        (
            [0] => Lorem ipsum 11111 dolor sit 22222 ame
            [1] => t, consectetur 33333 adipiscing elit, se
            [2] =>  1111 incididunt ut 11111 labore et dolore ma
        )

    [1] => Array
        (
            [0] => 11111
            [1] => 33333
            [2] => 11111
        )

)

Then I tried lookaheads and lookbehinds, but there are two problems: first, lookbehinds must be fixed-length, and second, the context is no longer captured:

"@(?<=.{0,19})[^\d](\d{5})[^\d](?=.{0,19})@s" //this won't work

Ideally, I would love something like this, where I capture every instance of 5-digit numbers, and also get all possible context:

//output
Array
(
    [0] => Array
        (
            [0] => Lorem ipsum 11111 dolor sit 22222 ame
            [1] => sum 11111 dolor sit 22222 amet, consectetur 3
            [2] => 2 amet, consectetur 33333 adipiscing elit, se
            [3] =>  1111 incididunt ut 11111 labore et dolore ma
        )

    [1] => Array
        (
            [0] => 11111
            [1] => 22222
            [2] => 33333
            [3] => 11111
        )

)

If there's just no way to do this with a regex, then I'm open to PHP solutions that involve going through the text multiple times or using more regexes.


Solution

  • Here is a method using match offsets to calculate relevant substrings:

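    <?php
    $text = "99999 Lorem ipsum 11111 dolor sit 22222 amet, consectetur 33333 adipiscing elit, sed do eiusmod tempor 1111 incididunt ut 11111 labore et dolore magna aliqua. 99999";
    $nmbrs_tmp = array();
    // Match only the 5-digit numbers; PREG_OFFSET_CAPTURE adds each match's byte offset.
    preg_match_all("@\b\d{5}\b@", $text, $nmbrs_tmp, PREG_OFFSET_CAPTURE);
    
    foreach ($nmbrs_tmp[0] as $key => $field) {
        $offset = $field[1];
        // Start 20 characters before the number, or at 0 if there are fewer than 20.
        $start = ( $offset>=20 ? $offset-20 : 0 );
        // 20 before + 5 digits + 20 after = 45, shortened when the left context is clipped.
        $length = $offset>=20 ? 45 : 45-(20-$offset);
        // Store the context as a third element next to the match and its offset.
        $nmbrs_tmp[0][$key][2] = substr( $text, $start, $length );
    }
    
    print_r($nmbrs_tmp);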
    

    First we simplify the regex so that it matches only the 5-digit numbers themselves (your original regex requires a non-digit character on both sides, so it would miss numbers at the very beginning or end of the text).

    Then we match, passing the PREG_OFFSET_CAPTURE flag.

    Finally we use the returned offset to calculate the start and length of the desired substring, and store that substring as a third element of each match. If $length runs past the end of the input, substr() simply stops at the end of the string, so no adjustment is needed on that side.

    The result is:

    Array
    (
        [0] => Array
            (
                [0] => Array
                    (
                        [0] => 99999
                        [1] => 0
                        [2] => 99999 Lorem ipsum 11111 d
                    )
    
                [1] => Array
                    (
                        [0] => 11111
                        [1] => 18
                        [2] => 99999 Lorem ipsum 11111 dolor sit 22222 ame
                    )
    
                [2] => Array
                    (
                        [0] => 22222
                        [1] => 34
                        [2] => sum 11111 dolor sit 22222 amet, consectetur 3
                    )
    
                [3] => Array
                    (
                        [0] => 33333
                        [1] => 58
                        [2] => 2 amet, consectetur 33333 adipiscing elit, se
                    )
    
                [4] => Array
                    (
                        [0] => 11111
                        [1] => 122
                        [2] =>  1111 incididunt ut 11111 labore et dolore ma
                    )
    
                [5] => Array
                    (
                        [0] => 99999
                        [1] => 159
                        [2] => olore magna aliqua. 99999
                    )
    
            )
    
    )
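
As a possible refinement, the offset approach above can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name `find_with_context` and its parameters `$digits` and `$radius` are made up for illustration, and the pattern swaps `\b` for the lookarounds `(?<!\d)` and `(?!\d)`, which is closer to the `[^\d]` intent of the question because a number directly next to a letter or underscore (for example `a12345b`) is still found.

```php
<?php
// Hypothetical helper, not part of the original answer: returns every
// $digits-digit number in $text with up to $radius bytes of context per side.
function find_with_context(string $text, int $digits = 5, int $radius = 20): array
{
    // Fixed-length lookarounds: the number must not touch another digit,
    // but it may sit at the very start or end of the text.
    $pattern = '/(?<!\d)\d{' . $digits . '}(?!\d)/';
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    $results = [];
    foreach ($matches[0] as [$token, $offset]) {
        // Clamp the window at the start of the text.
        $start = max(0, $offset - $radius);
        // Left context actually available + the token + right context.
        $length = ($offset - $start) + strlen($token) + $radius;
        $results[] = [
            'token'   => $token,
            'offset'  => $offset,
            // substr() stops at the end of the string if $length overshoots.
            'context' => substr($text, $start, $length),
        ];
    }
    return $results;
}

$text = "Lorem ipsum 11111 dolor sit 22222 amet, consectetur 33333 adipiscing elit, sed do eiusmod tempor 1111 incididunt ut 11111 labore et dolore magna aliqua.";
foreach (find_with_context($text) as $hit) {
    printf("%s @ %d: [%s]\n", $hit['token'], $hit['offset'], $hit['context']);
}
```

Because each context is cut from the original string rather than consumed by the regex, overlapping contexts are not a problem. Note that `PREG_OFFSET_CAPTURE` reports byte offsets and `substr()` also counts bytes, so the two agree with each other, but with multibyte UTF-8 text a context window may start or end in the middle of a character.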