I'm processing some text files and want to find certain tokens, along with some of the surrounding text for context. My problem is that a token is missed whenever it sits close enough to a preceding token to be swallowed by that token's context.
As an example and simplification, let's say I want to find every 5-digit number in some text, and also 20 characters before and after it to get some context.
First I tried something like:
<?php
$text = "Lorem ipsum 11111 dolor sit 22222 amet, consectetur 33333 adipiscing elit, sed do eiusmod tempor 1111 incididunt ut 11111 labore et dolore magna aliqua.";
$nmbrs_tmp = array();
preg_match_all("@.{0,19}[^\d](\d{5})[^\d].{0,19}@s", $text, $nmbrs_tmp);
print_r($nmbrs_tmp);
But it doesn't capture 22222, because that number has already been consumed as part of the context of the first match (11111):
//output Array ( [0] => Array ( [0] => Lorem ipsum 11111 dolor sit 22222 ame [1] => t, consectetur 33333 adipiscing elit, se [2] => 1111 incididunt ut 11111 labore et dolore ma ) [1] => Array ( [0] => 11111 [1] => 33333 [2] => 11111 ) )
Then I tried lookaheads and lookbehinds, but that has two problems: first, lookbehinds must be fixed length, and second, the context is no longer captured: "@(?<=.{0,19})[^\d](\d{5})[^\d](?=.{0,19})@s" //this won't work
Ideally, I would love something like this, where I capture every instance of 5-digit numbers, and also get all possible context:
//output Array ( [0] => Array ( [0] => Lorem ipsum 11111 dolor sit 22222 ame [1] => sum 11111 dolor sit 22222 amet, consectetur 3 [2] => 2 amet, consectetur 33333 adipiscing elit, se [3] => 1111 incididunt ut 11111 labore et dolore ma ) [1] => Array ( [0] => 11111 [1] => 22222 [2] => 33333 [3] => 11111 ) )
If there's just no way to do this with a regex, then I'm open to PHP solutions that involve going through the text multiple times or using more regexes.
Here is a method using match offsets to calculate relevant substrings:
<?php
$text = "99999 Lorem ipsum 11111 dolor sit 22222 amet, consectetur 33333 adipiscing elit, sed do eiusmod tempor 1111 incididunt ut 11111 labore et dolore magna aliqua. 99999";
$nmbrs_tmp = array();
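// Find each standalone 5-digit number and record its byte offset in $text.
preg_match_all("@\b\d{5}\b@s", $text, $nmbrs_tmp, PREG_OFFSET_CAPTURE);
foreach ($nmbrs_tmp[0] as $key => $field) {
    $offset = $field[1];
    // Start 20 characters before the match, but never before the start of the text.
    $start = ( $offset>=20 ? $offset-20 : 0 );
    // 45 = 20 before + 5 digits + 20 after; shorter when fewer than 20 characters precede the match.
    $length = $offset>=20 ? 45 : 45-(20-$offset);
    $nmbrs_tmp[0][$key][2] = substr( $text, $start, $length );
}
print_r($nmbrs_tmp);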
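First we simplify the regex to match just the 5-digit numbers. Your original regex would miss a number at the very beginning or end of the text, because [^\d] has to consume a character on each side. Note that \b is slightly stricter than [^\d]: it also rejects a number that directly touches a letter or underscore.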
Then we match with the PREG_OFFSET_CAPTURE flag, which turns each match into an array holding the matched string and its byte offset in $text.
Finally we use the returned offset to calculate the start and length of the desired substring. It doesn't matter if $length runs past the end of the input, because substr() simply stops at the end of the string.
The result is:
Array
(
[0] => Array
(
[0] => Array
(
[0] => 99999
[1] => 0
[2] => 99999 Lorem ipsum 11111 d
)
[1] => Array
(
[0] => 11111
[1] => 18
[2] => 99999 Lorem ipsum 11111 dolor sit 22222 ame
)
[2] => Array
(
[0] => 22222
[1] => 34
[2] => sum 11111 dolor sit 22222 amet, consectetur 3
)
[3] => Array
(
[0] => 33333
[1] => 58
[2] => 2 amet, consectetur 33333 adipiscing elit, se
)
[4] => Array
(
[0] => 11111
[1] => 122
[2] => 1111 incididunt ut 11111 labore et dolore ma
)
[5] => Array
(
[0] => 99999
[1] => 159
[2] => olore magna aliqua. 99999
)
)
)
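If you want the result in the shape from your question (contexts under [0], numbers under [1]), the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: find_numbers_with_context() and its $digits and $context parameters are names I made up for illustration, and I swapped \b for the lookarounds (?<!\d) and (?!\d), which are closer to your original [^\d] because they only refuse neighbouring digits and still match at the start and end of the text.

```php
<?php
/**
 * Hypothetical helper (name and parameters are illustrative only).
 * Returns [0 => contexts, 1 => numbers] for every run of exactly
 * $digits digits, with up to $context bytes of text on each side.
 */
function find_numbers_with_context($text, $digits = 5, $context = 20)
{
    // The lookarounds assert "no digit on either side" without consuming
    // a character, so neighbouring matches can share context freely.
    $pattern = '/(?<!\d)\d{' . (int) $digits . '}(?!\d)/';

    $matches = array();
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    $result = array(0 => array(), 1 => array());
    foreach ($matches[0] as $match) {
        $number = $match[0];
        $offset = $match[1];

        // Clamp the window start at 0; substr() stops at the end of the
        // string on its own, so the right edge needs no special case.
        $start  = max(0, $offset - $context);
        $length = ($offset - $start) + strlen($number) + $context;

        $result[0][] = substr($text, $start, $length);
        $result[1][] = $number;
    }
    return $result;
}

$text = "Lorem ipsum 11111 dolor sit 22222 amet, consectetur 33333 adipiscing elit, sed do eiusmod tempor 1111 incididunt ut 11111 labore et dolore magna aliqua.";
$found = find_numbers_with_context($text);
print_r($found);
```

With the text from your question this finds 11111, 22222, 33333 and the second 11111, and skips the 4-digit 1111. One caveat: the offsets from PREG_OFFSET_CAPTURE and the arguments to substr() are both counted in bytes, so on multibyte (for example UTF-8) text the window can be shorter than 20 characters or cut a character in half at either edge.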