How would I go about implementing Boyer-Moore Search for streams? I understand how to implement this for a given string where we know the length of the entire string. However, what if we have no idea the size of the string (i.e it's an arbitrary length stream of bytes).
I'm trying to implement this in PHP so any code in PHP would be helpful.
Here's the implementation of Boyer-Moore Search I have in PHP:
function BoyerMooreSearch($haystack, $needle) {
$needleLen = strlen($needle);
$haystackLen = strlen($haystack);
$table = computeSkipTable($needle);
for ($i = $needleLen - 1; $i < $haystackLen;) {
$t = $i;
for ($j = $needleLen - 1; $needle[$j] == $haystack[$i]; $j--, $i--) {
if($j == 0) {
return $i;
}
}
$i = $t;
if(array_key_exists($haystack[$i], $table)) {
$i = $i + max($table[$haystack[$i]], 1);
} else {
$i += $needleLen;
}
}
return false;
}
function computeSkipTable($string) {
$len = strlen($string);
$table = [];
for ($i=0; $i < $len; $i++) {
$table[$string[$i]] = $len - $i - 1;
}
return $table;
}
This works fine if we give it a haystack string like "barfoobazquix"
and a needle string like "foo"
, it will return 3
as expected. However, what if the input haystack were a stream split into 4 byte chunks. The first chunk would be "barf"
, which would return no match, the second would be "ooba"
which also returns no match, and so on...
In this scenario we'd never be able to find the substring "foo"
in any of the buffered chunks of the stream with the current implementation. I'm struggling to adapt the current implementation such that it would work even if the search string were split across multiple chunks.
Note that you only ever use the $needleLen
most recent characters from the stream. So you can maintain a sliding window consisting of $needleLen
characters and advance it as needed. Furthermore, $haystackLen
is now unknown and should be replaced with EOF checks.
The code below is inefficient because sliding the window always takes O(n)
regardless if we slide it by n
characters or just 1. To achieve optimal sliding complexity, the $window
should be implemented as a circular character buffer.
// sample call
var_dump(BoyerMooreSearch(STDIN, 'foo'));
function BoyerMooreSearch($resource, $needle) {
$needleLen = strlen($needle);
$table = computeSkipTable($needle);
// build the first window
if (($window = buildWindow($resource, $needleLen)) === false) {
// special case: the text is shorter than the pattern
return false;
}
$i = 0;
while (true) {
// check for a match
$j = $needleLen - 1;
while ($j >= 0 && $needle[$j] == $window[$j]) {
$j--;
}
if ($j < 0) {
return $i;
}
// determine slide distance
$t = $i;
$last = substr($window, -1);
if (array_key_exists($last, $table)) {
$i = $i + max($table[$last], 1);
} else {
$i += $needleLen;
}
// slide the window
if (($window = slideWindow($window, $resource, $i - $t)) === false) {
return false;
}
}
return false;
}
/**
* Initializes the sliding window to length $len.
*
* @return string A string of length $len or false if the $resource contains
* less than $len characters.
*/
function buildWindow ($resource, $len) {
$result = '';
while ($len--) {
$result .= fgetc($resource);
}
return feof($resource) ? false : $result;
}
/**
* Slides $window by $count positions into $resource.
*
* @return string The new window or false on EOF.
*/
function slideWindow(&$window, $resource, $count) {
$window = substr($window, $count) // discard the first $count chars
. fread($resource, $count); // and read $count new chars
return feof($resource) ? false : $window;
}