Search code examples
javaalgorithmstringsearchstream

Efficient way to search a stream for a string


Let's suppose that have a stream of text (or Reader in Java) that I'd like to check for a particular string. The stream of text might be very large so as soon as the search string is found I'd like to return true and also try to avoid storing the entire input in memory.

Naively, I might try to do something like this (in Java):

public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    char[] buffer = new char[1024];
    int numCharsRead;
    while((numCharsRead = reader.read(buffer)) > 0) {
        if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
            return true;
    }
    return false;
}

Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:

Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"

How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?

Edit: Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.


Solution

  • I did a few changes to the Knuth Morris Pratt algorithm for partial searches. Since the actual comparison position is always less or equal than the next one there is no need for extra memory. The code with a Makefile is also available on github and it is written in Haxe to target multiple programming languages at once, including Java.

    I also wrote a related article: searching for substrings in streams: a slight modification of the Knuth-Morris-Pratt algorithm in Haxe. The article mentions the Jakarta RegExp, now retired and resting in the Apache Attic. The Jakarta Regexp library “match” method in the RE class uses a CharacterIterator as a parameter.

    class StreamOrientedKnuthMorrisPratt {
        var m: Int;
        var i: Int;
        var ss:
        var table: Array<Int>;
    
        public function new(ss: String) {
            this.ss = ss;
            this.buildTable(this.ss);
        }
    
        public function begin() : Void {
            this.m = 0;
            this.i = 0;
        }
    
        public function partialSearch(s: String) : Int {
            var offset = this.m + this.i;
    
            while(this.m + this.i - offset < s.length) {
                if(this.ss.substr(this.i, 1) == s.substr(this.m + this.i - offset,1)) {
                    if(this.i == this.ss.length - 1) {
                        return this.m;
                    }
                    this.i += 1;
                } else {
                    this.m += this.i - this.table[this.i];
                    if(this.table[this.i] > -1)
                        this.i = this.table[this.i];
                    else
                        this.i = 0;
                }
            }
    
            return -1;
        }
    
        private function buildTable(ss: String) : Void {
            var pos = 2;
            var cnd = 0;
    
            this.table = new Array<Int>();
            if(ss.length > 2)
                this.table.insert(ss.length, 0);
            else
                this.table.insert(2, 0);
    
            this.table[0] = -1;
            this.table[1] = 0;
    
            while(pos < ss.length) {
                if(ss.substr(pos-1,1) == ss.substr(cnd, 1))
                {
                    cnd += 1;
                    this.table[pos] = cnd;
                    pos += 1;
                } else if(cnd > 0) {
                    cnd = this.table[cnd];
                } else {
                    this.table[pos] = 0;
                    pos += 1;
                }
            }
        }
    
        public static function main() {
            var KMP = new StreamOrientedKnuthMorrisPratt("aa");
            KMP.begin();
            trace(KMP.partialSearch("ccaabb"));
    
            KMP.begin();
            trace(KMP.partialSearch("ccarbb"));
            trace(KMP.partialSearch("fgaabb"));
    
        }
    }