I am trying to optimize the search for a string in a large text file (300-600 MB). With my current method it is taking too long.
Currently I am using IndexOf to search for the string, but it takes way too long (around 20 s) to build an index of every line containing the string.
How can I speed up the search? I've tried Contains(), but that is slow as well. Any suggestions? I was thinking of a regex match, but I don't see that giving a significant speed boost. Maybe my search logic is flawed.
Example:
while ((line = myStream.ReadLine()) != null)
{
    // Case-insensitive substring test on each line.
    if (line.IndexOf(CompareString, StringComparison.OrdinalIgnoreCase) >= 0)
    {
        // Remember where this matching line starts and count it.
        LineIndex.Add(CurrentPosition);
        LinesCounted += 1;
    }
}
The brute-force approach you're using runs in O(nm) time in the worst case, where n is the length of the text being searched and m is the length of the substring/pattern you're looking for. You need a proper string-search algorithm (a small sketch follows the links below):
Boyer-Moore is "the standard", I think: http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
But there are lots more out there: http://www-igm.univ-mlv.fr/~lecroq/string/
including Morris-Pratt: http://www.stoimen.com/blog/2012/04/09/computer-algorithms-morris-pratt-string-searching/
and Knuth-Morris-Pratt: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
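If you want to experiment with one of these without pulling in a library, here is a minimal Boyer-Moore-Horspool sketch in C# (a simplified Boyer-Moore variant). The class and method names are mine, not part of any standard API, and for the question's case-insensitive search you would lower-case both the pattern and each line first, which is a simplification rather than a full culture-aware comparison:

using System;

static class Horspool
{
    // Boyer-Moore-Horspool: returns the index of the first occurrence of
    // pattern in text, or -1 if there is none. Ordinal (char-by-char) match.
    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;
        if (text.Length < pattern.Length) return -1;

        // Bad-character table: how far the window may skip when the character
        // aligned with the end of the pattern causes a mismatch.
        var shift = new int[char.MaxValue + 1];
        for (int c = 0; c <= char.MaxValue; c++)
            shift[c] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            // Compare right-to-left against the current window.
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j])
                j--;
            if (j < 0)
                return pos; // full match found at pos
            // Skip based on the text character under the last pattern position.
            pos += shift[text[pos + pattern.Length - 1]];
        }
        return -1;
    }
}

Whether this actually beats the built-in ordinal IndexOf depends on the pattern length and the data; the skipping pays off mostly for longer patterns, so measure it on your own file before committing to it.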
However, a carefully crafted regular expression might be sufficient, depending on what you are trying to find. See Jeffrey Friedl's tome, Mastering Regular Expressions, for help on building efficient regular expressions (e.g., avoiding excessive backtracking).
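For completeness, the regex route inside the question's loop might look roughly like this, with the literal pattern escaped and compiled once and reused for every line. It reuses the question's variables (myStream, CompareString, LineIndex, CurrentPosition, LinesCounted) as-is, and whether it beats IndexOf in practice is something only a measurement will tell:

using System.Text.RegularExpressions;

// Build the pattern once, outside the loop: escaped so it matches literally,
// compiled for faster repeated use, case-insensitive like the original call.
var pattern = new Regex(Regex.Escape(CompareString),
                        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

while ((line = myStream.ReadLine()) != null)
{
    if (pattern.IsMatch(line))
    {
        LineIndex.Add(CurrentPosition);
        LinesCounted += 1;
    }
}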
You might also want to consult a good algorithms text. I'm partial to Robert Sedgewick's Algorithms in its various incarnations (Algorithms in [C|C++|Java]).