I have a 20MB flat file database with about 500k lines. Only [a-z0-9-]
characters are allowed, each line averages 7 words, and there are no empty or duplicate lines.
Flat file database:
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
I'm searching for whole words only
and extracting the first 10k results
from this db.
So far this code works OK if the 10k matches are found in, let's say, the first 20k lines of the db, but if the word is rare, the script must search all 500k lines, and this is 10 times slower.
Settings:
$cats = file("cats.txt", FILE_IGNORE_NEW_LINES);
$search = "end";
$limit = 10000;
Search:
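$cats_found = array();
foreach ($cats as $cat) {
    if (preg_match("/\b$search\b/", $cat)) {
        $cats_found[] = $cat;
        // stop once $limit results are collected (index $limit - 1 is the last one)
        if (isset($cats_found[$limit - 1])) break;
    }
}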
My PHP skills and knowledge are limited, and I don't know how to use SQL, so this is the best I can do, but I need some advice.
Thanks for reading this, and sorry for my bad English; it is my 3rd language.
If most of the lines don't contain the searched word, you could execute preg_match()
less often, like so:
foreach ($lines as $line) {
    // fast prefilter...
    if (strpos($line, $word) === false) {
        continue;
    }
    // ... then proper search if the line passed the prefilter
    if (preg_match("/\b{$word}\b/", $line)) {
        // found
    }
}
That said, you would need to benchmark it on your real data to see whether it actually helps.
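
For reference, here is a minimal sketch that combines the prefilter with the result limit from the question and times one search. The prefilter is safe because any whole-word match is also a substring match, so a line rejected by strpos() could never pass the regex. The function name find_whole_word_lines() is made up for this example, the preg_quote() call is only a precaution (the data is said to contain only [a-z0-9-]), and the three sample lines are the ones shown in the question rather than the real file.

```php
<?php
// Hypothetical helper (not from the original post): returns up to $limit
// lines that contain $word as a whole word.
function find_whole_word_lines(array $lines, string $word, int $limit): array
{
    $found = [];
    if ($word === '' || $limit <= 0) {
        return $found;
    }

    // Build the pattern once, outside the loop. preg_quote() escapes any
    // regex metacharacters in the search word.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';

    foreach ($lines as $line) {
        // Fast prefilter: skip lines that do not even contain the substring.
        if (strpos($line, $word) === false) {
            continue;
        }
        // Proper whole-word check, only for lines that passed the prefilter.
        if (preg_match($pattern, $line) === 1) {
            $found[] = $line;
            if (count($found) >= $limit) {
                break; // enough results collected
            }
        }
    }

    return $found;
}

// Sample lines taken from the question; in practice this would be
// file("cats.txt", FILE_IGNORE_NEW_LINES).
$lines = [
    'put-returns-between-paragraphs',
    'for-linebreak-add-2-spaces-at-end',
    'indent-code-by-4-spaces-indent-code-by-4-spaces',
];

// Rough timing of a single search; repeat it several times on the real
// file, with both common and rare words, before drawing conclusions.
$start = microtime(true);
$hits = find_whole_word_lines($lines, 'end', 10000);
$elapsed = microtime(true) - $start;

print_r($hits);
printf("%d match(es) in %.6f s\n", count($hits), $elapsed);
```

To compare, run the same timing with the strpos() check commented out. How much the prefilter gains depends on how often the word appears as a substring, which is why measuring on the real 500k-line file matters.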