I've created a function to get plain text from HTML by stripping out JavaScript, CSS, HTML tags, etc. For that I've relied on PHP's preg_replace function to remove certain patterns. The web pages are already stored on the hard disk, so I'm reading the source code from disk. The function works properly for source code from a single file; however, if I append the source code of multiple files and pass it to my function, preg_replace fails and returns FALSE. I tried get_last_error, but nothing was reported. I'm also trimming the source code before concatenating (to remove EOFs).
Please also tell me how regular expressions are implemented on Windows because unlike Linux there is no grep on Windows.
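When you have long HTML files, the preg family of functions can fail (on error, preg_replace returns NULL and preg_match returns FALSE) because of PHP's PCRE backtrack limit, controlled by the pcre.backtrack_limit setting (see http://bugs.php.net/bug.php?id=40846). Calling preg_last_error() right after the failed call tells you which limit was hit.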
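To see why the call fails, a minimal sketch along these lines may help. The helper name strip_tags_checked is made up for illustration, and the limit value passed to ini_set is only an example, not a recommendation:

```php
<?php
// Hypothetical helper (not from the question): strips tags and reports
// PCRE failures instead of silently passing on a null result.
function strip_tags_checked(string $html): string
{
    $result = preg_replace('/<[^>]+>/', '', $html);

    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        $code = preg_last_error();
        $reason = ($code === PREG_BACKTRACK_LIMIT_ERROR)
            ? 'backtrack limit exhausted'
            : "PCRE error code $code";
        throw new RuntimeException("preg_replace failed: $reason");
    }

    return $result;
}

// Raising the limit is a workaround rather than a fix; the value here is
// an arbitrary example.
ini_set('pcre.backtrack_limit', '10000000');

echo strip_tags_checked('<p>Hello</p>'), "\n";
```

Checking for null with === matters here, because an empty string result is also falsy but is not an error.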
You could try to work on smaller portions of the files and concatenate them after stripping the tags.
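A rough sketch of that per-file approach could look like this. Both function names are hypothetical, and html_to_text only stands in for your own stripping function; the patterns are assumptions, not your originals:

```php
<?php
// Hypothetical stand-in for the asker's function: removes script and style
// blocks (including their contents), then any remaining tags.
function html_to_text(string $html): string
{
    $patterns = [
        '#<script\b[^>]*>.*?</script>#is', // still lazy, but each input is one file
        '#<style\b[^>]*>.*?</style>#is',
        '/<[^>]+>/',
    ];
    $text = preg_replace($patterns, ' ', $html);
    if ($text === null) {
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    // Collapse runs of whitespace left behind by the removed markup.
    $text = preg_replace('/\s+/', ' ', $text) ?? $text;
    return trim($text);
}

// Hypothetical driver: strip each file on its own, then join the plain text.
function files_to_text(array $paths): string
{
    $parts = [];
    foreach ($paths as $path) {
        $html = file_get_contents($path);
        if ($html === false) {
            continue; // unreadable file: skip it in this sketch
        }
        $parts[] = html_to_text($html);
    }
    return implode("\n", $parts);
}
```

Usage would be something like files_to_text(['a.html', 'b.html']), so the regex engine only ever sees one file's worth of input at a time.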
Also, you could optimize your regular expressions to backtrack less, especially if they rely heavily on .* . For example,
/<.*?>/
Could be optimized as
/<[^>]+>/
And so on.
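Here is a small comparison of the two patterns on a made-up sample string. Besides the backtracking difference, note that . does not match a newline unless the s modifier is used, so the lazy version also misses tags that span lines, while the negated class handles them:

```php
<?php
// Sample input (invented for illustration) with a tag split across two lines.
$html = "<p class=\"x\">Hello <a\nhref=\"#\">world</a></p>";

// Lazy dot: cannot cross the newline inside the <a ...> tag.
$lazy = preg_replace('/<.*?>/', '', $html);

// Negated character class: consumes anything up to the next '>'.
$negated = preg_replace('/<[^>]+>/', '', $html);

var_dump($lazy);    // the multi-line <a ...> tag is left in place
var_dump($negated); // string(11) "Hello world"
```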