I'm attempting to remove noise words from a string, and I have what I believe is a good algorithm for it, but I'm running into a snag. Before I do my preg_replace I remove all punctuation except the apostrophe ('). Then I put it through this preg_replace:
$content = preg_replace('/\b('.implode('|', self::$noiseWords).')\b/','',$content);
Which works great, except for words that do indeed have that ' character. preg_replace seems to be treating that as a boundary character. This is a problem for me.
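A minimal sketch of the problem, assuming a tiny hypothetical noise-word list (the real one is much longer). Because the apostrophe is not a word character, \b matches on both sides of it, so each half of a contraction is treated as a separate word:

```php
<?php
// Hypothetical, shortened noise-word list for illustration only.
$noiseWords = array('i', 've', 'it');

$content = "i've added it";

// \b matches between a word character and a non-word character,
// and the apostrophe counts as a non-word character.
$pattern = '/\b(' . implode('|', $noiseWords) . ')\b/';
$result = preg_replace($pattern, '', $content);

var_dump($result); // string(8) "' added "
```

Both "i" and "ve" are stripped out of "i've", leaving a stray apostrophe behind.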
Is there a way I can get around this? A different solution perhaps?
Thanks!
Here is the example I'm using:
$content = strtolower(strip_tags($content));
$content = preg_replace("/(?!['])\p{P}/u", "", $content);// remove punctuation
echo $content;// i've added striptags for editing as well should still workyep it doesnbsp
$content = preg_replace("/\b(?<!')(".implode('|', self::$noiseWords).")(?!')\b/",'',$content);
$contentArray = explode(" ", $content);
print_r($contentArray);
On the third line, the comment shows what $content contains right before the final preg_replace.
And though I'm assuming you can guess what my noiseWords array looks like, here's just a small fraction of it:
$noiseWords = array("a", "able","about","above","abroad","according","accordingly","across",
"actually","adj","after","afterwards","again",......)
You can use a negative lookbehind and a negative lookahead to make sure the match is not adjacent to a quote character:
$regex = "/\b(?<!')(".implode('|', self::$noiseWords).")(?!')\b/";
Now your regex will not match any noise word that is immediately preceded or followed by a single quote.
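Here is a self-contained sketch of that pattern in use. The short noise-word list and the sample string are hypothetical, and the preg_quote() call is an extra precaution that is not part of the snippet above:

```php
<?php
// Hypothetical, shortened noise-word list for illustration only.
$noiseWords = array('a', 'about', 'i', 've', 'it');

// preg_quote() is an added precaution (an assumption, not in the original):
// it escapes any regex metacharacters that a noise word might contain.
$alternatives = implode('|', array_map(function ($word) {
    return preg_quote($word, '/');
}, $noiseWords));

// (?<!') rejects a match that starts right after an apostrophe;
// (?!') rejects a match that ends right before one.
$regex = "/\\b(?<!')(" . $alternatives . ")(?!')\\b/";

$content = "i've read about it";
$result = preg_replace($regex, '', $content);

var_dump($result); // string(11) "i've read  "
```

The contraction "i've" survives intact, while "about" and "it" are removed. The removals leave extra spaces behind, so if you split the result afterwards, something like preg_split('/\s+/', $result, -1, PREG_SPLIT_NO_EMPTY) may be a better fit than explode(" ", ...), which would produce empty elements.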