Tags: php, special-characters, zend-search-lucene, file-conversion, read-unread

How to remove unreadable characters from text using PHP?


Hi, I am feeding content to zend_search_lucene. It can find words that appear before the special characters, but nothing after them is searchable.

For example:

    very well to the other job boards � one of the main things that has impressed is the variety of the applications, especially with regards to the background of the candidates" manoj � Head 

If I search for 'boards' I get a match, but if I search for 'one' or any other string after the unreadable characters, I get nothing.

How can I remove these characters so that I am left with plain text?

I get these kinds of characters when converting .docx/.pdf files to text.

Alternatively, let me know how to feed only plain text to zend_search_lucene.

Please help.


Solution

  • You can use the following preg_replace call to remove all non-ASCII (so-called special) characters from your string. Note that it also strips legitimate non-ASCII text such as accented letters:

    $replaced = preg_replace('/[^\x00-\x7F]+/', '', $str);
    // produces this converted text:
    //    "very well to the other job boards  one of the main things that has impressed
    // is the variety of the applications, especially with regards to the background of the
    // candidates" manoj  Head"
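
    As the output above shows, removing the characters leaves double spaces behind. A minimal sketch of a helper that applies the same pattern and then tidies the whitespace might look like this. The function name clean_text is hypothetical, and it assumes that collapsing all whitespace runs (including newlines) into a single space is acceptable for text that will only be indexed:

```php
<?php
// Hypothetical helper (not part of PHP or Zend): strips non-ASCII bytes,
// then collapses the whitespace runs that the removal leaves behind.
function clean_text(string $str): string
{
    // Same pattern as above. Without the /u modifier it works byte by byte,
    // so every byte outside 0x00-0x7F is dropped.
    $ascii = preg_replace('/[^\x00-\x7F]+/', '', $str) ?? '';

    // Assumption: any run of whitespace can be replaced by one space.
    $collapsed = preg_replace('/\s+/', ' ', $ascii) ?? '';

    return trim($collapsed);
}

// "\xEF\xBF\xBD" is the UTF-8 encoding of the replacement character shown in the question.
$raw = "job boards \xEF\xBF\xBD one of the main things";
echo clean_text($raw), PHP_EOL; // prints: job boards one of the main things
```

    The cleaned string can then be passed to the indexer in place of the raw converted text.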