I have a huge string and a needle, and I want to find the text in the string that most closely matches the needle. Both the string and the needle are Unicode text (Bengali). I have a few solutions, but they only work for English; I have found none that work for Unicode (Bengali) text. The following examples, in Romanian, illustrate the problem.
SOURCE: "Cei bătrâni fac o băutură toxică pentru regina joviană".
NEEDLE: "băutură pentru toxică "
OUTPUT: "băutură toxică pentru"
SOURCE: "Cei bătrâni fac o băutură toxică pentru regina joviană".
NEEDLE: "bătra pak o băuturărinan"
OUTPUT: "bătrâni fac o băutură"
I found that this can be done with similarity measures such as cosine or Manhattan similarity. However, I think implementing these algorithms would be difficult. Could you suggest an easy or fast way to do this, perhaps using a PHP library function that works with Unicode characters? TIA
I think the fastest way is the SphinxSearch engine.
It has a MySQL-like client, so you can run queries like this:
mysql> SELECT * FROM test WHERE MATCH('băutură pentru toxică');
The output is a list of documents ordered by best match.
==============================================================
Or try building a word-index table in PHP (this is a very simple algorithm that must be optimized for your needs):
function near( $src, $needle ) {
    // Index every lowercased source word by its sha1 hash, remembering its position.
    $hashIndexes = [];
    $words = mb_split( ' ', $src );
    foreach ( $words as $k => $w ) {
        $w = mb_strtolower( $w, 'utf-8' );
        $hashIndexes[ sha1( $w ) ] = [ 'key' => $k, 'word' => $w ];
    }

    // Collect the source positions of the needle words that occur in the source.
    $nWords = mb_split( ' ', mb_strtolower( $needle, 'utf-8' ) );
    $matches = [];
    foreach ( $nWords as $w ) {
        $hash = sha1( $w );
        if ( isset( $hashIndexes[ $hash ] ) && $w === $hashIndexes[ $hash ]['word'] ) {
            $matches[] = $hashIndexes[ $hash ]['key'];
        }
    }

    if ( empty( $matches ) ) {
        return '';
    }

    // Return the slice of the source between the first and last matched word.
    sort( $matches );
    $start = $matches[0];
    $last = end( $matches );
    return implode( ' ', array_slice( $words, $start, $last - $start + 1 ) );
}

$src = "Cei bătrâni fac o băutură some other toxică pentru regina joviană";
$needle = "băutură pentru another toxică";
echo near( $src, $needle ) . "\n"; // prints: băutură some other toxică pentru
==============================================================
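The function above only finds needle words that occur exactly in the source, so a misspelled needle such as the second example in the question would match very little. One possible extension, sketched here under assumptions, is to pick the nearest source word by edit distance. PHP's built-in levenshtein() compares bytes, so this sketch counts Unicode code points instead. The helper names mbLevenshtein and closestWord are hypothetical, mb_str_split requires PHP 7.4 or later, and combining marks (common in Bengali) count as separate characters here.

```php
<?php
// Hypothetical helpers (mbLevenshtein, closestWord); not part of the answer above.

// Levenshtein distance counted in Unicode code points rather than bytes.
function mbLevenshtein(string $a, string $b): int {
    $x = mb_str_split($a, 1, 'UTF-8'); // requires PHP 7.4+
    $y = mb_str_split($b, 1, 'UTF-8');
    $prev = range(0, count($y));       // distances from the empty prefix of $a
    foreach ($x as $i => $cx) {
        $cur = [$i + 1];
        foreach ($y as $j => $cy) {
            $cost = ($cx === $cy) ? 0 : 1;
            $cur[] = min(
                $prev[$j + 1] + 1,   // deletion
                $cur[$j] + 1,        // insertion
                $prev[$j] + $cost    // substitution or match
            );
        }
        $prev = $cur;
    }
    return $prev[count($y)];
}

// Return the candidate with the smallest distance to $word (first one wins on ties).
function closestWord(string $word, array $candidates): ?string {
    $best = null;
    $bestDist = PHP_INT_MAX;
    foreach ($candidates as $c) {
        $d = mbLevenshtein($word, $c);
        if ($d < $bestDist) {
            $bestDist = $d;
            $best = $c;
        }
    }
    return $best;
}

$words = mb_split(' ', "Cei bătrâni fac o băutură toxică pentru regina joviană");
echo closestWord("pak", $words), "\n"; // "fac" (distance 2)
```

Each needle word could be replaced by its closest source word before calling near(), probably with a distance threshold so that unrelated words are not forced to match. Comparing every needle word against every source word is slow for a huge string, which is another reason to consider a search engine.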
There is plenty of room for optimization (search around for ideas):
Remove punctuation (. , ... ? etc.) from the $words and $nWords arrays.
$hashIndexes[sha1($w)] must be an array (because sha1 may be the same for different words).
$hashIndexes[sha1($w)]['key'] must also be an array, to handle equal words in the text.
And I really recommend installing SphinxSearch or a similar text-search engine.
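The improvements listed above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: the names tokenize and nearIndexed are hypothetical, punctuation is dropped with a Unicode-aware preg_split, and the word itself is used as the array key instead of its sha1 hash, which sidesteps collisions. Each word maps to a list of all its positions.

```php
<?php
// Hypothetical helpers (tokenize, nearIndexed); a sketch, not the original answer's code.

// Split text into words, dropping punctuation such as . , ... ?
function tokenize(string $text): array {
    // \p{L} letters, \p{M} combining marks (e.g. Bengali vowel signs), \p{N} digits;
    // U+200C / U+200D (zero-width joiners) are kept inside words.
    $parts = preg_split('/[^\p{L}\p{M}\p{N}\x{200C}\x{200D}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

function nearIndexed(string $src, string $needle): string {
    $words = tokenize($src);

    // lowercased word => list of every position where it occurs in the source
    $index = [];
    foreach ($words as $pos => $w) {
        $index[mb_strtolower($w, 'UTF-8')][] = $pos;
    }

    // Gather all positions of needle words that are present in the source.
    $matches = [];
    foreach (tokenize(mb_strtolower($needle, 'UTF-8')) as $w) {
        if (isset($index[$w])) {
            $matches = array_merge($matches, $index[$w]);
        }
    }
    if ($matches === []) {
        return '';
    }

    // Slice from the first to the last matched position.
    $start = min($matches);
    $end = max($matches);
    return implode(' ', array_slice($words, $start, $end - $start + 1));
}

echo nearIndexed(
    "Cei bătrâni fac o băutură toxică, pentru regina joviană.",
    "băutură pentru toxică"
), "\n"; // băutură toxică pentru
```

Because every occurrence of a repeated word is collected, the returned span can become very wide in a long text; choosing the tightest window that covers the needle words is left out of this sketch.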