php sorting unicode multibyte-functions multibyte-characters

How to sort Unicode strings using a predefined alphabet?

I have a MySQL table of words in Unicode that use signs such as ḥ, ḫ, š, etc. The columns in the table are defined as utf8mb4_general_ci and handle these signs correctly.

In the header of the webpage I put

<meta http-equiv="Content-Type" content="text/html; charset=utf8mb4">

This webpage contains a form that sends data to a PHP page. At the beginning of the PHP page I put:

mysqli_set_charset($con,"utf8mb4");

On this page, I run a MySQL query and get back an array ($result). This array must be sorted by its keys, using a lookup array of characters that I have built, which includes both single-byte and multibyte characters.

This is the array:

Array ( 
[nṯr] => Array ( [0] => Ka.C.Coptite.urkVIII,176b [1] => Ka.C.Coptite.urkVIII,177,1 ) 
[n] => Array ( [0] => Ka.C.Coptite.urkVIII,176c [1] => Ka.C.Coptite.urkVIII,177,1 [2] => Ka.C.Coptite.urkVIII,177,2 ) 
[nḫȝḫȝ] => Array ( [0] => Ka.C.Coptite.urkVIII,176c ) 
[nwj] => Array ( [0] => Ka.C.Coptite.urkVIII,176c ) 
[nfr] => Array ( [0] => Ka.C.Coptite.urkVIII,176c [1] => Ka.C.Coptite.urkVIII,177,2 ) 
[nḥḥ] => Array ( [0] => Ka.C.Coptite.urkVIII,176e [1] => Ka.C.Coptite.urkVIII,177,1 [2] => Ka.C.Coptite.urkVIII,177,1 ) 
[nḏ] => Array ( [0] => Ka.C.Coptite.urkVIII,177,1 ) 
)

What I do is:

uksort($result, 'compare_keys_by_alphabet');

This refers to the function:

function compare_keys_by_alphabet($a, $b)
{
    static $alphabet = array( 1 => "-" , 2 => "," , 3 => ".", 4 => "ȝ", 5 => "j", 6 => "ʿ", 7 => "w", 8 => "b", 9 => "p", 10 => "f", 11 => "m", 12 => "n", 13 => "r", 14 => "h", 15 => "ḥ", 16 => "ḫ", 17 => "ẖ", 18 => "s", 19 => "š", 20 => "q", 21 => "k", 22 => "g", 23 => "t", 24 => "ṯ", 25 => "d", 26 => "ḏ", 27 => "⸗", 28 => "/", 29 => "(", 30 => ")", 31 => "[", 32 => "]", 33 => "<", 34 => ">", 35 => "{", 36 => "}", 37 => "'", 38 => "*", 39 => "#", 40 => "I", 41 => "0", 42 => "1", 43 => "2", 44 => "3", 45 => "4", 46 => "5", 47 => "6", 48 => "7", 49 => "8", 50 => "9", 51 => "&", 52 => "@", 53 => "%");

    return compare_by_alphabet($alphabet, $a, $b);
}

using:

function compare_by_alphabet(array $alphabet, $str1, $str2) {
    $c = max(strlen($str1), strlen($str2));

    for ($i = 0; $i < $c; $i++) {
        $s1 = $str1[$i];
        $s2 = $str2[$i];
        //if ($s1===$s2) continue;
        $i1 = array_search($s1, $alphabet);
        //if ($i1===false) continue;
        $i2 = array_search($s2, $alphabet);
        //if ($i2===false) continue;
        if ($i2==$i1) continue;
        if ($i1 < $i2) return -1;
        else return 1;
    }
    return 0;
}

This worked perfectly with the non-Unicode alphabet:

static $alphabet2 = array( 1 => '-' , 2 => ',' , 3 => '.' , 4 => "A", 5 => "j", 6 => "a", 7 => "w", 8 => "b", 9 => "p", 10 => "f", 11 => "m", 12 => "n", 13 => "r", 14 => "h", 15 => "H", 16 => "x", 17 => "X", 18 => "s", 19 => "S", 20 => "q", 21 => "k", 22 => "g", 23 => "t", 24 => "T", 25 => "d", 26 => "D", 27 => "=", 28 => "/", 29 => "(", 30 => ")", 31 => "[", 32 => "]", 33 => "<", 34 => ">", 35 => "{", 36 => "}", 37 => "'", 38 => "*", 39 => "#", 40 => "I", 41 => "1", 42 => "2", 43 => "3", 44 => "4", 45 => "5", 46 => "6", 47 => "7", 48 => "8", 49 => "9", 50 => "0", 51 => "&", 52 => "@", 53 => "%");

but once I replaced, for example, H (no. 15) in $alphabet2 with ḥ as in the first alphabet, it stopped working.

I suppose it has to do with how the Unicode characters are recognized, because as long as the words do not contain any special signs, the order is correct; but all words containing special signs are put at the beginning of the result.

I tried to look at unicode normalization; but I'm really only an amateur, so this is quite difficult.

Is this the problem, or is there another problem? How can I fix it?

Solution

I've left all of my testing echoes in my code block and merely commented them out in case you wanted to see what is being generated throughout the process.

I took some liberties with your code. I didn't like one function calling another, so I merged them into a single function, and I condensed your lookup array into a string with a leading space. The leading space has the same effect as starting your indexed array from 1. Converting the lookup from an array to a string means I can use mb_strpos() instead of array_search().

The crucial fix to your code is in the loop, specifically where it accesses the letters with [$i]. That syntax returns a single byte, so you cannot use it on multibyte characters -- you must use mb_substr() to access the "whole" letter.
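
To make the byte-versus-character difference concrete, here is a minimal sketch. The shortened $alphabet is a hypothetical stand-in for the real lookup, and the \u{...} escapes assume PHP 7 or later; they are used only so the exact code point of ḥ (U+1E25) is unambiguous.

```php
<?php
// "nḥḥ": 'n' is 1 byte in UTF-8, 'ḥ' (U+1E25) is 3 bytes.
$word = "n\u{1E25}\u{1E25}";

$byte_length = strlen($word);              // counts bytes
$char_length = mb_strlen($word, 'UTF-8');  // counts characters

$second_byte = $word[1];                         // only the first byte of 'ḥ'
$second_char = mb_substr($word, 1, 1, 'UTF-8');  // the whole character 'ḥ'

// Shortened, hypothetical lookup array (keys start at 1, as in the question).
$alphabet = [1 => 'n', 2 => "\u{1E25}"];

$byte_rank = array_search($second_byte, $alphabet, true);  // lone byte is not found
$char_rank = array_search($second_char, $alphabet, true);  // whole character is found

var_dump($byte_length, $char_length, $byte_rank, $char_rank);
// int(7) int(3) bool(false) int(2)
```

Because a lone byte is never found, array_search() returns false, and false compares as lower than any real position. That is why every word containing a special sign ended up at the beginning of the result.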

Giving $alphabet and $encoding default values means you don't have to write a second "helper" function to pass in all of the necessary data. uksort() passes its expected two arguments, the defaults supply the rest, and everything goes ahead smoothly.
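
As a small illustration of that point, here is a sketch with a deliberately tiny comparator. The function name by_custom_order and its three-letter orders are hypothetical, chosen only to show the mechanism; the closure at the end is one possible way to supply a non-default value if you ever need to.

```php
<?php
// Hypothetical comparator: $order has a default value, so uksort() can
// call it with just the two keys it always passes.
function by_custom_order($a, $b, $order = "cba") {
    return strpos($order, $a) - strpos($order, $b);
}

$with_default = ['a' => 1, 'b' => 2, 'c' => 3];
uksort($with_default, 'by_custom_order');  // uses the default order "cba"

// To use another order, wrap the call in a closure that fills in the third argument.
$order = "bac";
$with_closure = ['a' => 1, 'b' => 2, 'c' => 3];
uksort($with_closure, function ($a, $b) use ($order) {
    return by_custom_order($a, $b, $order);
});

echo implode(',', array_keys($with_default)), "\n";  // c,b,a
echo implode(',', array_keys($with_closure)), "\n";  // b,a,c
```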

One final piece of advice is: mb_ functions are expensive, so always try to return in your code as soon as possible and leave the mb_ functions farther "downscript" whenever logically possible.

Here is my suggested code:

function alphabetize_custom($a, $b, $alphabet = " -,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ⸗/()[]<>{}'*#I0123456789&@%", $encoding = 'UTF-8') {
    //echo "\n----\n$a =vs= $b";
    $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
    for ($i = 0; $i < $mb_length; ++$i) {
        //echo "\n";
        $a_char = mb_substr($a, $i, 1, $encoding);
        $b_char = mb_substr($b, $i, 1, $encoding);
        //echo "$a_char -vs- $b_char\n";
        //echo "(" , mb_strlen($a_char, $encoding), " & ", mb_strlen($b_char, $encoding), ")\n";
        if ($a_char === $b_char) {/*echo "identical, continue";*/ continue;}
        if (!mb_strlen($a_char, $encoding)) { /* echo "a is empty -1";*/ return -1;}
        if (!mb_strlen($b_char, $encoding)) { /*echo "b is empty 1";*/ return 1;}
        $a_offset = mb_strpos($alphabet, $a_char, 0, $encoding);
        $b_offset = mb_strpos($alphabet, $b_char, 0, $encoding);
        //echo "[" , $a_offset, " & ", $b_offset, "]\n";
        if ($a_offset == $b_offset) { /*echo "== offsets, continue";*/ continue;}
        if ($a_offset < $b_offset) { /*echo "a offset -1";*/ return -1;}
        //echo "b offset 1";
        return 1;
    }
    //echo "0";
    return 0;
}

$result = [
    "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
    "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
    "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
    "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
    "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
    "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
    "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
];

uksort($result, 'alphabetize_custom');

var_export($result);

Output:

array (
  'n' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
    2 => 'Ka.C.Coptite.urkVIII,177,2',
  ),
  'nwj' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
  ),
  'nfr' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
    1 => 'Ka.C.Coptite.urkVIII,177,2',
  ),
  'nḥḥ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176e',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
    2 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
  'nḫȝḫȝ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
  ),
  'nṯr' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176b',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
  'nḏ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
)

Just for comparison's sake, I wrote an alternative code block that uses array_search(), as your original code does. Not surprisingly, it appears to be more efficient according to the speed tests on 3v4l.org. This is likely due to the removal of four mb_ function calls per iteration (two mb_strlen() and two mb_strpos()), which I previously described as "expensive". The following snippet produces the same output.

Code:

function alphabetize_custom($a, $b) {
    $alphabet = [' ', '-', ',', '.', 'ȝ', 'j', 'ʿ', 'w', 'b', 'p', 'f', 'm', 'n', 'r', 'h', 'ḥ', 'ḫ', 'ẖ', 's', 'š', 'q', 'k', 'g', 't', 'ṯ', 'd', 'ḏ', '⸗', '/', '(', ')', '[', ']', '<', '>', '{', '}', "'", '*', '#', 'I', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '&', '@', '%'];
    unset($alphabet[0]);  // removes dummy first key, effectively starting the keys from 1
    $encoding = 'UTF-8';

    $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
    for ($i = 0; $i < $mb_length; ++$i) {
        $a_char = mb_substr($a, $i, 1, $encoding);
        $b_char = mb_substr($b, $i, 1, $encoding);
        if ($a_char === $b_char) continue;

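        // array_search() returns false when a string has run out of characters
        // (or the character is not in $alphabet); false counts as 0 in the
        // subtraction below, so it sorts before every real key (1 and up).
        $a_key = array_search($a_char, $alphabet);
        $b_key = array_search($b_char, $alphabet);
        if ($a_key === $b_key) continue;

        return $a_key - $b_key;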
    }
    return 0;
}

$result = [
    "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
    "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
    "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
    "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
    "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
    "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
    "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
];

uksort($result, 'alphabetize_custom');

var_export($result);
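
If you want to push the "fewer mb_ calls" idea further, one possible variation is to split each string into characters once and look positions up in a precomputed map. This is only a sketch under a few assumptions: PHP 7 or later, a hypothetical helper named make_alphabet_comparator(), 0-based ranks (so no leading space in the alphabet), and characters missing from the alphabet sorting first. I have not benchmarked it, so treat any speed gain as unverified.

```php
<?php
// Hypothetical helper: builds a comparator closure for a given alphabet string.
function make_alphabet_comparator(string $alphabet): callable {
    // Split once into whole UTF-8 characters, then map character => position.
    $ranks = array_flip(preg_split('//u', $alphabet, -1, PREG_SPLIT_NO_EMPTY));

    return function ($a, $b) use ($ranks) {
        $a_chars = preg_split('//u', (string) $a, -1, PREG_SPLIT_NO_EMPTY);
        $b_chars = preg_split('//u', (string) $b, -1, PREG_SPLIT_NO_EMPTY);
        $shared = min(count($a_chars), count($b_chars));

        for ($i = 0; $i < $shared; ++$i) {
            if ($a_chars[$i] === $b_chars[$i]) continue;
            // Assumption: characters missing from the alphabet get rank -1 (sort first).
            $a_rank = $ranks[$a_chars[$i]] ?? -1;
            $b_rank = $ranks[$b_chars[$i]] ?? -1;
            if ($a_rank !== $b_rank) return $a_rank <=> $b_rank;
        }
        // Every shared position ties: the shorter string sorts first.
        return count($a_chars) <=> count($b_chars);
    };
}

$compare = make_alphabet_comparator("-,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ⸗/()[]<>{}'*#I0123456789&@%");

// Same keys as above; the values are shortened to placeholders.
$result = ["nṯr" => 1, "n" => 2, "nḫȝḫȝ" => 3, "nwj" => 4, "nfr" => 5, "nḥḥ" => 6, "nḏ" => 7];
uksort($result, $compare);

echo implode(' ', array_keys($result)), "\n";
// n nwj nfr nḥḥ nḫȝḫȝ nṯr nḏ
```

The closure keeps the rank map alive between comparisons, so the alphabet is split only once per sort rather than searched on every character.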