Despite listing each ISO-8859 character set as an individual encoding, the mbstring functions treat every ISO-8859 character set interchangeably. To drive the point home:
$strings = [
'English' => 'Ea vim decore sapientem repudiandae. Sea cu delenit gamu mutn, tic.',
'Cyrillic' => 'Лорем ипсум долор сит амет, ин ехерци вереар номинати яуи, сит ин омниум инермис но.',
'Greek' => 'Λορεμ ιπσθμ δολορ σιτ αμετ, ηασ γραεcο νθσqθαμ cθ, εστ θτ εσσε διcαμ qθαλισqθε cθ.',
'Armenian' => 'լոռեմ իպսում դոլոռ սիթ ամեթ, եամ նո թաթիոն ծոմպռեհենսամ, իուս ադ նիսլ ոմնիս մինիմ եսթ',
'Georgian' => 'ლორემ იფსუმ დოლორ სით ამეთ, ეხ ყუანდო ცოფიოსაე უსუ, იუს ეუ ჰინც ვერო დომინგ ჰის',
'Hindi' => 'वर्ष एसेएवं व्याख्यान संदेश होने लक्षण एसेएवं पहोचाना विचरविमर्श? वर्णन करती आशाआपस अन्तरराष्ट्रीयकरन. रहारुप कार्यसिधान्त',
'Korean' => '모든 국민은 보건에 관하여 국가의 보호를 받는다, 전직대통령의 신분과 예우에 관하여는 법',
'Arabic' => 'مع لهذه الهجوم عدم, فكان اتفاق الصفحات من أسر. وجزر عُقر أما بـ, عل دار بقسوة المتّبعة بالولايات. وإقامة والفرنسي كل لكل. أي',
'Hebrew' => 'עמוד מדינות, חפש ואלקטרוניקה אנתרופולוגיה דת, מה קהילה הקהילה טכנו'
];
$encodings = ['ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15' ];
foreach( $strings as $lang => $text ) {
    echo $lang . " is encoded as " . mb_detect_encoding( $text, $encodings ) . "\n";
    foreach( $encodings as $encoding ) {
        echo " - is " . (mb_check_encoding( $text, $encoding ) ? "" : "not ") . $encoding . "\n";
    }
}
This produces output to the effect of:
Hindi is encoded as ISO-8859-1
- is ISO-8859-1
- is ISO-8859-2
- is ISO-8859-3
- is ISO-8859-4
- is ISO-8859-5
- is ISO-8859-6
- is ISO-8859-7
- is ISO-8859-8
- is ISO-8859-9
- is ISO-8859-10
- is ISO-8859-13
- is ISO-8859-14
- is ISO-8859-15
with identical results for every listed language, which clearly cannot be correct.
Why does mbstring list every ISO-8859 encoding separately but treat them interchangeably? Is there any way to reliably detect the proper spec? Or am I simply misusing these functions?
mb_detect_encoding makes a guess as to what the encoding might be. It is not possible for this sort of thing to be accurate (and this function doesn't do much to try).
mb_check_encoding tells you whether a string consists of a byte sequence that is valid for the given encoding. Given that every possible byte is valid in each ISO-8859-* encoding, it's pointless to validate against them (these checks will always return true).
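To illustrate the difference, here is a minimal sketch. The sample byte strings and variable names are made up for this example, and only ISO-8859-1 is checked. UTF-8 has structural rules, so validating against it can actually fail, whereas a single-byte encoding such as ISO-8859-1 accepts whatever bytes it is given. If you only need to tell UTF-8 apart from a legacy single-byte encoding, one common approach (not a general solution) is to try UTF-8 first in strict mode and fall back to whatever legacy encoding you expect:

```php
<?php
// Hypothetical sample data, written as raw bytes so the result does not
// depend on how this source file itself is saved.
$utf8Cyrillic = "\xD0\x9B\xD0\xBE"; // "Ло" encoded as UTF-8 (two 2-byte sequences)
$latin1Cafe   = "caf\xE9";          // "café" encoded as ISO-8859-1

// A UTF-8 string is still just a run of bytes, and every byte is a valid
// ISO-8859-1 character, so this check cannot fail.
$fitsLatin1 = mb_check_encoding($utf8Cyrillic, 'ISO-8859-1'); // true

// Checking against UTF-8 is meaningful: a lone 0xE9 lead byte with no
// continuation bytes is malformed.
$cyrillicIsUtf8 = mb_check_encoding($utf8Cyrillic, 'UTF-8');  // true
$cafeIsUtf8     = mb_check_encoding($latin1Cafe, 'UTF-8');    // false

// Strict detection with UTF-8 listed first: UTF-8 is ruled out because the
// bytes are malformed, leaving the single-byte fallback.
$detected = mb_detect_encoding($latin1Cafe, ['UTF-8', 'ISO-8859-1'], true);

var_dump($fitsLatin1, $cyrillicIsUtf8, $cafeIsUtf8, $detected);
```

Note that the fallback here is an assumption about where the data came from, not something the bytes can prove: the same bytes would pass just as well as ISO-8859-15, so the distinction between the ISO-8859 variants has to come from metadata (an HTTP header, a database setting, knowledge of the source system).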
For related reading I very much recommend: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets