I'm uploading images into a small CMS on my PHP server, and I now have a file called "1372609671-Terrassenböden Watrawood.jpg" that causes serious problems. I downloaded everything to my Mac and debugged it down to the following:
In my MySQL table everything seems fine: the "ö" appears as "ö", and I can find the file when I run a search query with the exact filename:
But my PHP code fails when it runs the same query. When I get the filename from the filesystem with readdir(), the resulting query looks strange:
As you can see, the "ö" is no real "ö" anymore. It is slightly bigger, but not as big as a capital "Ö". Even the cursor behaves oddly: I can stop in the middle of the character, and when I then hit Backspace, it first deletes the dots above it and, on the second press, the remaining "o".
When I convert the filename using, e.g., rawurlencode(), I get this:
You can see an "o" before the UTF-8 stuff starts, and then a %CC giving the dots and a %88 giving a kind of space. What the hell is this? How can I get this down to a simple UTF-8 "ö"? Using this for a search query will be useless. :-/
For more details, the database schema:
CREATE SCHEMA IF NOT EXISTS `cms` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci ;
DROP TABLE IF EXISTS `upload`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `upload` (
`id` int(11) NOT NULL auto_increment,
`file_name` varchar(255) NOT NULL,
`file_type` varchar(20) NOT NULL,
`file_path` varchar(255) NOT NULL,
`timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
`session_id` varchar(45) default NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=8965 DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
So far, everything else in my CMS is UTF-8:
<meta charset="utf-8">
There's nothing wrong with what you have here. It's an o followed by U+0308 COMBINING DIAERESIS, which is a correct way to produce an ö. It's called a "decomposed form", while U+00F6 LATIN SMALL LETTER O WITH DIAERESIS is a "composed form". Decomposed forms are more general, while not every character has a composed form (they mostly exist for backwards compatibility). There's nothing not "real" about the decomposed form, and if it displays wrong in your editor it's only because your editor has poor Unicode support. When it comes to searching, again, any correctly-working search engine should treat U+006F U+0308 exactly the same as U+00F6.
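To make the two forms concrete, here is a minimal sketch that builds both spellings of "ö" and URL-encodes them. The variable names are only for illustration, and the "\u{...}" escapes assume PHP 7 or later. Note that %CC%88 is a single code point (U+0308) encoded as two UTF-8 bytes, not "dots" plus "a space".

```php
<?php
// Two ways to write "ö" (the "\u{...}" escape syntax needs PHP 7+).
$composed   = "\u{00F6}";   // U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
$decomposed = "o\u{0308}";  // U+006F followed by U+0308 COMBINING DIAERESIS

echo rawurlencode($composed), "\n";    // %C3%B6
echo rawurlencode($decomposed), "\n";  // o%CC%88

// A byte-wise comparison treats them as two different strings,
// which is why an exact-match query built from one form misses the other.
var_dump($composed === $decomposed);   // bool(false)
```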
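However, if you do need to work with broken stuff, what you want is Unicode Normalization, provided in PHP by the Normalizer class (part of the intl extension). NFC should give you the form you expect; NFKC composes as well, but it also rewrites compatibility characters such as ligatures, which you probably don't want in a filename.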
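As a rough sketch of what that could look like, assuming the intl extension is loaded: the helper name normalize_filename() is made up for this example, and it uses Normalizer::FORM_C (NFC), which composes without applying compatibility mappings. Normalizer::FORM_KC would compose this particular string in the same way.

```php
<?php
// Requires the intl extension, which provides the Normalizer class.
// normalize_filename() is a hypothetical helper, not part of PHP.
function normalize_filename(string $name): string
{
    $normalized = Normalizer::normalize($name, Normalizer::FORM_C);
    // normalize() returns false on failure (e.g. invalid UTF-8);
    // in that case fall back to the unchanged input.
    return $normalized === false ? $name : $normalized;
}

// Decomposed filename, as readdir() might return it on a Mac.
$fromDisk = "1372609671-Terrassenbo\u{0308}den Watrawood.jpg";
$clean    = normalize_filename($fromDisk);

echo rawurlencode($clean), "\n"; // 1372609671-Terrassenb%C3%B6den%20Watrawood.jpg
```

Normalizing the filename once, right after reading it from the filesystem and before building the query, should be enough for it to match the composed form stored in the table.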