Search code examples
phpmysqlsearchnon-ascii-characters

Searching Geoname database with non-latin characters


I have a copy of the Geonames database stored in a MySQL database, and a PHP application that allows users to search the database for their city. It works fine if they type the city name in English, but I want them to be able to search in their native language.

For example, instead of asking a Japanese speaker to search for Tokyo, they should be able to search for 東京.

The Geonames database contains an alternatenames column with, "alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)."

For example, the alternatenames value for the Tokyo row is Edo,TYO,Tochiu,Tocio,Tokija,Tokijas,Tokio,Tokió,Tokjo,Tokyo,Toquio,Toquio - dong jing,Toquio - æ±äº¬,Tòquio,Tókýó,Tóquio,TÅkyÅ,dokyo,dong jing,dong jing dou,tokeiyw,tokkiyo,tokyo,twkyw,twqyw,Τόκιο,Токио,Токё,Токіо,ÕÕ¸Õ¯Õ«Õ¸,טוקיו,توكيو,توکیو,طوكيو,ܛܘܟÜܘ,ܜܘܟÜܘ,टोकà¥à¤¯à¥‹,டோகà¯à®•à®¿à®¯à¯‹,โตเà¸à¸µà¸¢à¸§,ტáƒáƒ™áƒ˜áƒ,东京,æ±äº¬,æ±äº¬éƒ½,ë„ì¿„.

Those values don't contain 東京 exactly, but I'm guessing that they contain a form of it that has been encoded or converted in some way. So, I assuming that if I perform the same encoding/conversion on my search string, then I'll be able to match the row. For example:

mysql_query( sprintf( "
    SELECT * FROM geoname 
    WHERE 
        MATCH( name, asciiname, alternatenames ) 
        AGAINST ( %s )  
    LIMIT 1",
    iconv( 'UTF-8', 'ASCII', '東京' )
) );

The problem is that I don't know what that conversion would be. I've tried lots of combinations of iconv(), mb_convert_string(), etc, but with no luck.

The MySQL table looks like this:

CREATE TABLE `geoname` (
 `geonameid` int(11) NOT NULL DEFAULT '0',
 `name` varchar(200) DEFAULT NULL,
 `asciiname` varchar(200) DEFAULT NULL,
 `alternatenames` mediumtext,
 `latitude` decimal(10,7) DEFAULT NULL,
 `longitude` decimal(10,7) DEFAULT NULL,
 `fclass` char(1) DEFAULT NULL,
 `fcode` varchar(10) DEFAULT NULL,
 `country` varchar(2) DEFAULT NULL,
 `cc2` varchar(60) DEFAULT NULL,
 `admin1` varchar(20) DEFAULT NULL,
 `admin2` varchar(80) DEFAULT NULL,
 `admin3` varchar(20) DEFAULT NULL,
 `admin4` varchar(20) DEFAULT NULL,
 `population` int(11) DEFAULT NULL,
 `elevation` int(11) DEFAULT NULL,
 `gtopo30` int(11) DEFAULT NULL,
 `timezone` varchar(40) DEFAULT NULL,
 `moddate` date DEFAULT NULL,
 PRIMARY KEY (`geonameid`),
 KEY `timezone` (`timezone`),
 FULLTEXT KEY `namesearch` (`name`,`asciiname`,`alternatenames`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4

Can anyone point me in the right direction?


Solution

  • When I download the Japan file and set up a database like this:

    CREATE TABLE geonames (
        geonameid SERIAL,
        name varchar(200),
        asciiname varchar(200),
        alternatenames varchar(10000),
        latitude float,
        longitude float,
        featureclass varchar(1),
        featurecode varchar(10),
        countrycode varchar(2),
        cc2 varchar(200),
        admin1code varchar(20),
        admin2code varchar(80),
        admin3code varchar(20),
        admin4code varchar(20),
        population BIGINT,
        elevation INT,
        dem INT,
        timezone varchar(40),
        modificationdate DATE
        ) CHARSET utf8mb4;
    

    Then I load the data like this:

    LOAD DATA INFILE '/tmp/JP.txt' INTO TABLE geonames CHARACTER SET utf8mb4;
    

    And select it like this:

    SELECT alternatenames FROM geonames WHERE geonameid=1850147\G
    

    I get this:

    *************************** 1. row ***************************
    alternatenames: Edo,TYO,Tochiu,Tocio,Tokija,Tokijas,Tokio,Tokió,Tokjo,Tokyo,Toquio,Toquio - dong jing,Toquio - 東京,Tòquio,Tókýó,Tóquio,Tōkyō,dokyo,dong jing,dong jing dou,tokeiyw,tokkiyo,tokyo,twkyw,twqyw,Τόκιο,Токио,Токё,Токіо,Տոկիո,טוקיו,توكيو,توکیو,طوكيو,ܛܘܟܝܘ,ܜܘܟܝܘ,टोक्यो,டோக்கியோ,โตเกียว,ტოკიო,东京,東京,東京都,도쿄
    

    I can also do a search like this:

    SELECT name FROM geonames WHERE alternatenames LIKE '%,東京,%';
    

    Which is a long way of saying: Note the charset declaration when I created the table. I believe this is what you failed to do when you created your database.