Search code examples
phpmysqlcsvutf-8ansi

Read .csv ANSI file and write to mysql


I have one problem. I have excel file saved as CSV and I need to read that file with PHP and insert into mysql but problem is with char set specifically čćšđž. I tried utf8_encode() and almost everything I could think of.

Examle:

It inserts "Petroviæ" but it should be "Petrović"

EDIT:

<?php

mysql_connect("localhost", "user", "pw");  
mysql_select_db("database");  


$fajl = "Prodajna mreza.csv"; 
$handle = @fopen($fajl, "r");
if ($handle) {
    $size = filesize($fajl);

    if(!$size) {
        echo "File is empty.\n";
        exit;
    }

    $csvcontent = fread($handle,$size);
    $red = 1;

    foreach(explode("\n",$csvcontent) as $line) {
        if(strlen($line) <= 20)
        {
            $red++;
            continue;
        }
        if($red == 1)
        {
            $red++;
            continue;
        }
        $nesto = explode(",", $line);

        if($nesto[0] == '')
            continue;

        mysql_query("INSERT INTO table(val1, val2, val3, val4, val5, val6, val7, val8) VALUES ('".$nesto[0]."','".$nesto[1]."','".$nesto[2]."','".$nesto[3]."','".$nesto[4]."','".$nesto[5]."','".$nesto[6]."','".$nesto[7]."')");  

        $red++;
    }
    fclose($handle);
}
mysql_close();
?>

Solution

  • First off: Using this mysql extension is discouraged. So you might want to switch to something else. Also notice that the way you compose your query by simply pasting strings makes it vulnerable to SQL injection attacks. You should only do this if you are really really sure that there won't be any ugly surprises in the content of the files you read.

    It appears that neither your file reading nor the client-side mysql code does anything related to charset conversion, so I'd assume that those simply pass on bytes, without caring about their interpretation. So you only have to make sure that the server interprets those bytes correctly.

    Judging from the example you gave, where a ć got turned into an æ, I'd say your file is in ISO-8859-2 but the database is reading it differently, most probably as ISO-8859-1. You should ensure that your database actually can accept all ISO-8859-2 characters for its columns. Read the MySQL manual on character set support and set some suitable default characterset (probably best on the database level) either to utf8 (preferred) or latin2. You might have to recreate your tables for this change to apply.

    Next, you should set the character set of the connection to match that of the file. So utf8 is definitely wrong here, and latin2 the way to go.

    • Using your current API, mysql_set_charset("latin2") can be used to accomplish that.
    • That page also describes equivalent approaches for use with other frontends.
    • As an alternative, you can use a query to set this: mysql_query("SET NAMES 'latin2';");

    After all this is done, you should also ensure that things are set up correctly for any script which reads from the database. In other words, the charset of the generated HTML must match the character_set_results of the MySQL session. Otherwise it might well be that things are stored correctly in your database but still appear broken when displayed to the user. If you have the choice, I'd say use utf8 in that case, as doing so makes it easier to include different data whenever the need arises.

    If some problems remain, you should pinpoint whether they are while reading from file into php, while exchanging data with mysql, or while presenting the result in HTML. The string "Petrovi\xc4\x87" is the utf8 representation of your example, and "Petrovi\xe6" is the latin2 form. You can use these strings to explicitely pass on data with a known encoding, or to check an incoming transferred value against one of these strings.