Search code examples
phppostgresqlutf-8character-encodingpervasive

Encoding error when reading data from one DB and inserting to another with PHP


With PHP, I am trying to read data from Pervasive DB v9.5 and insert it to PostgreSQL 9.3 (encoding: UTF-8) on Windows 2008. I did not choose or code PervasiveDB (I am just reading data from it). With ODBC I read data from Pervasive and write it to console with no problem. However when I try to insert it to Postgre I encounter with

Warning: pg_execute(): Query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0x94 in file.php on line ..

So, I saw that Postgres did not like the string I gave.

Then I use

var_dump(iconv_get_encoding('all'));

and see that my encoding is ISO-8859-1

and modify the string with

iconv ( 'ISO-8859-1' , 'UTF-8' , $a)

Now, the error is gone. However the string which reached to Postgres is not correct.

The code I used is below. And my test string is aöaçaşaıağaüaÖaÇaŞaİaĞaÜ

$a is the string which comes from Pervasive

echo $a; 

gives aöaçaşaıağaüaÖaÇaŞaİaĞaÜ

echo iconv ( 'ISO-8859-1' , 'UTF-8' , $a)

gives a┬öa┬ça┬şa┬ıa┬ğa┬üa┬Öa┬Ça┬Şa┬İa┬Ğa┬Ü

<?php
//var_dump(iconv_get_encoding('all'));

$conn = pg_connect("host=localhost port=5432 dbname=xxx user=xxx password=".$argv[1]);

$result = pg_prepare($conn, "my_query", 'SELECT * FROM func_my_deneme($1)');

$connect_string = "DRIVER={Pervasive ODBC Client Interface}; SERVERNAME=localhost; SERVERDSN=xxx;";
$pervasiveconn = odbc_connect($connect_string, 'xxx', 'xxx');

$pervasive_result = odbc_exec($pervasiveconn ,"SELECT something");

while(odbc_fetch_row($pervasive_result)){
  $a=odbc_result($pervasive_result,1);

  echo $a;

  $result = pg_execute($conn, "my_query", array(iconv ( 'ISO-8859-1' , 'UTF-8' , $a)));
}
?>

Solution

  • You only seem to be looking at one of the two encoding exchanges here.

    You have:

    (pervasive's native encoding) -> (PHP string)
    

    and

    (PHP string) -> (PostgreSQL)
    

    Of these, you're only explicitly handling the second. You're assuming that the data Pervasive's ODBC driver returns is in PHP's default encoding, which on your system is iso-8859-1.

    Your tests suggest that assumption may be correct, but simply echo'ing the string isn't a good way to tell, because that introduces another encoding step:

    (PHP string) -> (whatever decodes it for viewing)
    

    be that a web browser, terminal or whatever. If the viewer expects some encoding that happens to be the same as Pervasive is using it will decode the output corectly.

    Try:

    echo $a;
    echo "aöaçaşaıağaüaÖaÇaŞaİaĞaÜ";
    

    and make sure the viewer shows the same value for both. Make sure you edit your source file with the encoding set to iso-8859-1, not some other encoding, so that the literal bytes of the string you paste are correct.

    At that point you should get an error if your editor is set correctly because not all those characters are legal in iso-8859-1. The first invalid one is ş.

    So clearly what's coming from Pervasive can't be iso-8859-1. To really print a latin-1 string, you can echo the escaped bytes. For example, this string:

    aöaçaaaüaÖaÇaaaaÜ
    

    in which all chars are legal iso-8859-1, is printed in iso-8859-1 encoding with:

    echo "a\xf6a\xe7aaa\xfca\xd6a\xc7aaaa\xdc"
    

    Here, hex escapes are used to specify non-7-bit characters to unambiguously ensure that the encoding of the byte sequence is what you think without any confusion about text editors etc.

    Betcha that doesn't print right when you view it, because whatever's reading the input isn't decoding it as iso-8859-1.


    What you should be doing is looking at the bytes of the string you get from Pervasive to see what it really is. Then determining its encoding and decoding it into utf-8, which you can then send to PostgreSQL over a client_encoding = utf-8 connection. @deceze suggested bin2hex for this (I don't speak PHP, so didn't know what to suggest). So show the output of:

    echo bin2hex($a) . "\n";
    

    Or - even better - make sure you determine from the configuration / documentation what the encoding of the data coming from Pervasive is, rather than guessing. Or just force it.

    A quick look at the Pervasive documentation showed that the ODBC Driver has an encoding parameter that takes the code page ID for the desired encoding. So try:

    $connect_string = "DRIVER={Pervasive ODBC Client Interface}; SERVERNAME=localhost; SERVERDSN=xxx; encoding=65001";
    

    (Microsoft, at least, defines 65001 as the codepage for utf-8 per this doc).