Tags: postgresql, utf-8, character-encoding

How to diagnose a character encoding issue


I am having trouble identifying an issue that seems to be related to strange characters that exist in a Postgres database. I'm pulling data out of Postgres using Java and loading it into BigQuery. Occasionally I've noticed that some of the values seem to get changed during the process for no apparent reason. Upon closer examination, I've found that in all cases the issue seems to be caused by characters that are not what I would consider normal.

The Postgres database encoding is UTF-8. Java encoding is also UTF-8.

Here is an example of what I'm seeing:

I have a text field that contains this value: SÅ‚awomir.

If I run this SQL:
select length('SÅ‚awomir')

I get a value of 9, which looks correct. However, if I export that string to a text file and look at it in a hex editor (in my case Visual Studio Code with the Hex Editor extension), the string appears to be 11 bytes long, not 9. Upon closer examination, the second and third characters are each represented by two hex bytes rather than just one like the other characters. Together, those second and third characters are represented by the following four hex bytes:

C3 85 C2 82

Here's a screenshot of the hex editor showing those characters. As you can see, the string occupies 11 bytes, not 9:

[screenshot of the hex editor]
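For reference, both counts can be reproduced in Java without a hex editor by decoding the 11 bytes from the file. A minimal sketch (the byte values are the ones visible in the hex editor):

import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) {
        // The 11 bytes from the hex editor: "S", then C3 85 C2 82, then "awomir"
        byte[] fileBytes = {
            0x53, (byte) 0xC3, (byte) 0x85, (byte) 0xC2, (byte) 0x82,
            0x61, 0x77, 0x6F, 0x6D, 0x69, 0x72
        };

        // Decoding as UTF-8 yields 9 characters, the same count Postgres reports
        String s = new String(fileBytes, StandardCharsets.UTF_8);
        System.out.println("characters: " + s.length());        // 9
        System.out.println("UTF-8 bytes: " + fileBytes.length); // 11

        // C3 85 decodes to U+00C5 (Å); C2 82 decodes to U+0082,
        // an invisible control character
        s.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}

So the bytes are valid UTF-8: length() in Postgres counts the 9 characters, while the hex editor shows the 11 bytes of their UTF-8 encoding.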

Please help me understand what those characters are and what I can do about them. Are they valid UTF-8 characters? If so, why would they get converted by a Java program, and how can I stop this from happening?

Update 2023-10-31: Thank you @Laurenz Albe for your response. It makes good sense as an explanation for what happened (and how to prevent it in the future), but I'm not sure it fully addresses my issue because I don't have the ability to control the upstream process that inserts data into the database.

I have a few more details that are relevant:

We use Google Datastream to pull the data from Postgres and move it to BigQuery. When the data arrives in BigQuery, it looks exactly the same as it did in Postgres (which is what I want). The issue actually occurs when I use Java (JDBC) to pull that value out of BigQuery and then insert it into another BigQuery table.

I don't do this as a single insert statement like "Insert into... select from..". In that case, the data would never leave BigQuery. What I do is fetch the data first and assign the results to a Java variable. Then in a second step I insert that value back into another BigQuery table. When I do this, the data in the destination table gets changed slightly, so I'm trying to figure out how to prevent that from happening.
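For reference, the copy step looks roughly like this. This is only a sketch; the connection URL, dataset, table, and column names are hypothetical stand-ins, assuming a JDBC connection to BigQuery:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class CopyValue {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:bigquery://...")) {

            // Step 1: fetch the value into a Java variable
            String value;
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT name FROM dataset.source_table LIMIT 1")) {
                rs.next();
                value = rs.getString(1); // a Java String; no bytes are lost here
            }

            // Step 2: insert it into the destination table.
            // A parameterized insert leaves the encoding entirely to the driver.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO dataset.target_table (name) VALUES (?)")) {
                ps.setString(1, value);
                ps.executeUpdate();
            }
        }
    }
}

Done with getString()/setString() as above, the value should round-trip unchanged; if the insert is instead built by concatenating the value into the SQL text, or if the value passes through a file or stream that uses a non-UTF-8 default encoding, that is where a byte-level conversion can sneak in.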

Here is an example of the original value and the value after being moved to another table:

[screenshot comparing the original value and the value after the move]

Here is a screenshot of that same value in my hex viewer:

[screenshot of the changed value in the hex viewer]

As you can see, the value has been changed somewhat: where the bytes C3 85 C2 82 used to be, there are now five bytes, C3 85 E2 80 9A.

So my issue is really: how can I preserve the original value? Something in the process of pulling the data into Java and putting it back into BigQuery seems to be changing it. My Java environment is configured to use UTF-8 encoding, so I'm a bit puzzled about where the change comes from.
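One thing worth double-checking is what the JVM actually treats as its default charset; "configured to use UTF-8" can mean different things, and any byte-to-string conversion that doesn't name a charset explicitly falls back to this default. A quick check:

import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // The charset used when none is given explicitly, e.g. new String(bytes).
        // Before Java 18, this depends on the OS locale unless -Dfile.encoding is set.
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
    }
}

If this prints something other than UTF-8, every byte/string conversion in the pipeline that omits the charset is a candidate for the mutation.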


Solution

  • This is a case of “double encoding”.

    The original string must have been “Sławomir”. The second letter (ł) is encoded as the two bytes C5 82 in UTF-8.

    Now when that UTF-8 encoded string was inserted into the database, somebody set the PostgreSQL client encoding to a single-byte encoding, probably LATIN-1. As a consequence, PostgreSQL interpreted the two bytes as individual characters: C5 is “Å”, and 82 is an unprintable character, a control character with the name “break permitted here”.

    PostgreSQL converted these two characters to the server encoding UTF-8, which turned them into the four bytes you observe. Each of the characters is represented by two bytes in UTF-8.
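    This first accident is easy to reproduce in Java by decoding UTF-8 bytes with the wrong charset, which is effectively what the mis-set client encoding did. A sketch, using ISO-8859-1 to stand in for LATIN-1:

    import java.nio.charset.StandardCharsets;

    public class FirstMangling {
        public static void main(String[] args) {
            // Correct UTF-8 bytes for "Sławomir"; ł (U+0142) encodes as C5 82
            byte[] original = "S\u0142awomir".getBytes(StandardCharsets.UTF_8);

            // The import read those 9 bytes as LATIN-1, one character per byte:
            // C5 became Å (U+00C5) and 82 became U+0082 ("break permitted here")
            String misread = new String(original, StandardCharsets.ISO_8859_1);

            // The server stored those 9 characters in UTF-8: 11 bytes,
            // with C3 85 C2 82 where C5 82 used to be
            for (byte b : misread.getBytes(StandardCharsets.UTF_8)) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }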

    During the conversion to BigQuery, the “break permitted here” character gets converted to a “curly quote” (‚, U+201A). What must have happened is this (simulated in the sketch after the list):

    • the data get extracted with client encoding LATIN-1, so they come out as the bytes C5 82

    • the data get inserted into BigQuery with a client encoding of WINDOWS-1252, where 82 means the curly quote

    • the BigQuery server converts the curly quote to UTF-8 and ends up with E2 80 9A
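    The same trick simulates this second stage, since in WINDOWS-1252 the byte 82 means the curly quote. A sketch:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class SecondMangling {
        public static void main(String[] args) {
            // The two bytes produced by the LATIN-1 extraction of Å + U+0082
            byte[] extracted = {(byte) 0xC5, (byte) 0x82};

            // Read as WINDOWS-1252: C5 is still Å, but 82 is now ‚ (U+201A)
            String misread = new String(extracted, Charset.forName("windows-1252"));

            // Stored as UTF-8 this becomes C3 85 E2 80 9A,
            // the five bytes observed in the destination table
            for (byte b : misread.getBytes(StandardCharsets.UTF_8)) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }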


    To sum up:

    • the original data in your PostgreSQL database are already broken, because client_encoding was set to LATIN1 instead of UTF8 when the data were imported

    • during transfer to BigQuery, the data were further mutilated, because the BigQuery client encoding was set to WINDOWS-1252
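    Both conversions are deterministic, so in principle they can be undone in reverse order. The following is only a sketch, under the assumption that a value was mangled exactly as described above; values that were imported correctly must not be run through it:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class Repair {
        public static void main(String[] args) {
            // The doubly mangled value as it appears in the destination table:
            // S, Å (U+00C5), ‚ (U+201A), a, w, o, m, i, r
            String mangled = "S\u00C5\u201Aawomir";

            // Undo stage 2: recover the bytes the WINDOWS-1252 client saw (C5 82 ...)
            // and reinterpret them as LATIN-1, giving the value stored in Postgres
            String asInPostgres = new String(
                    mangled.getBytes(Charset.forName("windows-1252")),
                    StandardCharsets.ISO_8859_1);

            // Undo stage 1: recover the original LATIN-1 bytes and decode them
            // as the UTF-8 they really were
            String repaired = new String(
                    asInPostgres.getBytes(StandardCharsets.ISO_8859_1),
                    StandardCharsets.UTF_8);

            System.out.println(repaired); // Sławomir
        }
    }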