Search code examples
azureencodingdatabricks

Character Encoding Issue in Creating a Persistent View with Non-ASCII (Greek Characters) in Databricks


I have tried to create a persistent view with a simple select statement that just has a word with greek characters.

CREATE OR REPLACE VIEW misc.XX AS 
SELECT 'ΤΕΣΤ'

If I try to select that view, I get ???? instead of the word I typed.

Select Statement

The issue seems to be that the view definition is not saved correctly.

View Details

Are there any workarounds here? The problem happens only with views, when writing as table, characters appear as expected.


Solution

  • The issue was that we were storing view in the Hive Metastore (HMS). It's important to note that HMS in Databricks has a limitation: it cannot accommodate non-ASCII characters within view definitions.

    However, if you opt for Unity Catalog, this limitation no longer applies, and you can freely work with non-ASCII characters in your view definitions.


    For those constrained by the use of HMS, there is a potential workaround available. You can store text in a hex-encoded format using UTF-8 encoding.

    1. Start by encoding the text in a hex format using UTF-8 encoding. To do this, you can use the following SQL command:
    SELECT hex(encode("ΤΕΣΤ", "UTF-8")); --> CEA4CE95CEA3CEA4
    
    1. Next, take the output from the command above and save it within a view definition. This can be achieved by creating or replacing a view such as:
    CREATE OR REPLACE VIEW misc.TEST AS
    SELECT decode(unhex('CEA4CE95CEA3CEA4'), 'UTF-8') AS col1;
    
    1. You can now query the view and get non-ASCII characters.