Search code examples
sql-servernon-unicode

How to use ORDER BY, LOWER in SQL SERVER 2008 with non-unicode data


The question is about Armenian. I'm using sql server 2005, collation SQL_Latin1_General_CP1_CI_AS, data mostly is in Armenian and we can't use unicode.

I tested on ms sql 2008 with a windows collation for armenian language ( Cyrillic_General_100_ ), I have found here, ( http://msdn.microsoft.com/en-us/library/ms188046.aspx ) but it didn't help.

I have a function, that orders hex values and a lower function, which takes each char in each string and converts it to it's lower form, but it's not acceptable solution, it works really slow, calling that functions on every column of a huge table.

Is there any solution for this issue not using unicode and not working with hex values manually?

UPDATE:

On the left side are mixed case words, sorted in the right order and with lower case representations on the right side. Hope this will help. Thank You. Words are written in unicode.

  1. ԱբԳդԵզ -> աբգդեզ

  2. ԱգԳսԴԼ -> ագգսդլ

  3. ԲաԴֆդԴ -> բադֆդդ

  4. ԳԳԼասա -> գգլասա

  5. ԴմմլօՏ -> դմմլօտ

  6. ԵլԲնՆն -> ելբննն

  7. ԶՎլուտ -> զվլուտ

  8. էԹփձջՐ -> էթփձջր

  9. ԸխԾդսՂ -> ըխծդսղ

  10. ԹԶէըԿր -> թզէըկր


Solution

  • One solution would be to create a computed column for each text column which converts the value into Armenian collation and sets it to lower case like so:

    Alter Table TableName
        Add TextValueArmenian As ( LOWER(TextColumn COLLATE Latin1_General_CI_AS) ) PERSISTED
    

    Once you do this, you can put indexes on these columns and query for them.

    If that isn't your flavor of tea, then another solution would be an indexed view where you create a view with SCHEMABINDING that casts each of the various columns to lower case and to the right collation and then put indexes on that view.

    EDIT I notice in your examples, that your are using a Case-insensitive, Accent-sensitive. Perhaps the simple solution to your ordering issues would be to use Latin1_General_CS_AS or Cyrillic_General_100_CS_AS if available.

    EDIT

    Whew. After quite a bit of research, I think I have an answer which unfortunately may not be you will want. First, yes I can copy the text you provided into code or something like Notepad++ because StackOverflow is encoded using UTF-8 and Armenian will fit into UTF-8. Second, this hints at what you are trying to achieve: storing UTF-8 in SQL Server. Unfortunately, SQL Server 2008 (or any prior version) does not natively support UTF-8. In order to store data in UTF-8, you have a handful of choices:

    1. Store it in binary and convert it to UTF-8 on the client (which pretty much eliminates any sorting done on the server)
    2. Store it in Unicode and convert it to UTF-8 on the client. It should be noted that the SQL Server driver will already convert most strings to Unicode and your example does work fine in Unicode. The obvious downside is that it eats up twice the space.
    3. Create a CLR user defined type in SQL Server to store UTF-8 values. Microsoft provides a sample that comes with SQL Server to do just this. You can download the samples from CodePlex from here. You can also find more information on the sample in this article in the Books Online. The downside is that you have to have the CLR enabled in SQL Server and I'm not sure how well it will perform.

    Now, that said, I was able to get you sample working with no problem using Unicode in SQL Server.

    If object_id('tempdb..#Test') Is Not Null
        Drop Table #Test
    GO
    Create Table #Test
    (
        EntrySort int identity(1,1) not null
        , ProperSort int 
        , MixedCase nvarchar(50)
        , Lowercase nvarchar(50)
    )
    GO
    Insert #Test(ProperSort, MixedCase, Lowercase)
    Select 1, N'ԱբԳդԵզ',N'աբգդեզ'
    Union All Select 6, N'ԵլԲնՆն',N'ելբննն'
    Union All Select 2, N'ԱգԳսԴԼ',N'ագգսդլ'
    Union All Select 3, N'ԲաԴֆդԴ',N'բադֆդդ'
    Union All Select 4, N'ԳԳԼասա',N'գգլասա'
    Union All Select 5, N'ԴմմլօՏ',N'դմմլօտ'
    Union All Select 9, N'ԸխԾդսՂ',N'ըխծդսղ'
    Union All Select 7, N'ԶՎլուտ',N'զվլուտ'
    Union All Select 10, N'ԹԶէըԿր',N'թզէըկր'
    Union All Select 8,N'էԹփձջՐ',N'էթփձջր'
    
    Select * From #Test Order by ProperSort
    Select * From #Test Order by Lowercase
    Select * From #Test Order by Lower(MixedCase)
    

    All three of these queries return the same result.