Search code examples
sql-server-2005unicodeasp-classiccollation

Japanese/Chinese language data in SQL Server table


So I've got an interesting problem that I need help with faster than I can get my skills with SQL Server up to par.

We have a table that contains a bunch of text, all of it in different languages. Most of this data appears correctly in the browser, however, anything in Chinese or Japanese gets completely mangled by the browser.

This is an ASP.old app that we're using to display the data which comes from a server running MS SQL Server 2005.

Before, we had this same problem and we solved it by changing the encoding in the ASP pages. These files have not changed since we did that, but the problem has resurfaced. Thus I must conclude that the problem rests with the database since that's the only thing that has been updated since we last fixed it.

So far I've been trying to look into collation, but I'm nowhere near an expert on SQL so it's been difficult.

I can provide more info if needed, anything that will help someone get me to the answer, short of URL's (confidentiality and all).

If anyone has any ideas, I would appreciate it very much.

ADDITIONAL INFO:

-column type is 'ntext'


Solution

  • Collation only affects sort order, not encoding. You need to determine what the encoding of your chinese and japenese content is (see this). If it is not UCS-2, you have a problem (since you cannot support multiple page encodings concurrently). If it is UCS-2, you need to make sure that the encoding of your ASP page is also set to UTF-8 (and that the browser recognizes that by correctly setting the encoding to UTF-8 - see View/Encoding).

    Or in simpler terms: if the application that created the content did not use Unicode characters, you will have to switch page encoding if you switch between Chinese, Japanese, and European characters.

    If you have correctly encoded Unicode content in your database, and you use UTF-8 encoding on your pages, you should not have a problem with displaying any special characters (as long as you use a Unicode font on the page):

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    

    I realize that desite several edits I am not being very clear, so let me add some basics.

    A character set is a standardized representation of a set of characters (e.g. ASCII, UNICODE, ...).

    Character encoding is the binary representation used to store characters of a given character set. ASCII has its own encoding. Unicode, which is a very large character set designed to support all characters in existance, has several encodings (UTF-8, UTF-16, UCS-2, ...).

    Only Unicode gives you the ability to support Western and Far eastern content at the same time with the same database and application settings. There are, however, older character sets for the Chinese and Japenese language that are not Unicode. If your content is not Unicode (BIG 5, for instance), you cannot display it on a UTF-8 encoded web page.

    This can become tricky if the application that created the content used one encoding (e.g. BIG-5) and the database stored it as Unicode data. If this happens, information could have been lost.

    You even have to install the corresponding language packs in Windows in order to see the characters correctly. Unfortunately, encoding issues are not simple to diagnose.