You all did such an amazing job answering a question earlier that I thought I'd ask this one before I get too deep into my conversion, only to find out I did something wrong. The website I'm making for myself has only 3 pages. It has forms and a sqli db. I was told to use UTF-8 (I partially did, but not fully), lol. OK, sounds cool. Now I want to fix it to be 100% UTF-8 aware, but I have already written about 1,900 lines of code in PHP, JS, and HTML without using multibyte functions. So here's my question. In my conversion I have done this (snippets of code from various places):
PHP
date_default_timezone_set('America/Toronto'); // sets the timezone to Eastern Time (Toronto)
HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
etc
SQL
(from cPanel interface) MySQL connection collation: utf8_general_ci
SQL DB (still in pre utf-8 mode)
username varchar(50) latin1_general_cs
companyname varchar(50) latin1_swedish_ci
fname varchar(25) latin1_swedish_ci
I have NO valuable data in the tables. I will be changing those to one of the following (I'm not sure which one however)...
utf8_general_ci or utf8_unicode_ci
While I would like to make the site available for foreign people, it's not a high priority BUT, since I'm doing it UTF-8 style it's probably already going to work for foreign languages.
My questions are...
1) I set my timezone, but I didn't set my locale in PHP because I have never done that. Do I need to do that? How do I do that for my Toronto/Canada location?
2) Is setting each page via the meta tag enough to make the entire page UTF-8?
3) By using the meta tag, does that mean all my form fields are already being input as UTF-8 data? If not, how do I change it so they are?
4) Which one do I use for my DB: utf8_general_ci or utf8_unicode_ci?
5) I NEED certain things to be case sensitive. I only see ci for utf8. Is this because "Dave" is different from "dave", so using multibyte compares automatically compares case?
6) My DB currently has, say, 50 characters of storage for ASCII stuff. I assume that by switching to UTF-8 in the DB, 50 will be fine for English people like myself, but if some foreign person comes along and enters a bunch of weird symbols, would I need to increase my storage by x4 to accommodate all the extra bytes for Unicode? I don't mind using up more storage, but I'm curious what the proper way to allocate this would be. And since it's a VARCHAR(50), would it really matter anyway? If the name is "Dave" it would be 4 characters. If it was some foreign name, "Dave" in symbols might be 12 characters! lol. So, if I allocate say 100 to the username field, that should do, since it's unlikely ALL characters would be 4 bytes. Or, just set it to x4 what I would for English and make them all VARCHARs to save space. When they enter data on the form I'll be using MB_LENGTH functions (I forget the exact function), so I would still be able to control how many characters would be input.
7) How can I test my Unicode website? I have never used anything other than beautiful English :) lol. How can I switch my browser to pretend like I'm from somewhere else and enter a pile of codes to see if my functions work once I rewrite them to use mb_ (multibyte) functions? Or is there nothing to switch over, and I just type in ALT 245 or something and I get symbols? I don't know how to enter foreign test characters! It would suck to get English working only to have all foreign customers not able to enter a password because I didn't test my website enough :)
8) I know to use certain functions ctype, mb_ to handle unicode compares, strings, etc. Any surprises in store for me? Things that don't work as they should?
Yes... I'm wordy! :) I use Dreamweaver CS3 but that shouldn't matter. There are no UTF-8 characters embedded in my actual files.
Awaiting all your wisdom...
I'll start with some of the answers:
2) Your server should also send headers that indicate that the content sent is in UTF-8:
header('Content-Type: text/html; charset=UTF-8');
3) Browsers will send their data in UTF-8, yes. But hackers may not, so you should also pass the UTF-8 charset to htmlentities and similar HTML-encoding functions (see example exploit)
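To make point 3 a bit more concrete, here is a minimal sketch of what "give the UTF-8 charset" can look like. It assumes the mbstring extension is loaded; the helper names clean_utf8_input() and h() are made up for this example and are not part of PHP.

```php
<?php
// Hypothetical helpers for illustration only; the names are not from any library.

// Return the value only if it is a well-formed UTF-8 string, otherwise null.
function clean_utf8_input($value)
{
    if (!is_string($value) || !mb_check_encoding($value, 'UTF-8')) {
        return null;
    }
    return $value;
}

// Escape a string for HTML output, naming the charset explicitly
// instead of relying on the default.
function h($value)
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$good = "Zo\xC3\xAB <b>"; // "Zoë <b>" as valid UTF-8 bytes
$bad  = "Zo\xEB";         // Latin-1 byte for "ë": not valid UTF-8

var_dump(clean_utf8_input($good) === $good); // bool(true)
var_dump(clean_utf8_input($bad));            // NULL
echo h($good), "\n";                         // Zoë &lt;b&gt;
```

The idea is to reject anything that is not valid UTF-8 as soon as it arrives, and to name the charset on every escaping call on the way out.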
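5) A case-insensitive collation only means that comparisons, for example in a WHERE clause, ignore case; the stored text keeps its case. If you need case-sensitive comparisons, MySQL also offers the binary collation utf8_bin.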
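6) Actually, you don't need to multiply anything: in MySQL the length of a VARCHAR counts characters, not bytes, so VARCHAR(50) holds 50 characters whether they are ASCII or not. Only the number of bytes stored differs ("Dave" is 4 chars, 4 bytes; "ǝʌɐp" is 4 chars, 7 bytes in UTF-8).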
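The difference between characters and bytes can be checked directly in PHP. This is a small sketch that assumes the mbstring extension is available; fits_column() is a hypothetical helper, and the limit of 50 simply mirrors the VARCHAR(50) from the question.

```php
<?php
// strlen() counts bytes; mb_strlen() with an explicit encoding counts characters.

$ascii  = "Dave";
$turned = "\xC7\x9D\xCA\x8C\xC9\x90p"; // "ǝʌɐp" written as UTF-8 bytes

var_dump(strlen($ascii));               // int(4)
var_dump(mb_strlen($ascii, 'UTF-8'));   // int(4)
var_dump(strlen($turned));              // int(7)
var_dump(mb_strlen($turned, 'UTF-8'));  // int(4)

// Hypothetical helper: does the value fit a column limited to $maxChars characters?
function fits_column($value, $maxChars)
{
    return mb_strlen($value, 'UTF-8') <= $maxChars;
}

var_dump(fits_column($turned, 50));     // bool(true)
```

So a form-side length check should use the character count (mb_strlen), not the byte count (strlen).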