Tags: php, character-encoding, fpdf, extended-ascii

How do I use Extended ASCII characters in a PHP/PDF document generated by FPDF?


I am trying to create a document that contains Extended ASCII characters. For text coming from the client the following works:

// Convert from UTF-8 to ISO-8859-1 to handle Spanish characters
setlocale(LC_ALL, 'en_US.UTF-8');
$post = [];
foreach ($_POST as $key => $value) {
    $post[$key] = iconv("UTF-8", "ISO-8859-1", $value);
}

$pdf->Cell(0, 0, $post["Name"], 0, 1);

However, I can't get text in the PHP file to work. For example:

$name = "José";

I don't know what encoding the variable uses. As a result, I can't convert it to ISO-8859-1. The é gets mangled.
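One way to find out how a string literal is actually stored is to dump its bytes and check whether they form valid UTF-8. The following is a minimal sketch; `describeBytes` is a hypothetical helper name, and the two sample strings are written with explicit byte escapes so the result does not depend on how the editor saved the file.

```php
<?php
// Hypothetical helper: show the raw bytes of a string and whether they
// form valid UTF-8. preg_match() with the /u flag returns false when the
// subject is not valid UTF-8.
function describeBytes(string $s): string
{
    $hex = strtoupper(implode(' ', str_split(bin2hex($s), 2)));
    $validity = preg_match('//u', $s) === 1 ? 'valid UTF-8' : 'not valid UTF-8';
    return "$hex ($validity)";
}

echo describeBytes("Jos\xC3\xA9"), "\n"; // 4A 6F 73 C3 A9 (valid UTF-8)
echo describeBytes("Jos\xE9"), "\n";     // 4A 6F 73 E9 (not valid UTF-8)
```

If the é shows up as `C3 A9`, the file is saved as UTF-8 and the literal can go through the same `iconv` call as the client text. If it shows up as a single `E9`, the literal is already in ISO-8859-1 (or CP1252) and converting it again would mangle it.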

Edit: I am rewriting a program that generates PDF documents (some in Spanish). If I copy text from the existing PDFs, I get the following, which looks normal in the PDF document and in the IDE but can't be printed with FPDF using either CP1252 or ISO-8859-1 fonts:

$Name = "José"; // Jos\x65\xcc\x81 - I have no idea what encoding is used for the é

Changing the extended characters to UTF-8 solves the problem:

$Name = "José"; // Jos\xC3\xA9  - UTF-8

  1. Does anyone know what kind of encoding I am copying from the existing PDFs?
  2. Is there a way to convert it to UTF-8?
  3. Can users enter this stuff into a browser?

When I convert the UTF-8 encoded characters to ISO-8859-1 for output to FPDF, the PDF contains the three-character encoded version of the é.

2nd Edit: Unicode equivalence from Wikipedia

Unicode provides two notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

This is a long way of paraphrasing @smith's comment that I just need to get TCPDF or something else that will properly handle UTF-8. It should be noted that I am getting the error in PHP's iconv, so I am not entirely sure that it can be made to go away by switching to TCPDF.
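Since the conversion fails in iconv rather than in FPDF, one option is to compose the text before converting it. The bytes `\x65\xcc\x81` are a plain "e" followed by the combining acute accent U+0301 in UTF-8, which is the decomposed form of é, and ISO-8859-1 has no slot for a combining mark. The sketch below assumes the intl extension is loaded (it provides the `Normalizer` class); `toLatin1` is a hypothetical helper name, not part of FPDF.

```php
<?php
// Sketch only: toLatin1() is a hypothetical helper, not part of FPDF.
// Assumes the intl extension (Normalizer) and the iconv extension are loaded.
function toLatin1(string $utf8)
{
    // Compose "e" + U+0301 into the single code point U+00E9 (form NFC).
    $composed = Normalizer::normalize($utf8, Normalizer::FORM_C);
    if ($composed === false) {
        return false; // normalization failed, e.g. input was not valid UTF-8
    }
    // ISO-8859-1 can represent U+00E9 but not a combining accent.
    return iconv('UTF-8', 'ISO-8859-1', $composed);
}

$copied = "Jos\x65\xCC\x81"; // decomposed form, as copied from the PDF
$typed  = "Jos\xC3\xA9";     // precomposed form

var_dump($copied === $typed);                     // bool(false)
var_dump(toLatin1($copied) === toLatin1($typed)); // bool(true)
echo bin2hex(toLatin1($copied)), "\n";            // 4a6f73e9
```

With a helper like this, the output call would become something like `$pdf->Cell(0, 0, toLatin1($Name), 0, 1);`. Running the same normalization over the `$_POST` values would also cover browsers or input methods that submit decomposed text.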


Solution

  • It turns out that to use extended ASCII characters one needs to pick an encoding and use it throughout. In my case, I went with UTF-8 encoded characters and used them everywhere. My original problem stemmed from my mistake in copying text from a PDF document, which gave me the canonically equivalent but decomposed form of the é (an "e" followed by a combining acute accent) instead of the single precomposed character. Once I used precomposed UTF-8 encoded characters everywhere, my problems went away.