Tags: php, encoding, character-encoding, rtf

PHP .rtf encoding problem with Polish characters



I have a problem replacing Polish characters in an RTF file through PHP.
I want to find tag words in the RTF file content and replace them with the relevant content. Here is what I'm doing:
    // Getting rtf file content
    $content = file_get_contents('<link_to_file_here>');

    // encoding to utf-8
    $content = mb_convert_encoding($content, 'UTF-8');

    // replacing tagword with relevant content
    $content = str_replace('[company_address]', 'Częstochowa', $content);

    // save rtf file with replaced content
    file_put_contents('uploads/test.rtf', $content);
    
    echo $content; 

When I checked the RTF file content after running this code, I noticed that Częstochowa had been replaced with Cz\u0119stochowa.
Then I opened the newly created RTF file in MS Word and saw this: Częstochowa.
After this I decided to type Częstochowa manually into the RTF file and check what happens. I read the file content the same way (via file_get_contents) and noticed that MS Word had stored my manually typed Częstochowa as Cz\\'eastochowa. So I decided to do this:

    // replacing tagword with relevant content
    $content = str_replace('[company_address]', 'Cz\\\'eastochowa', $content);

After this I opened the file in MS Word and saw this: Czêstochowa
I googled a bit and found that ê is a character from the Unicode block “Latin-1 Supplement” (U+0080 to U+00FF) with code point U+00EA, but Polish characters such as ę are in the Unicode block “Latin Extended-A” (U+0100 to U+017F), so I need to encode the RTF file content for that block somehow.
I have tried a lot of things but still haven't solved the problem.
I hope you can help. Thanks for your attention.


Solution

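  • Found a solution: convert the string to HTML numeric entities (ę becomes &#281;), then replace each &# prefix with the RTF \u escape, which gives Cz\u281;stochowa: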

    $string = str_replace('&#', "\\u", mb_convert_encoding('Częstochowa', 'html'));
    $content = str_replace('[company_address]', $string, $content);
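A note for newer PHP versions: the HTML-ENTITIES encoding of mbstring is deprecated as of PHP 8.2, so the one-liner above may start emitting deprecation notices. Below is a minimal sketch of an alternative that builds the RTF escapes directly. The function name rtf_escape is hypothetical (it is not part of PHP or of the original answer), and the sketch assumes the RTF file uses the default \uc1 setting, meaning one fallback character follows each \uN. It converts the text to UTF-16 code units, because RTF's \uN takes a signed 16-bit decimal value, and it also escapes the backslash and braces that have special meaning in RTF.

```php
<?php
// Hypothetical helper (not from the original post): escape a UTF-8 string
// so it can be pasted into RTF content. Requires the mbstring extension.
function rtf_escape(string $text): string
{
    if ($text === '') {
        return '';
    }

    $out = '';
    // RTF's \uN works on 16-bit values, so split the text into UTF-16 code units.
    $units = unpack('n*', mb_convert_encoding($text, 'UTF-16BE', 'UTF-8'));

    foreach ($units as $unit) {
        if ($unit === 0x5C || $unit === 0x7B || $unit === 0x7D) {
            // Backslash and braces are RTF syntax, so prefix them with a backslash.
            $out .= '\\' . chr($unit);
        } elseif ($unit < 0x80) {
            // Plain ASCII passes through unchanged.
            $out .= chr($unit);
        } else {
            // \uN expects a signed 16-bit decimal; values above 32767 become negative.
            // The trailing "?" is the fallback character (assumes the default \uc1).
            $signed = $unit > 32767 ? $unit - 65536 : $unit;
            $out .= '\\u' . $signed . '?';
        }
    }

    return $out;
}

// Usage with the tag word from the question.
$content = 'Address: [company_address]';
$content = str_replace('[company_address]', rtf_escape('Częstochowa'), $content);
echo $content, PHP_EOL; // Address: Cz\u281?stochowa
```

Only the replacement text is escaped here, not the whole file, so the existing RTF control words in $content stay intact. The source file has to be saved as UTF-8 for the literal 'Częstochowa' to be read correctly.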