unicode character-encoding special-characters non-ascii-characters htmlspecialchars

Character Encoding Issue - Characters Being Replaced with Random Characters after Saving in Textarea


I'm working with a third-party company and I'm trying/hoping to determine the cause of a character encoding issue before I bring it up with them.

This company has a custom drag-and-drop editor for designing websites on their platform. Within the editor they have a Raw HTML widget that I can drag in and add my own content to. The problem is that when I copy HTML from someone's old website, using the inspector tool, and paste it into this widget of theirs, all of the apostrophes and double quotes get replaced with gibberish. I have the same issue when I first paste the content into Notepad, Notepad++ or Sublime Text and then paste it from there into their Raw HTML editor.

Here's a recording of the issue and a few examples: https://streamable.com/phwn2

Here are the known characters that get replaced and what they get replaced with:

  • ’ turns into â™

  • “ turns into âœ

  • ” turns into â

  • + turns into (a space)

  • Å turns into Ã…

  • " stays as "

  • ' stays as '

Does anyone see a pattern with these characters or know what could be the cause of these characters being replaced?


Solution

  • The website is probably encoded as UTF-8, while the company's editor may be interpreting the pasted bytes as something like Windows-1252. In your first example, the right single quote (’) is encoded in UTF-8 as the three bytes e2 80 99. When a program reads each of those bytes as Windows-1252, you get "Latin small letter a with circumflex" (e2, â), the euro sign (80, €; early versions of Windows-1252 left this position unassigned, which may be why nothing visible shows up there in your output) and "trade mark sign" (99, ™). I haven't checked the other transformations. If this is the problem, one workaround is to convert the copied text to the destination encoding with iconv before pasting it into the company's editor.
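
To check the theory against the other characters in the question, here is a minimal Python sketch. It assumes the editor really does read UTF-8 bytes as Windows-1252, which is not confirmed. The function names are made up for illustration, and `errors="ignore"` is my own choice so that bytes Python's cp1252 codec leaves unmapped (such as 9d) are dropped instead of raising an error.

```python
# Sketch only: assumes the editor decodes UTF-8 bytes as Windows-1252 (unconfirmed).

def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those same bytes as Windows-1252."""
    # errors="ignore" drops bytes that Python's cp1252 codec does not map (e.g. 0x9d).
    return text.encode("utf-8").decode("cp1252", errors="ignore")


def to_cp1252_bytes(text: str) -> bytes:
    """Hypothetical workaround helper: re-encode text for a Windows-1252 destination."""
    # Raises UnicodeEncodeError for characters Windows-1252 cannot represent.
    return text.encode("cp1252")


for ch in ["’", "“", "”", "Å"]:
    utf8_hex = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    print(f"{ch!r}: UTF-8 bytes {utf8_hex} -> {misread_utf8_as_cp1252(ch)!r}")
```

Under that assumption the four characters come out as â€™, â€œ, â€ and Ã…. That matches the strings in the question apart from the €, which seems to have been dropped or rendered invisibly somewhere along the way. The + turning into a space does not fit this pattern, so it likely has a separate cause. `to_cp1252_bytes` does roughly what an iconv conversion from UTF-8 to Windows-1252 would do; for example, ’ becomes the single byte 92.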