Search code examples
utf-8windows-1252codepage-437

Convert mangled characters back to UTF-8


Here is what I did:

  1. I dumped a SQLite database with UTF-8 data (sqlite3 example.db .dump > dump.sql), but since this was in powershell, I assume the piping converted it to windows-1252
  2. I loaded that dumped data into a new database, again using powershell (Get-Content dump.sql | sqlite3 example2.db)
  3. I dumped that new database and am left with a new .sql file (this time it was not through powershell - so I assume it was unmodified)

This new sql file's UTF-8 characters are seriously mangled, and I was wondering if there was a way to convert it back into correct UTF-8.

As a few examples, here are what some sequences are in the new file, and what they should be (all are viewed as UTF-8):

  1. ÒüéÒü¬ÒüƒÒü½ should be あなたに
  2. ´╝ü should be a full width exclamation mark
  3. Òé¡Òé╗Òé¡ should be キセキ

Does anyone have any idea as to how I might undo this mangling? Any method would be very helpful!

This is in powershell 7.0.1

Edit:

On further inspection, you can duplicate my predicament by redirecting any such data to a file in powershell (note that the data cannot itself be entered in powershell). Hence, setting up a script like this gives the same outcome:

test.sh

#!/bin/bash
echo "キ"

And then running wsl ./test.sh > test.txt will give an output of Òé¡, not

Edit 2:

It seems as if the codepage the UTF-8 text was converted to is almost 437: some characters are restored using this assumption (e.g. ), but others are not. If it's close to 437, but isn't, what could it be?


Solution

  • It turns out, since I am in the UK, the codepage I wanted was 850. Saving the file as 850 and then reloading it as UTF-8 fixed my issue!