c#character-encoding special-characters latin mojibake

Converting special charactes such as Ã¼ and Ãƒ back to their original, latin alphbet counterparts in C#

I have been given an export from a MySQL database that seems to have had it's encoding muddled somewhat over time and contains a mix of HTML char codes such as & uuml; and more problematic characters representing the same letters such as Ã¼ and Ãƒ. It is my task to to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ú and ó.

An example of the sort of string I am dealing with is

DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen

Which should equate to

50 Tattoo Desinfektionsl ö    sungst ü    cher f ü    r Fl ä    chen 
50 Tattoo Desinfektionsl ÃƒÂ¶ sungst ÃƒÂ¼ cher f ÃƒÂ¼ r Fl ÃƒÂ¤ chen

Is there a method available in C#/.Net 4.5 that would successfully re-encode the likes of Ã¼ and Ãƒ to UTF-8?

Else what approach would be advisable?

Also is the paragraph character ¶ in the above example string an actual paragraph character or part of some other character combination?

I have created a lookup table in the case of needing to do find and replace which is below, however I am unsure as to how complete it is.

Ã‰ -> É
â€œ -> "
â€ -> "
Ã‡ -> Ç
Ãƒ -> Ã
Ã©, 'é
Ã  -> À
Ãº -> ú
â€¢ -> -
Ã˜ -> Ø
Ãµ -> õ
Ã -> í
Ã¢ -> â
Ã£ -> ã
Ãª -> ê
Ã¡ -> á
Ã© -> é
Ã³ -> ó
â€“ -> –
Ã§ -> ç
Âª -> ª
Âº -> º
Ã  -> à

Solution

Well, first of all, as the data has been decoded using the wrong encoding, it's likely that some of the characters are impossible to recover. It looks like it's UTF-8 data that incorrectly decoded using an 8-bit encoding.

There is no built in method to recover data like this, because it's not something that you normally do. There is no reliable way to decode the data, because it's already broken.

What you can try, is to encode the data, and decode it using the wrong encoding again, just the other way around:

byte[] data = Encoding.Default.GetBytes(input);
string output = Encoding.UTF8.GetString(data);

The Encoding.Default uses the current ANSI encoding for your system. You can try some different encodings there and see which one gives the best result.