Find and replace non utf8 character

I have a process that inserts data into PDFs that eventually loads into a system that gets searched based on that inserted data. The inserted data looks something like:

<<
/IBM-ODIndexes
<< /Private
<<
  /DOB (05031983)
  /FULL_NAME (TEST USER)
  /YEAR (2020)
>>
/LastModified(D:20210112201530)
>>

However, there are instances where the data in the FULL_NAME field contains non UTF8 characters and then users are unable to search the data. Specifically apostrophes come over from Microsoft Word and then gets interpreted like this:

/FULL_NAME (JERRY OÃ<83>Â¢Ã¢â<80><9a>Â¬Ã¢â<80><9e>Â¢CONNELL)

In this case I am looking to strip out the apostrophe that is represented as Ã<83>Â¢Ã¢â<80><9a>Â¬Ã¢â<80><9e>Â¢ and replace it with a white space.

Solution

There are several complexities here, but in general I would say that the only reliable way to deal with it is to figure out the text encoding of the incoming document and converting it to the target encoding.

Ã<83>Â¢Ã¢â<80><9a>Â¬Ã¢â<80><9e>Â¢ is 34 characters (that is, at least 34 bytes), and no single encoding ever used that much space for a single character. What’s probably happening is multiple levels of encoding, such as HTML entities, base64, UTF-8/16/32 or escape characters like %% to represent % in SQL or \\ to represent \ in Bash. Reversing all these levels of encoding manually is going to involve quite a lot of reading the huge docx standard. The simpler alternative is to use a library which can just convert the entire text into a known character encoding for you, at which point you have to do at most a single conversion into UTF-8.

Another argument for this is that the “apostrophe string” does contain otherwise harmless characters like “a” and “e”. Without at least some understanding of the encodings you’re unlikely to be able to separate encoded characters from non-encoded ones, which would make the resulting text full of invalid text.