Tags: java, text, character-encoding, mining

What are these mystery characters?


This might not be a programming question, but I could not find any answer for it on Google.

I am working on a text mining task and am currently cleaning the data. Far too often I come across mystery characters that are not in a readable format.

These characters are: &#x003b2 , &#x00025 and so on.

All of these start with the same pattern, so I believe they represent some encoding that Excel cannot display.

Is there any way to convert them? I need to know what exactly these characters mean in order to know if I should remove them or not.


Solution

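  • Those are probably Unicode characters written as HTML numeric character references in hexadecimal format: the digits after &#x are the character's Unicode code point in hex. A well-formed reference also ends with a semicolon (for example &#x3b2;), which seems to be missing in your data.

    • &#x003b2 is U+03B2, the "GREEK SMALL LETTER BETA" (β).
    • &#x00025 is U+0025, the "PERCENT SIGN" (%).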
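If you would rather decode these than delete them, here is a minimal sketch in Java (going by the question's tags). It assumes every sequence has the form &#x followed by one to six hex digits and an optional semicolon. The class name HexCharRefDecoder and the method decode are made up for this illustration and do not come from any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
public class HexCharRefDecoder {

    // "&#x" or "&#X", then 1 to 6 hex digits, then an optional ';'.
    // Assumption: the reference is not immediately followed by text that
    // itself starts with a hex digit (a-f, 0-9); without the ';' such a
    // case is ambiguous and would be swallowed into the number.
    private static final Pattern HEX_REF =
            Pattern.compile("&#[xX]([0-9a-fA-F]{1,6});?");

    public static String decode(String input) {
        Matcher m = HEX_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(input, last, m.start());
            int codePoint = Integer.parseInt(m.group(1), 16);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Outside the Unicode range: keep the original text as-is.
                out.append(m.group());
            }
            last = m.end();
        }
        // Copy whatever follows the last match.
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // The two sequences from the question.
        System.out.println(decode("&#x003b2 , &#x00025"));
    }
}
```

Since β and % are real content (a Greek letter and a percent sign), decoding is probably safer than stripping them. If your data also contains named entities such as &amp;, this sketch leaves them untouched and you would need a fuller HTML unescaper for those.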