
Java: sanitize a String by removing or escaping all non-language characters, for various languages such as Chinese, Spanish, etc.


I am trying to sanitize a string in Java that comes from a comment box. I want to remove special characters and anything unusual, such as emoji. The challenge is that the comment can be written in several languages, such as Chinese, Japanese, Spanish, or English. Does anyone know of a library or method to achieve this? Thanks in advance.

Here is an example of the URL-encoded value: commentText=Thanks+for+your+review%2C+Francesco+%F0%9F%AB%B6

This is the part that I would like to remove: %F0%9F%AB%B6


Solution

  • I'll answer my own question in case someone finds it useful. I solved this using a regular expression:

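    // Matches every character that is NOT a letter, number, punctuation mark, or separator.
    String regex = "[^\\p{L}\\p{N}\\p{P}\\p{Z}]";
    String comment = "text to sanitize";
    // Strings are immutable, so keep the value returned by replaceAll.
    String sanitized = comment.replaceAll(regex, "");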
    

    Regex explanation:

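    • \p{L} – letters from any language
    • \p{N} – numbers
    • \p{P} – punctuation
    • \p{Z} – separators (spaces, line separators, paragraph separators); tabs and newlines are control characters (\p{Cc}) and are not covered
    • ^ – negation inside the character class, so the pattern matches every character outside these categories and replaceAll removes it; the listed categories are effectively whitelisted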
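
To tie the regular expression back to the URL-encoded value from the question, here is a minimal sketch that decodes the parameter and then applies the same pattern. The class name `CommentSanitizer` and the methods `sanitize` and `decodeAndSanitize` are hypothetical names chosen for illustration. The sketch assumes Java 10 or later (for the `URLDecoder.decode(String, Charset)` overload) and that the value is UTF-8 encoded. If a web framework already decodes request parameters, the decoding step would not be needed.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class CommentSanitizer {

    // Same pattern as above, compiled once: matches anything that is not
    // a letter, number, punctuation mark, or separator.
    private static final Pattern DISALLOWED =
            Pattern.compile("[^\\p{L}\\p{N}\\p{P}\\p{Z}]");

    // Removes every disallowed character from an already-decoded comment.
    static String sanitize(String comment) {
        return DISALLOWED.matcher(comment).replaceAll("");
    }

    // Assumption: the input is a UTF-8, URL-encoded form value.
    static String decodeAndSanitize(String encoded) {
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        return sanitize(decoded);
    }

    public static void main(String[] args) {
        String raw = "Thanks+for+your+review%2C+Francesco+%F0%9F%AB%B6";
        // Prints: [Thanks for your review, Francesco ]
        System.out.println("[" + decodeAndSanitize(raw) + "]");
    }
}
```

The space that preceded the emoji is kept, so calling `trim()` on the result may be useful. Combining marks (`\p{M}`) are not in the allowed set, so text stored in decomposed form could lose its accents; normalizing to NFC with `java.text.Normalizer` first, or adding `\p{M}` to the class, are possible adjustments depending on the languages involved.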