java unicode utf-8 aws-sdk amazon-sqs

Remove invalid characters from message sent to AWS/Amazon SQS


Context: Amazon SQS restricts the ranges of characters it will accept in a message passed as the argument to sqsClient.sendMessage(...). (Mentioned here).

Excerpt from the above link:

A message can include only XML, JSON, and unformatted text. The following Unicode characters are allowed:

#x9 | #xA | #xD | #x20 to #xD7FF | #xE000 to #xFFFD | #x10000 to #x10FFFF

Any characters not included in this list will be rejected.

Question: We know the JSON we send as the message body contains offending characters, so for now we filter them out with message_json.replaceAll("\uffff", ""); and this works fine (where '\uffff' is the Java representation of the U+FFFF character).

However, instead of handling only U+FFFF, I want to strip every character that falls outside the allowed ranges mentioned above (#x9 | #xA | #xD | #x20 to #xD7FF | #xE000 to #xFFFD | #x10000 to #x10FFFF). How do I construct a pattern that takes ranges of characters, without running replace on each one?


Solution

  • Actually, the answer was right in front of me. For some reason, I had assumed that a regex character class would not accept escaped characters such as \ufffd or \uffff. It does, so a range can go straight into the call: message_json.replaceAll("[\ufffd-\uffff]", " ");

    This works for my case.
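
For the general case in the question, one possible approach is a negated character class that lists the allowed ranges from the SQS excerpt and removes everything else. The sketch below is not taken from the AWS SDK. The class name SqsSanitizer and the method sanitize are hypothetical names chosen for illustration. It relies on java.util.regex matching by code point and on the \x{...} escape (Java 7 and later), so valid surrogate pairs for characters at or above #x10000 are kept while unpaired surrogates are dropped. Note that U+FFFD itself is in the allowed list, so this pattern leaves it alone, and it replaces with an empty string as in the question rather than with a space.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the AWS SDK.
final class SqsSanitizer {
    // Negated class: matches any code point outside the ranges
    // #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF.
    private static final Pattern DISALLOWED = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // Returns the input with every disallowed code point removed.
    static String sanitize(String messageJson) {
        return DISALLOWED.matcher(messageJson).replaceAll("");
    }

    public static void main(String[] args) {
        // Build the offending characters from their code units for clarity.
        String dirty = "{\"k\":\"a" + (char) 0xFFFF + "b" + (char) 0x0 + "c\"}";
        System.out.println(sanitize(dirty));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll would otherwise do for each message.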