java regex xml data-masking

Performance optimized approach to mask sensitive data in XML in Java


I am building a project that connects to multiple third-party APIs. For auditing, I log all the requests sent to these APIs and the responses received from them. The requests and responses are XML, and they contain sensitive information that I need to mask, such as PII and credit card numbers.

These are sample tags that appear in the XML:

<myTag>someSensitiveInformation</myTag>
<myTag sensitiveInfo = foo, sensitiveTwo = bar>SomeOtherSensitiveInfo</myTag>
<myTag sensitiveInfo = foo, sensitiveTwo = bar>

I could mask them with the following regex

(<myTag)([\s\S]*?)(\/>)|(<myTag)([\s\S]*?)(>)([\s\S]+?)(<\/myTag>)

And the masked tags in all the above cases would look like this,

<myTag>*************</myTag>

This works fine when traffic to my project is low. But when traffic is high, evaluating this regex causes CPU spikes, and sometimes the entire application freezes. Some of these XML requests and responses are around 100 KB in size, and a single user operation can produce several requests and responses, all of which must be masked with the regex above.

Is there an optimized way to do this? Yes, I am aware that regex is not recommended for identifying XML tags, but it seemed to be the easiest approach. Are there any external libraries that do this kind of masking without the performance cost? I would prefer not to use log4j masking because it seems to accumulate the logs inside the JVM. Otherwise, what would be the appropriate solution in Java for this kind of scenario?

Thanks in advance.


Solution

  • Don't use regular expressions to process XML. Firstly, it can be very inefficient. Secondly, which is more important in this case, it is almost always incorrect; an attacker who knows what you are doing will be able to construct XML that defeats your regular expression, for example by careful insertion of comments or whitespace or namespace declarations into the tags you are looking for.
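To make the correctness point concrete, here is a small sketch that runs the regex from the question against three hand-made inputs. The class name `RegexMaskPitfalls` and the sample documents are invented for illustration; only the pattern and the replacement text come from the question.

```java
import java.util.regex.Pattern;

public class RegexMaskPitfalls {
    // The regex from the question, unchanged.
    private static final Pattern MASK = Pattern.compile(
        "(<myTag)([\\s\\S]*?)(\\/>)|(<myTag)([\\s\\S]*?)(>)([\\s\\S]+?)(<\\/myTag>)");

    static String mask(String xml) {
        return MASK.matcher(xml).replaceAll("<myTag>*************</myTag>");
    }

    public static void main(String[] args) {
        // 1. A comment containing "</myTag>" ends the match early,
        //    so the text after the comment ("4111") is left unmasked.
        System.out.println(mask("<a><myTag>secret<!-- </myTag> -->4111</myTag></a>"));

        // 2. A namespace prefix means "<myTag" never occurs, so nothing is masked.
        System.out.println(mask("<a><p:myTag xmlns:p=\"urn:x\">secret</p:myTag></a>"));

        // 3. The first alternative lazily runs to the next "/>" anywhere in the
        //    document, so the unrelated <keep> element is swallowed as well.
        System.out.println(mask("<a><myTag>secret</myTag><keep>data</keep><br/></a>"));
    }
}
```

The third case also hints at the cost: for every `<myTag` the first alternative scans forward looking for `/>`, to the end of the document if there is none, before the second alternative is even tried.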

    Your objectives seem confused: you ask for an "optimized" way, but you are using regex because it is "easiest". Also, you haven't said what the "masked" data should look like (don't expect me to reverse engineer your requirements from the regex - you might find regexes easy to write, but no-one finds them easy to read).

    If you have a performance requirement, you need to quantify it. If you can't quantify it, then try coding it in XSLT and see whether it's fast enough; my guess is that it almost certainly is. If you really need better performance than that, then try doing it in SAX; but then you're a long way from it being "easiest".
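For comparison, a SAX version might look roughly like the sketch below. It is only an illustration under several assumptions: the class name `SaxMasker` is made up, the output is serialized by hand into a `StringBuilder`, the parser is left non-namespace-aware so that `xmlns` declarations arrive as ordinary attributes, and any element whose local name is `myTag` is treated as sensitive regardless of prefix. Comments, processing instructions and the XML declaration are simply not copied.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxMasker extends DefaultHandler {
    private static final String TAG = "myTag";           // element name from the question
    private static final String MASK = "*************";  // replacement text from the question
    private final StringBuilder out = new StringBuilder();
    private int maskDepth = 0;                           // > 0 while inside a masked element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (maskDepth > 0) { maskDepth++; return; }      // swallow markup nested in myTag
        // Assumption: match on the local part, whatever the prefix is.
        boolean masked = TAG.equals(qName.substring(qName.indexOf(':') + 1));
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            // On a masked element keep only namespace declarations.
            if (masked && !name.startsWith("xmlns")) continue;
            out.append(' ').append(name).append("=\"");
            escape(atts.getValue(i));
            out.append('"');
        }
        out.append('>');
        if (masked) { maskDepth = 1; out.append(MASK); }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (maskDepth > 0 && --maskDepth > 0) return;    // still inside the masked element
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (maskDepth == 0) escape(new String(ch, start, length));
    }

    // Re-escape characters that the parser has already unescaped.
    private void escape(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
    }

    public static String mask(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SaxMasker handler = new SaxMasker();             // one handler per document
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }
}
```

Calling `SaxMasker.mask(xml)` makes a single pass over the input with no backtracking, but the bookkeeping (depth counting, escaping, attribute handling) is exactly the kind of detail that makes this route less "easy" than it first appears.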

    In XSLT 3.0 it's simply:

    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      version="3.0">
      <xsl:mode on-no-match="shallow-copy"/>
      <xsl:template match="myTag">
        <myTag>**************</myTag>
      </xsl:template>
    </xsl:transform>
    

    and I think it compares favourably with your regex solution on all counts: performance, readability, and above all correctness.
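As a rough sketch of how such a stylesheet could be driven from Java: the XSLT processor built into the JDK only implements XSLT 1.0, so the example below swaps `xsl:mode on-no-match="shallow-copy"` for the classic identity template. With an XSLT 3.0 processor such as Saxon on the classpath, the 3.0 stylesheet above could be used as is. The class name `XsltMasker` is made up for illustration. The stylesheet is compiled once into a `Templates` object, which JAXP specifies as thread-safe, and each call creates its own lightweight `Transformer`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltMasker {
    // XSLT 1.0 equivalent of the 3.0 stylesheet: copy everything, replace myTag.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='myTag'><myTag>*************</myTag></xsl:template>"
      + "</xsl:stylesheet>";

    // Compiled once and shared between threads.
    private static final Templates TEMPLATES = compile();

    private static Templates compile() {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            return factory.newTemplates(new StreamSource(new StringReader(XSLT)));
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static String mask(String xml) throws Exception {
        Transformer transformer = TEMPLATES.newTransformer(); // not thread-safe, so one per call
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Because the input is parsed as XML rather than matched as text, a `</myTag>` hidden inside a comment has no effect on what gets masked. Whether this is fast enough for 100 KB messages under load is something to measure rather than assume.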