Search code examples
javaencryptionprivacydata-masking

How To mask personal identification information using any language like java?


I want to mask PII(personal Identification Information) like Name. Birth Date, SSN, Credit card Number, Phone Number, etc. It should remain same formate , means it looks like real data. And shouldn't be reversible.And it should take less time to mask. Any one please help me.


Solution

  • Replacing consonants with consonants, vowels with vowels, and digits with digits:

    import java.util.Random;
    
    public class Example {
    
        static char randomChar (Random r, String cs, boolean uppercase) {
            char c = cs.charAt(r.nextInt(cs.length()));
            return uppercase ? Character.toUpperCase(c) : c;
        }
    
        static String mask (String str, int seed) {
    
            final String cons = "bcdfghjklmnpqrstvwxz";
            final String vowel = "aeiouy";
            final String digit = "0123456789";
    
            Random r = new Random(seed);
            char data[] = str.toCharArray();
    
            for (int n = 0; n < data.length; ++ n) {
                char ln = Character.toLowerCase(data[n]);
                if (cons.indexOf(ln) >= 0)
                    data[n] = randomChar(r, cons, ln != data[n]);
                else if (vowel.indexOf(ln) >= 0)
                    data[n] = randomChar(r, vowel, ln != data[n]);
                else if (digit.indexOf(ln) >= 0)
                    data[n] = randomChar(r, digit, ln != data[n]);
            }
    
            return new String(data);
    
        }
    
        public static void main (String[] args) {
    
            System.out.println(mask("John Doe, 534 West Street, Wherever, XY. (888) 535-3593. 399-35-3535", 0));
    
        }
    }
    

    That produces the output:

        Bumk Tyy, 194 Wyrd Tggoyb, Flikibod, QY. (557) 722-5385. 055-08-1462
    

    From the input:

        John Doe, 534 West Street, Wherever, XY. (888) 535-3593. 399-35-3535
    

    It's up to you to generate the seed. Use a seed based on the input data (e.g. a checksum) as well as a consistent RNG if you want to guarantee that the same input always produces the same output.

    A performance optimization could be made by using a character class table instead of e.g. vowel.indexOf(). Further micro-optimizations could be made (e.g. re-using Random, operating only on char[] and reducing new String allocations, etc.)

    Heavy difficulties will be encountered with full Unicode support. Masking also does not change length of components.

    Over all I would rate this a poor, but at least moderately interesting, algorithm.

    I don't think you understand that what you are asking for (output that looks real) is outside the scope of normal encryption topics and doesn't lend itself well to "efficiency", as some amount of morphological analysis would be required to produce meaningful results (and again, internationalization complicates this significantly).