Search code examples
databaseoracle-databaserandomoracle10gdata-masking

Oracle Data Masking using random names from a temp table


We need to mask some Personally Identifiable Information in our Oracle 10g database. The process I'm using is based on another masking script that we are using for Sybase (which works fine), but since the information in the Oracle and Sybase databases is quite different, I've hit a bit of a roadblock.

The process is to select all data out of the PERSON table, into a PERSON_TRANSFER table. We then use a random number to select a random name from the PERSON_TRANSFER table, and then update the PERSON table with that random name. This works fine in Sybase because there is only one row per person in the PERSON table.

The issue I've encountered is that in the Oracle DB, there are multiple rows per PERSON, and the name may or may not be different for each row, e.g.

|PERSON|
:-----------------:
|PERSON_ID|SURNAME|
|1        |Purple |
|1        |Purple |
|1        |Pink   | <--
|2        |Gray   |
|2        |Blue   | <--
|3        |Black  |
|3        |Black  |

The PERSON_TRANSFER is a copy of this table. The table is in the millions of rows, so I'm just giving a very basic example here :)

The logic I'm currently using would just update all rows to be the same for that PERSON_ID, e.g.

|PERSON|
:-----------------:
|PERSON_ID|SURNAME|
|1        |Brown  |
|1        |Brown  |
|1        |Brown  | <--
|2        |White  |
|2        |White  | <--
|3        |Red    |
|3        |Red    |

But this is incorrect as the name that is different for that PERSON_ID needs to be masked differently, e.g.

|PERSON|
:-----------------:
|PERSON_ID|SURNAME|
|1        |Brown  |
|1        |Brown  |
|1        |Yellow | <--
|2        |White  |
|2        |Green  | <--
|3        |Red    |
|3        |Red    |

How do I get the script to update the distinct names separately, rather than just update them all based on the PERSON_ID? My script currently looks like this

DECLARE
    v_SURNAME    VARCHAR2(30);

BEGIN

    select pt.SURNAME
    into  v_SURNAME
    from   PERSON_TRANSFER pt
    where   pt.PERSON_ID = (SELECT PERSON_ID FROM
                            ( SELECT PERSON_ID FROM PERSON_TRANSFER
                            ORDER BY dbms_random.value )
                            WHERE rownum = 1);
END;

Which causes an error because too many rows are returned for that random PERSON_ID.

1) Is there a more efficient way to update the PERSON table so that names are randomly assigned? 2) How do I ensure that the PERSON table is masked correctly, in that the various surnames are kept distinct (or the same, if they are all the same) for any single PERSON_ID?

I'm hoping this is enough information. I've simplified it a fair bit (the table has a lot more columns, such as First Name, DOB, TFN, etc.) in the hope that it makes the explanation easier.

Any input/advice/help would be greatly appreciated :)

Thanks.


Solution

  • One of the complications is that the same surname may appear under different person_id's in the PERSON table. You may be better off using a separate, auxiliary table holding surnames that are distinct (for example you can populate it by selecting distinct surnames from PERSONS).

    Setup:

    create table persons (person_id, surname) as (
      select 1, 'Purple' from dual union all
      select 1, 'Purple' from dual union all
      select 1, 'Pink'   from dual union all
      select 2, 'Gray'   from dual union all
      select 2, 'Blue'   from dual union all
      select 3, 'Black'  from dual union all
      select 3, 'Black'  from dual
    );
    create table mask_names (person_id, surname) as (
      select 1, 'Apple'  from dual union all
      select 2, 'Banana' from dual union all
      select 3, 'Grape'  from dual union all
      select 4, 'Orange' from dual union all
      select 5, 'Pear'   from dual union all
      select 6, 'Plum'   from dual
    );
    commit;
    

    CTAS to create PERSON_TRANSFER:

    create table person_transfer (person_id, surname) as (
    select ranked.person_id, rand.surname
    from   ( select person_id, surname, 
                    dense_rank() over (order by surname) as rk
             from   persons
           ) ranked 
           inner join 
           ( select surname, row_number() over (order by dbms_random.value()) as rnd
             from   mask_names
           ) rand
                  on ranked.rk = rand.rnd
    );
    commit;
    

    Outcome:

    SQL> select * from person_transfer order by person_id, surname;
    
     PERSON_ID SURNAME
    ---------- -------
             1 Pear
             1 Pear
             1 Plum
             2 Banana
             2 Grape
             3 Apple
             3 Apple
    

    Added at OP's request: The scope has been extended - the requirement now is to update surname in the original table (PERSONS). This can be best done with the merge statement and the join (sub)query I demonstrated earlier. This works best when the PERSONS table has a PK, and indeed the OP said the real-life table PERSONS has such a PK, made up of the person_id column and an additional column, date_from. In the script below, I drop persons and recreate it to include this additional column. Then I show the query and the result.

    Note - a mask_names table is still needed. A tempting alternative would be to just shuffle the surnames already present in persons so there would be no need for a "helper" table. Alas that won't work. For example, in a trivial example persons has only one row. To obfuscate surnames, one MUST come up with surnames not in the original table. More interestingly, assume every person_id has exactly two rows, with distinct surnames, but those surnames in every case are 'John' and 'Mary'. It doesn't help to just shuffle those two names. One does need a "helper" table like mask_names.

    New setup:

    drop table persons;
    
    create table persons (person_id, date_from, surname) as (
      select 1, date '2016-01-04', 'Purple' from dual union all
      select 1, date '2016-01-20', 'Purple' from dual union all
      select 1, date '2016-03-20', 'Pink'   from dual union all
      select 2, date '2016-01-24', 'Gray'   from dual union all
      select 2, date '2016-03-21', 'Blue'   from dual union all
      select 3, date '2016-04-02', 'Black'  from dual union all
      select 3, date '2016-02-13', 'Black'  from dual
    );
    
    commit;
    
    select * from persons;
    
    
     PERSON_ID DATE_FROM  SURNAME
    ---------- ---------- -------
             1 2016-01-04 Purple
             1 2016-01-20 Purple
             1 2016-03-20 Pink
             2 2016-01-24 Gray
             2 2016-03-21 Blue
             3 2016-04-02 Black
             3 2016-02-13 Black
    
    7 rows selected.
    

    New query and result:

    merge into persons p
      using (
              select ranked.person_id, ranked.date_from, rand.surname
              from ( 
                     select person_id, date_from, surname, 
                            dense_rank() over (order by surname) as rk
                     from   persons
                   ) ranked 
              inner join ( 
                     select surname, row_number() over (order by dbms_random.value()) as rnd
                     from   mask_names
                   ) rand
              on ranked.rk = rand.rnd
            ) t
      on (p.person_id = t.person_id and p.date_from = t.date_from)
    when matched then update
      set p.surname = t.surname;
    
    commit;
    
    select * from persons;
    
     PERSON_ID DATE_FROM  SURNAME
    ---------- ---------- -------
             1 2016-01-04 Apple
             1 2016-01-20 Apple
             1 2016-03-20 Orange
             2 2016-01-24 Plum
             2 2016-03-21 Grape
             3 2016-04-02 Banana
             3 2016-02-13 Banana
    
    7 rows selected.