Search code examples
sqlsql-servert-sqlselectwindow-functions

Pseudo-anonymize User IDs within locations


Suppose I have the following table

|   City |   Name |
|--------|--------|
|  Paris |    Tim |
|  Paris |  Alice |
|  Paris | Sandra |
|  Paris | Sandra |
|  Paris |   Lisa |
| London |    Joe |
| London |    Joe |
| London |    Tim |
| London |  Peter |

And I want generate an (pseudo) "anonymized" name per city for each unique user (for instance city name per city suffixed by _n where n is a simple count for each unique name per city).

I could do the following

SELECT City
       , Name
       , CONCAT(City, '_', ROW_NUMBER() OVER (PARTITION BY City ORDER BY Name)) AS anon_name
  FROM employees
 GROUP BY City
          , Name;

which gives me what I want:

|   City |   Name | anon_name |
|--------|--------|-----------|
| London |    Joe |  London_1 |
| London |  Peter |  London_2 |
| London |    Tim |  London_3 |
|  Paris |  Alice |   Paris_1 |
|  Paris |   Lisa |   Paris_2 |
|  Paris | Sandra |   Paris_3 |
|  Paris |    Tim |   Paris_4 |

However, I am not very happy with the solution as the users can be derived as the number is simply assigned alphabetically.

Instead of a sequence I could use NEWID():

SELECT City
       , Name
       , CONCAT(City, '_', NEWID()) AS anon_name
  FROM employees
 GROUP BY City
          , Name;
|   City |   Name |                                   anon_name |
|--------|--------|---------------------------------------------|
| London |    Joe | London_A06BFFC9-E122-4CB2-AFDE-B1980EC50A1F |
| London |  Peter | London_B153D072-F75A-4004-B922-189D64656AC1 |
| London |    Tim | London_AF40F949-5D55-4896-907F-17EE0165C569 |
|  Paris |  Alice |  Paris_9F779CFA-B799-409A-801B-9CC044295B13 |
|  Paris |   Lisa |  Paris_CFA8AB45-CCEF-4223-B86B-8A71A8335D13 |
|  Paris | Sandra |  Paris_1FEE5266-8FFF-431E-A76D-CB490FE77BB4 |
|  Paris |    Tim |  Paris_A7BAF19F-1B97-470D-BF41-81C5B6BFFCBA |

This generates quite long labels, but it is not possible to relate it to the name.

Hence, my question:

Can I get the short version (that is just a number suffix) which is however randomized such that the name cannot be easily derived from the number, such as in

|   City |   Name | anon_name |
|--------|--------|-----------|
| London |    Joe |  London_2 |
| London |  Peter |  London_1 |
| London |    Tim |  London_3 |
|  Paris |  Alice |   Paris_1 |
|  Paris |   Lisa |   Paris_2 |
|  Paris | Sandra |   Paris_4 |
|  Paris |    Tim |   Paris_3 |

Sample Data generated by:

CREATE TABLE employees
    ([City] varchar(8), [Name] varchar(8))
;
    
INSERT INTO employees
    ([City], [Name])
VALUES
    ('Paris', 'Tim'),
    ('Paris', 'Alice'),
    ('Paris', 'Sandra'),
    ('Paris', 'Sandra'),
    ('London', 'Joe'),
    ('London', 'Joe'),
    ('London', 'Tim'),
    ('London', 'Peter')
;

Solution

  • You can use newid() in the first expression:

    SELECT City, Name,
           CONCAT(City, '_', ROW_NUMBER() OVER (PARTITION BY City ORDER BY NEWID())) AS anon_name
    FROM employees
    GROUP BY City, Name;
    

    If you just want a randomized suffix, you could use checksum():

    SELECT City, Name,
           CONCAT(City, '_', CHECKSUM(Name)) AS anon_name
    FROM employees
    GROUP BY City, Name;
    

    However, there is the risk of collision -- where two users with the same name have different checksums. There are ways around this, if you don't want sequential ids.