"User-friendly" but secure algorithm for anonymising log files


I have a set of IIS log files that I'd like to publish for a research study.

However, these contain some sensitive information that I would like to anonymise, e.g.:

UserName=XXXX65

I'd like to use an algorithm that stays reasonably "user-friendly" for visual inspection of the log files, but which is also secure enough that it is impossible or impractical to derive the original UserNames.

I can't just ** out all the UserNames, since it is important to be able to correlate requests from the same username across the logs.

Using SHA1 hashing gives me something like

UserName=AD5CBF0BA0A8646EBDBA6BE1B5DA4FCB1F385D39

which is just about usable.

SHA256 gives:

UserName=C9B84EE0DD2EFA53645D5268602E23A9E788903B31BBEB99C03982D9B50AF70C

which is starting to get too long to be usable,

and PBKDF2-SHA1 hashing gives

UserName=1000:153JkeeGAqtG2UsHX57RBqm3O0DIkXhF:31BBDlQrUqqeyaMo/ikCJAXRC4fFXf82

which in my opinion is much too long to be usable.

Is there an algorithm that gives a relatively short one-way hash but remains secure / non-reversible?

I'm looking for something where you can scan the log files with your eye, and still notice UserName correlations.


Solution

  • One-way hashes aren't really anonymous. Why? Anyone can easily verify which user corresponds to which hash:

    1. Say "root" is a user.
    2. You apply hash("root") and it turns out the result is foo. You publish logs containing several references to foo.
    3. I make a smart guess that root is a user on your machine. I then apply hash("root") myself and obtain foo. Now I know which logs correspond to "root".

    So in essence: hashes are useful when you later want to be able to verify from the published logs that a certain user caused a certain log entry, not when the goal is anonymity.
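    The guessing steps above can be sketched in C. This is only an illustration under stated assumptions: `toy_hash` (FNV-1a, 32-bit) is a non-cryptographic stand-in for whatever unkeyed hash was used to publish the logs, and `guess_user` and the candidate list are hypothetical names invented for this example. The same loop would work against SHA-1 or SHA-256 output, because the attacker only needs to recompute the hash of each guess.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a (32-bit). NOT a cryptographic hash: it stands in for any
   unkeyed, deterministic hash such as SHA-1. */
uint32_t toy_hash (const char *s)
{
  uint32_t h = 2166136261u;   /* FNV offset basis */

  assert (s != NULL);
  for (; *s != '\0'; s++)
    {
      h ^= (unsigned char) *s;
      h *= 16777619u;         /* FNV prime */
    }
  return h;
}

/* Return the candidate whose hash equals the published value,
   or NULL if none of the guesses matches. */
const char *guess_user (uint32_t published,
                        const char *const *candidates, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    if (toy_hash (candidates[i]) == published)
      return candidates[i];
  return NULL;
}
```

    Given a hash value copied from the published logs and a list of likely user names, `guess_user` returns the name that produced it. User names are usually short and guessable, so such a list does not need to be large.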

    Plus, hashes are difficult to read.

    I'd generate random pronounceable strings, and map one to each user name. Then publish the logs using the random strings. Truly anonymous and truly readable.

    How to produce random pronounceable strings? Alternate consonants and vowels. Here's how to do it in C. Of course, this only produces a single random 6-character string; you need more logic around it when processing your logs, such as mapping each user name to a string and making sure the strings are unique:

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <time.h>
    
    #define NAME_LENGTH 6
    
    /* Pick one random character from a non-empty string. */
    #define RAND_CHAR(string) \
      ( (string)[rand () % strlen (string)])
    
    int main (void)
    {
      char vowel[] = "aeiou";
      char consonant[] = "bcdfghjklmnpqrstvwxyz";
      int i;
    
      char rand_name[NAME_LENGTH + 1];
    
      /* Seed the generator once so each run gives a different name. */
      srand ((unsigned int) time (NULL));
    
      /* Even positions get a consonant, odd positions a vowel. */
      for (i = 0; i < NAME_LENGTH; i++)
        rand_name[i] = (i % 2) ? RAND_CHAR (vowel) : RAND_CHAR (consonant);
    
      rand_name[NAME_LENGTH] = '\0';
    
      printf ("%s\n", rand_name);
    
      return 0;
    }
    

    Here are some examples it produced for me:

    cemala
    gogipa
    topeqe
    lixate
    fasota
    rironu

    If the number of users you serve is comparable to 5^3 * 21^3 (1,157,625, the number of distinct 6-character strings this scheme can produce), you need to generate longer strings, and maybe use separators to keep them easy to pronounce:

    cemala-gogipa
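
    The extra bookkeeping mentioned earlier (one string per user name, no string used twice) might look roughly like the sketch below. It is a minimal illustration, not a finished tool: `alias_for`, `MAX_USERS`, `MAX_USERNAME` and the fixed-size in-memory table are assumptions made for this example, and the linear scans are only reasonable for a small number of users.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LENGTH 6
#define MAX_USERS 1024     /* hypothetical table capacity */
#define MAX_USERNAME 64    /* hypothetical limit, including the '\0' */

struct alias_entry
{
  char user[MAX_USERNAME];
  char alias[NAME_LENGTH + 1];
};

static struct alias_entry table[MAX_USERS];
static size_t table_size = 0;

/* Fill out[0..NAME_LENGTH] with alternating consonants and vowels.
   rand () is used to match the snippet above; it is not a
   cryptographic generator. */
static void random_name (char *out)
{
  static const char vowel[] = "aeiou";
  static const char consonant[] = "bcdfghjklmnpqrstvwxyz";
  int i;

  for (i = 0; i < NAME_LENGTH; i++)
    out[i] = (i % 2) ? vowel[rand () % (sizeof vowel - 1)]
                     : consonant[rand () % (sizeof consonant - 1)];
  out[NAME_LENGTH] = '\0';
}

/* Return 1 if some stored user already has this alias. */
static int alias_in_use (const char *alias)
{
  size_t i;

  for (i = 0; i < table_size; i++)
    if (strcmp (table[i].alias, alias) == 0)
      return 1;
  return 0;
}

/* Return the alias for a user name, creating a fresh unique one the
   first time the name is seen. Returns NULL if the table is full or
   the name is too long to store. */
const char *alias_for (const char *user)
{
  size_t i;
  struct alias_entry *e;

  assert (user != NULL);

  for (i = 0; i < table_size; i++)
    if (strcmp (table[i].user, user) == 0)
      return table[i].alias;

  if (table_size == MAX_USERS || strlen (user) >= MAX_USERNAME)
    return NULL;

  e = &table[table_size];
  do
    random_name (e->alias);   /* retry until the alias is unused */
  while (alias_in_use (e->alias));

  strcpy (e->user, user);
  table_size++;
  return e->alias;
}
```

    A log-processing program would call `srand` once at start-up and then replace every `UserName` value with the result of `alias_for`. The table is the only link between real names and aliases, so it should be discarded or kept private once the logs are published.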