I am generating log records about user actions. For privacy reasons, these need to be anonymized after N days. However, I also need to run reports against this anonymized data.
I want all actions by real user A to be listed under fake user X in the anonymized logs - records of one user must still remain records of one (fake) user in the logs. This obviously means that I need to have some mapping between real and fake users, which I use when anonymizing new records. Of course, this totally defeats the point of anonymization - if there's a mapping, the original user data can be restored.
Example:
User Frank Müller bought 3 cans of soup.
Three days later, User Frank Müller asked for a refund for 3 cans of soup.
When I anonymize the second log entry, the first one has already been anonymized. I still want both log records to point to the same user. Well, that seems almost impossible to me in practice, so I would like to use some method of splitting up data that hopefully allows me to keep as much integrity as possible in the data. Perhaps using the logs as a data warehouse - split everything into facts and just accept the fact that some dimensions cannot be analyzed?
Have you encountered such a scenario before? What are my options here? I obviously need to make some sort of compromise - what has proven effective for you? How can I get the most use out of such data?
At the risk of being pedantic, what you describe is not anonymous data, but rather pseudonymous data. That said, have you considered using some sort of keyed hash function such as HMAC-SHA1 to perform the pseudonym generation? You can reach a fair compromise with a scheme like this:
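As a rough illustration of the keyed-hash idea (a minimal sketch, not necessarily the exact scheme intended here), the same real user always maps to the same pseudonym as long as the same secret key is used. The names `SECRET_KEY` and `pseudonymize` are hypothetical, and generating the key in-process is an assumption made only to keep the example self-contained; in practice the key would have to be kept somewhere separate from the logs.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key (assumption): generated here only so the sketch runs.
# A real deployment would load it from storage that is kept apart from the logs.
SECRET_KEY = secrets.token_bytes(32)


def pseudonymize(user_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a real user identifier to a stable pseudonym using HMAC-SHA1."""
    mac = hmac.new(key, user_id.encode("utf-8"), hashlib.sha1)
    return mac.hexdigest()


# Both log entries for the same real user get the same fake user.
purchase_user = pseudonymize("Frank Müller")
refund_user = pseudonymize("Frank Müller")
other_user = pseudonymize("Erika Mustermann")
```

Because the pseudonym is recomputed from the user identifier and the key, no lookup table between real and fake users has to be stored. Anyone holding the key can still recompute the pseudonym for a known user, so the key itself is the thing to protect. Swapping `hashlib.sha1` for `hashlib.sha256` is a one-line change if a newer hash is preferred.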
If you do this, there are two main routes of attack to obtain the real identity from the pseudonym.
Pseudonymous data sets are notoriously vulnerable to information fusion attacks -- you have to strip out or "blur" a lot of key correlating information to make the data set resistant to such attacks, but exactly how much you need to strip is a topic of current research.
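To make "blurring" a bit more concrete, here is a small sketch that coarsens a timestamp to the calendar day and drops one correlating field. The record layout and the field names (`timestamp`, `ip_address`) are assumptions for illustration only, and this says nothing about how much blurring is actually enough.

```python
from datetime import datetime


def blur_record(record: dict) -> dict:
    """Return a copy of the record with correlating details coarsened or removed."""
    blurred = dict(record)
    # Keep only the calendar day; the exact time of day is discarded.
    blurred["timestamp"] = record["timestamp"].date().isoformat()
    # Drop a field that could be joined against other data sets (hypothetical field).
    blurred.pop("ip_address", None)
    return blurred


# Hypothetical log record that has already been pseudonymized.
record = {
    "user": "pseudonym-X",
    "action": "bought 3 cans of soup",
    "timestamp": datetime(2024, 3, 5, 14, 37, 12),
    "ip_address": "203.0.113.7",
}
blurred = blur_record(record)
```

The trade-off is the one from the question: every field that gets coarsened or dropped is a dimension the reports can no longer analyze at full resolution.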