Data anonymisation: how to proceed

I often use production data in my dev environment for testing. Because some of it is sensitive, I need to anonymise it. I have identified the sensitive fields, such as name and address. For the name field, for example, I am planning to simply run an update that sets each value to a random one. Is this an effective way to anonymise the data? Any ideas?

Solution

You could do this manually, as you suggest, by replacing personal information with random strings. Even better, if you want the replacement values to stay realistic (names that look like names, addresses that look like addresses), a library such as Faker for Python can help. If you do this with any regularity, though, hardcoded solutions will eventually break when the schema changes.
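As a rough illustration of the update-with-random-values approach, here is a minimal sketch using only the Python standard library and an in-memory SQLite database. The table name `customers`, the key column `id`, the `SENSITIVE_COLUMNS` mapping and the helper `anonymise_table` are all hypothetical names made up for this example. The generators produce plain random strings; a Faker-based generator could be swapped in if more realistic values are wanted. Columns that are listed but missing from the table are skipped, which is one way to make the script less brittle when the schema changes.

```python
import random
import sqlite3
import string


def random_string(rng, length=10):
    """Return a random lowercase string that carries no trace of the original value."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


# Hypothetical mapping: sensitive column name -> function producing a replacement.
SENSITIVE_COLUMNS = {
    "name": lambda rng: random_string(rng, 8).capitalize(),
    "address": lambda rng: f"{rng.randint(1, 999)} {random_string(rng, 6).capitalize()} Street",
}


def anonymise_table(conn, table, key_column, generators, seed=None):
    """Overwrite sensitive columns row by row and return the number of rows updated.

    `table` and `key_column` are interpolated into the SQL text, so they must
    come from trusted configuration, never from user input.
    """
    rng = random.Random(seed)
    # Look up the columns that really exist, so a renamed or dropped column
    # is skipped instead of making the whole script fail.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    columns = [c for c in generators if c in existing]
    if not columns:
        return 0
    keys = [row[0] for row in conn.execute(f"SELECT {key_column} FROM {table}")]
    assignments = ", ".join(f"{c} = ?" for c in columns)
    for key in keys:
        # A fresh random value per row, so rows do not all share one replacement.
        values = [generators[c](rng) for c in columns]
        conn.execute(
            f"UPDATE {table} SET {assignments} WHERE {key_column} = ?",
            (*values, key),
        )
    conn.commit()
    return len(keys)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
    conn.execute("INSERT INTO customers (name, address) VALUES ('Alice Smith', '1 Main Road')")
    anonymise_table(conn, "customers", "id", SENSITIVE_COLUMNS)
    print(conn.execute("SELECT * FROM customers").fetchall())
```

Keeping the list of sensitive columns in one mapping means a schema change needs a one-line edit rather than a rewrite of the SQL. Any column that is not listed is left untouched, so the mapping has to be reviewed whenever new fields are added.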

There is also a body of mathematical theory on how best to anonymise a dataset, and there are plenty of examples of supposedly anonymised sensitive data being linked back to individuals. This usually happens because the dataset wasn't properly anonymised or because it was combined with publicly available data. Even so, working with anonymised data in test is definitely safer than working with raw production data.
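One commonly cited measure from that theory is k-anonymity: a dataset is k-anonymous if every combination of "quasi-identifier" values (fields that are not names but can still narrow down a person, such as postcode or birth year) is shared by at least k records. The sketch below is only an illustration of that idea, not a full privacy audit. The function name `smallest_group_size`, the sample rows and the choice of quasi-identifier columns are assumptions made up for this example.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same
    quasi-identifier values. The dataset is k-anonymous for any k up to
    this number; a result of 1 means at least one row is unique."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical rows whose names have already been replaced with random strings.
rows = [
    {"name": "Qwzkpath", "postcode": "AB1", "birth_year": 1980},
    {"name": "Lmvbrtoe", "postcode": "AB1", "birth_year": 1980},
    {"name": "Xdcuhyna", "postcode": "CD2", "birth_year": 1975},
]

k = smallest_group_size(rows, ["postcode", "birth_year"])
print(f"smallest group size: {k}")
```

Here the third row is the only one with its postcode and birth year, so randomising the name alone would not stop someone who knows those two facts from picking that record out.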