
How to generate a fake name using Faker() passing existing name as the seed_instance


I have a dataframe with customer names which I need to use for test data purposes, but need to obfuscate the names. The name needs to be deterministic: if the same name exists in the table then it should be obfuscated with the same 'fake' name.

For example, both occurrences of Susan H need to have the same fake name:

FullName   FakeName
Susan H    John F
Eva B      Sarah E
Susan H    John F

I have discovered Faker() for this purpose. How can I adapt the code below so that I can pass in the existing name as the seed for seed_instance, so that the resulting fake name is the same for all instances of that name in the dataframe?

from faker import Faker
import pyspark.sql.functions as F

fullname_list = [[1,"Sarah Markwaithe"]
,[2,"John Bellamy"]
,[3,"Jordan Fingleberry"]
,[4,"Susan Merchant"]
,[5,"Bobby Franker"]
,[6,"Sally Smith-Holdern"]
,[7,"Finley Farringdon"]
,[8,"Sarah Markwaithe"]
,[9,"Simone Grath"]
,[10,"Frederick Balchum"]
]
df_schema = ["Id","FullName"]
# create example df
df = spark.createDataFrame(fullname_list, df_schema)

fake = Faker('en_GB')
fake_name = F.udf(fake.name)

df = df.withColumn("FakeFullName", fake_name())

df.display()

I understand that I can use seed_instance, but I have no clue how to implement it in the code above so that I can pass "FullName" to the UDF (apologies, Python newbie with tight delivery deadlines):

fake.seed_instance("Susan H")
fake.name()

Solution

  • I think I have worked out what to do. I have no idea whether it is the right approach (best practice, etc.), so feel free to comment and let me know of any other (more efficient or more Pythonic) methods:

    from faker import Faker
    import pyspark.sql.functions as F
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType
    
    fullname_list = [[1,"Sarah Markwaithe"]
    ,[2,"John Bellamy"]
    ,[3,"Jordan Fingleberry"]
    ,[4,"Susan Merchant"]
    ,[5,"Bobby Franker"]
    ,[6,"Sally Smith-Holdern"]
    ,[7,"Finley Farringdon"]
    ,[8,"Sarah Markwaithe"]
    ,[9,"Simone Grath"]
    ,[10,"Frederick Balchum"]
    ]
    df_schema = ["Id","FullName"]
    # create example df
    df = spark.createDataFrame(fullname_list, df_schema)
    
    fake = Faker('en_GB')
    
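    # create a function that seeds Faker with the original name,
    # so the same input always yields the same fake name
    def generate_fake_name(full_name):  # renamed from `str`, which shadowed the builtin
        fake.seed_instance(full_name)
        return fake.name()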
    
    # Convert to UDF function
    fake_name = udf(generate_fake_name, StringType())
    
    # use UDF over dataframe
    df = df.withColumn("FakeFullName", fake_name(col("FullName")))
    df.show()
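
    To see why seeding with the original name gives a stable result, here is a minimal sketch of the same idea using only the standard library's random module instead of Faker. FIRST_NAMES, LAST_NAMES and pseudonym are made-up names for illustration, not part of Faker or of the code above.

```python
import random

# Hypothetical name pools, for illustration only; Faker ships far larger ones.
FIRST_NAMES = ["John", "Sarah", "Amir", "Lucy", "Owen", "Priya"]
LAST_NAMES = ["Fletcher", "Evans", "Hart", "Doyle", "Quinn", "Nash"]


def pseudonym(full_name: str) -> str:
    """Return a fake name that depends only on the input string."""
    # A private generator seeded with the name: the same input always
    # produces the same sequence of choices, and no global state is touched.
    rng = random.Random(full_name)
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


names = ["Sarah Markwaithe", "John Bellamy", "Sarah Markwaithe"]
fakes = [pseudonym(n) for n in names]
print(fakes)  # the first and third entries are identical
```

    One thing to keep in mind with any approach like this: the pool of fake names is finite, so two different real names can end up with the same fake name.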
    


    UPDATE: I am also including this in case it helps someone else trying to achieve the same thing. I only wanted to generate a fake name if the column contained a name. To test this, I changed row 3 of the dataframe above from [3,"Jordan Fingleberry"] to [3,""].

    # extra imports needed for the conditional
    from pyspark.sql.functions import when, lit

    # use UDF over dataframe to overwrite the existing column
    # only replace with a fake name if the column to be replaced contains a value
    df = df.withColumn("FullName", when(col("FullName") == "", lit(None)).otherwise(fake_name(col("FullName"))))
    df.show()
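
    A possible refinement, sketched in plain Python: the check above only catches empty strings. If the column can also contain nulls, the comparison with "" is not true for them, so the UDF would be called with None. Seeding Python's random module with None falls back to system entropy, so I would expect a different fake name on each run in that case. Guarding inside the function avoids this. Here stand_in_fake_name is a hypothetical placeholder for the Faker-based function, and fake_name_or_none is a made-up name.

```python
import random
from typing import Optional


def stand_in_fake_name(seed: str) -> str:
    # Hypothetical stand-in for the Faker-based generate_fake_name:
    # deterministic for a given seed string.
    rng = random.Random(seed)
    return f"Person-{rng.randrange(10_000)}"


def fake_name_or_none(full_name: Optional[str]) -> Optional[str]:
    """Return a deterministic fake name, or None when there is no real name."""
    # Treat None, empty and whitespace-only values as "no name present".
    if full_name is None or full_name.strip() == "":
        return None
    return stand_in_fake_name(full_name)


for value in ["Sarah Markwaithe", "", None, "Sarah Markwaithe"]:
    print(repr(value), "->", fake_name_or_none(value))
```

    Wrapped in udf(..., StringType()), a function like this could be applied to the column directly, without the when/otherwise.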
    
