python faker

Maximum limit of distinct fake data using the Python Faker package


I have used Python Faker to generate fake data, but I need to know the maximum number of distinct fake values (e.g. fake names) that can be generated with Faker (e.g. fake.name()).

I generated 100,000 fake names and got fewer than 76,000 distinct names. I need to know the maximum limit so that I know how far this package can scale for generating data.

I need to generate a huge dataset. I also want to know whether the PHP and Perl Faker packages are the same as the Python one across these environments.

Suggestions for other packages that can generate huge datasets would be highly appreciated.


Solution

  • I had this same issue and looked more into it.

    In the en_US provider there are about 1000 last names and 750 first names, for about 750,000 unique combinations. If you randomly select a first and last name, there is a chance you'll get duplicates. But that's how the real world works: there are many John Smiths and Robert Doyles out there.

    There are 7203 first names and 473 last names in the en profile, which helps somewhat. Faker picks a combination of first name and last name, so there are about 7203 * 473 = 3,407,019 possible combinations.

    But still, there is a chance you'll get duplicates.
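
    To get a rough feel for how many distinct names to expect, here is a minimal sketch. It assumes every first/last combination is equally likely and that draws are independent; Faker's actual sampling need not be uniform, so measured counts (such as the fewer than 76,000 reported in the question) can come out lower. The function name expected_distinct is hypothetical, not part of Faker.

```python
def expected_distinct(pool_size: int, draws: int) -> float:
    """Expected number of distinct values after `draws` independent, uniform
    draws (with replacement) from `pool_size` equally likely values."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    # Each value is missed by a single draw with probability (1 - 1/pool_size),
    # so it is missed by all draws with that probability raised to `draws`.
    return pool_size * (1.0 - (1.0 - 1.0 / pool_size) ** draws)


# Pool sizes quoted in this answer; the draw count comes from the question.
for pool in (1000 * 750, 7203 * 473):
    print(pool, round(expected_distinct(pool, 100_000)))
```

    Even under this idealized assumption the expected count is below the number of draws, which is why duplicates show up long before the pool is exhausted.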

    I solve this problem by adding numbers to names.
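
    One possible way to add numbers is sketched below. It is only an illustration of the idea, not necessarily the exact approach used here: the first occurrence of a name is kept as-is and each repeat gets a running number appended. The helper name make_unique is hypothetical, and the sketch assumes the generated names do not already end in such a number.

```python
from collections import Counter
from typing import Iterable, Iterator


def make_unique(names: Iterable[str]) -> Iterator[str]:
    """Yield the first occurrence of a name unchanged and append a
    running number (2, 3, ...) to every later repeat."""
    seen: Counter = Counter()
    for name in names:
        seen[name] += 1
        count = seen[name]
        yield name if count == 1 else f"{name} {count}"


sample = ["John Smith", "Robert Doyle", "John Smith"]
print(list(make_unique(sample)))  # ['John Smith', 'Robert Doyle', 'John Smith 2']
```

    With Faker installed, the same helper could wrap a generator such as (fake.name() for _ in range(100_000)), where fake = Faker().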

    "I need to generate huge dataset."

    Keep in mind that in reality, any huge dataset of names will have duplicates. I work with large datasets (> 1 million names) and we see a ton of duplicate first and last names.

    If you read the faker package code, you can probably figure out how to modify it so you get all 3M distinct names.
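
    As an alternative to modifying the package, here is a rough sketch of enumerating every combination yourself. It assumes you have already extracted the first-name and last-name lists (for example by reading the provider source) and that each list contains no repeats. The function all_full_names and the tiny sample lists are hypothetical.

```python
import itertools
import random
from typing import List, Sequence


def all_full_names(first_names: Sequence[str], last_names: Sequence[str],
                   seed: int = 0) -> List[str]:
    """Return every first/last combination exactly once, in shuffled order."""
    combos = [f"{first} {last}"
              for first, last in itertools.product(first_names, last_names)]
    # Shuffle with a fixed seed so the order looks random but is reproducible.
    random.Random(seed).shuffle(combos)
    return combos


# Tiny stand-in lists; real lists would come from the Faker provider data.
names = all_full_names(["John", "Robert", "Mary"], ["Smith", "Doyle"])
print(len(names), len(set(names)))  # 6 6
```

    With lists of 7203 and 473 entries this produces 3,407,019 strings, which fits in memory on a typical machine; taking the first N entries gives N names that are guaranteed distinct.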