apache-spark · parquet · generate

What is a fast way to generate parquet data files with Spark for testing Hive/Presto/Drill/etc?


I frequently find myself needing to generate parquet files for testing infrastructure components like Hive, Presto, Drill, etc.

There are surprisingly few sample parquet data sets online, and one of the only ones I've come across, https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet, is mock data for credit card numbers, incomes, etc. I don't like having that in my data lakes in case someone thinks it's real.

What is the best way to generate parquet data files when you need to test? I usually have Spark around and end up using it, so I'll post my solution as an answer since one doesn't seem to exist here. But I'm curious what better solutions people have, using Spark or other technologies.
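
For context, here's roughly what I end up doing with plain PySpark when I don't want any extra dependencies. This is just a sketch; the column names, row count, and output path are arbitrary placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # 100 rows of sequential ids plus a few derived columns, enough to
    # exercise string, double, and timestamp handling in the query engine
    df = (spark.range(100)
          .withColumn('name', F.concat(F.lit('user_'), F.col('id').cast('string')))
          .withColumn('score', F.rand(seed=42) * 100)
          .withColumn('ts', F.current_timestamp()))

    df.write.mode('overwrite').parquet('./tmp/spark_sample_data')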


Solution

  • The farsante library lets you generate fake PySpark / Pandas datasets that can easily be written out in the Parquet file format. Here's an example:

    import farsante
    from mimesis import Person
    from mimesis import Address
    from mimesis import Datetime
    
    # mimesis providers supply the fake values for each column
    person = Person()
    address = Address()
    datetime = Datetime()
    
    # build a 3-row PySpark DataFrame; each provider method becomes a column
    df = farsante.pyspark_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
    df.write.mode('overwrite').parquet('./tmp/spark_fake_data')
    

    It's easier to use plain Pandas to write out small sample Parquet files; Spark isn't needed for a task like this.

    # same generators, but building a 3-row Pandas DataFrame instead
    df = farsante.pandas_df([person.full_name, person.email, address.city, address.state, datetime.datetime], 3)
    df.to_parquet('./tmp/fake_data.parquet', index=False)
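
    As a quick sanity check, you can read the file straight back (pandas hands Parquet I/O to pyarrow or fastparquet, so one of those needs to be installed):

    import pandas as pd

    # read the Parquet file back and eyeball the generated rows
    print(pd.read_parquet('./tmp/fake_data.parquet'))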
    

    There looks to be a Scala faker library, but it doesn't appear nearly as mature as mimesis. Go also has good faker and Parquet libraries, so that's another option for generating fake data.