Tags: python, csv, dataframe, parquet, dask

Parallelize conversion of a single 16M-row CSV to Parquet with Dask


The following operation works, but takes nearly 2h:

from dask import dataframe as ddf
ddf.read_csv('data.csv').to_parquet('data.pq')

Is there a way to parallelize this?

The file data.csv is ~2 GB uncompressed, with 16 million rows and 22 columns.
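
One thing worth checking is how many partitions Dask created from the CSV, since each partition is converted and written as a separate task; a quick sketch, using the same alias and file name as above:

from dask import dataframe as ddf
df = ddf.read_csv('data.csv')
print(df.npartitions)  # each partition becomes an independent task when writing Parquet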


Solution

  • I'm not sure whether it is a problem with the data or not. I made a toy example on my machine, and the same command takes ~9 seconds:

    import dask.dataframe as dd
    import numpy as np
    import pandas as pd
    from dask.distributed import Client
    import dask
    client = Client()
    # evaluate `client` in a notebook cell to get a link to the dashboard
    client
    
    # fake df size ~2.1 GB
    # takes ~180 seconds
    N = int(5e6)
    df = pd.DataFrame({i: np.random.rand(N) 
                       for i in range(22)})
    df.to_csv("data.csv", index=False)
    
    # the following takes ~9 seconds on my machine
    dd.read_csv("data.csv").to_parquet("data_pq")