Tags: python, csv, dataframe, parquet, dask

Parallelize conversion of a single 16M-row CSV to Parquet with Dask


The following operation works, but takes nearly 2h:

from dask import dataframe as ddf
ddf.read_csv('data.csv').to_parquet('data.pq')

Is there a way to parallelize this?

The file data.csv is ~2 GB uncompressed, with 16 million rows and 22 columns.
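
One thing worth checking is how many partitions Dask created from the CSV, since each partition is converted and written as a separate task; a quick sketch, using the same alias and file name as above:

from dask import dataframe as ddf
df = ddf.read_csv('data.csv')
print(df.npartitions)  # each partition becomes an independent task when writing Parquet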


Solution

  • I'm not sure whether it is a problem with the data or not. I made a toy example on my machine, and the same command takes ~9 seconds:

    import dask.dataframe as dd
    import numpy as np
    import pandas as pd
    from dask.distributed import Client
    import dask
    client = Client()
    # evaluate `client` in a notebook cell to get a link to the dashboard
    client
    
    # fake df size ~2.1 GB
    # takes ~180 seconds
    N = int(5e6)
    df = pd.DataFrame({i: np.random.rand(N) 
                       for i in range(22)})
    df.to_csv("data.csv", index=False)
    
    # the following takes ~9 seconds on my machine
    dd.read_csv("data.csv").to_parquet("data_pq")