The following operation works, but takes nearly 2h:
from dask import dataframe as ddf
ddf.read_csv('data.csv').to_parquet('data.pq')
Is there a way to parallelize this?
The file data.csv is ~2 GB uncompressed, with 16 million rows and 22 columns.
I'm not sure whether it's a problem with your data or not. I made a toy example on my machine, and the same command takes ~9 seconds:
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.distributed import Client
import dask
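# Client() with no arguments starts a local cluster of workers on this machine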
client = Client()
# in a notebook, evaluating `client` displays a link to the dashboard
client
# build a fake df (~2.1 GB as CSV)
# writing it with to_csv takes ~180 seconds
N = int(5e6)
df = pd.DataFrame({i: np.random.rand(N)
                   for i in range(22)})
df.to_csv("data.csv", index=False)
# the following takes ~9 seconds on my machine
dd.read_csv("data.csv").to_parquet("data_pq")
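If the real conversion still isn't using all of your cores, it may be worth checking how many partitions dask creates from the CSV, since each partition is a separate task that can run in parallel. A minimal sketch (the blocksize value below is just an illustrative assumption, not a tuned recommendation):

import dask.dataframe as dd

# read_csv splits the file into one partition per `blocksize` worth of bytes;
# more partitions means more tasks that can run in parallel across workers
dask_df = dd.read_csv("data.csv", blocksize="64MB")  # 64MB is an arbitrary example
print(dask_df.npartitions)  # number of parallel tasks for the conversion
dask_df.to_parquet("data_pq")

With ~2 GB of input, a smaller blocksize yields more partitions and therefore more parallelism, at the cost of some per-task overhead.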