Tags: python, tensorflow, jupyter-notebook, dask, parquet

Convert a CSV file to Parquet using Dask (Jupyter kernel crashes)


I am trying to convert a somewhat sizeable CSV file to Parquet format in a Jupyter notebook. However, the kernel restarts when I try to convert it.

Since Dask checks the available memory and loads chunks of data that fit in it, this error should not happen even for larger-than-memory datasets (my guess is that the kernel crash is caused by memory overload). I am running Dask on a single machine.

The code is below.


import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # start a local distributed scheduler with default settings

merchant = dd.read_csv('/home/michael/Elo_Merchant/merchants.csv')
merchant.to_parquet('merchants.parquet')  # the kernel restarts when this line runs
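
As a point of reference, here is a minimal sketch of a variant that reads the CSV in smaller blocks and runs on Dask's single-threaded scheduler. Both blocksize and the synchronous scheduler are standard Dask options; the 64MB block size is only an illustrative value. Running without the distributed client makes any failure surface as a normal Python traceback rather than a silent kernel restart.

import dask
import dask.dataframe as dd

# Read the CSV in smaller blocks so each partition comfortably fits in memory
merchant = dd.read_csv('/home/michael/Elo_Merchant/merchants.csv', blocksize='64MB')

# Use the single-threaded scheduler so any failure shows up as a regular traceback
with dask.config.set(scheduler='synchronous'):
    merchant.to_parquet('merchants.parquet')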

UPDATE:

I ran the same code from a terminal and got these errors:

>>>merchant.to_parquet('merchants.parquet')
2019-03-06 13:22:29.293680: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
Aborted
$/usr/lib/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))

Would anyone be able to help me with this matter?

thanks

Michael


Solution

  • I found the solution to the problem: I changed the Parquet conversion engine to fastparquet (code below). I had previously installed only pyarrow; when both libraries are installed, fastparquet is picked as the default engine. I still pass the engine explicitly in the code to make the change visible, since otherwise it would be identical to the code above.

    import dask.dataframe as dd

    merchant = dd.read_csv('/home/michael/Elo_Merchant/merchants.csv')
    merchant.to_parquet('merchants.parquet', engine='fastparquet')  # works with the fastparquet engine
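
    If it helps, a quick way to confirm the conversion worked is to read the Parquet data back with the same engine. This is just an illustrative check, not part of the fix itself (fastparquet can be installed with "pip install fastparquet" if it is missing).

    import dask.dataframe as dd

    # Read the converted dataset back with fastparquet and inspect the first rows
    check = dd.read_parquet('merchants.parquet', engine='fastparquet')
    print(check.head())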
    

    Hope this helps

    Thanks

    Michael