python, pandas, pyspark, parquet, feather

Is there an efficient way of changing a feather file to a parquet file?


I have a big Feather file that I want to convert to Parquet so that I can work with PySpark. Is there a more efficient way to change the file type than the following:

import pandas as pd

df = pd.read_feather('file.feather').set_index('date')

df_parquet = df.astype(str)
df_parquet.to_parquet("path/file.gzip",
                      compression='gzip')

Since the DataFrame df kills my memory, I'm looking for alternatives. As of this post, I understand that I can't read Feather files from PySpark directly.


Solution

  • With the code you posted, you are doing the following conversions:

    1. Load the data from disk into RAM; Feather files are already in the Arrow format.
    2. Convert the Arrow data into a pandas DataFrame.
    3. Convert the pandas DataFrame back into Arrow.
    4. Serialize the Arrow data into Parquet.

    Steps 2-4 are each expensive. You will not be able to avoid step 4, but by keeping the data in Arrow instead of taking the detour through pandas, you can avoid steps 2 and 3 with the following code snippet:

    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # Read the Feather file directly into an Arrow Table (no pandas involved).
    table = feather.read_table("file.feather")
    # Serialize the Arrow Table straight to Parquet.
    pq.write_table(table, "path/file.parquet")
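
    Once the data is stored as Parquet, PySpark can load it directly. A minimal sketch (the path and the bare session setup here are illustrative, not taken from your code):

    from pyspark.sql import SparkSession

    # Illustrative session setup; adapt to your environment.
    spark = SparkSession.builder.getOrCreate()

    sdf = spark.read.parquet("path/file.parquet")
    sdf.printSchema()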
    

    A minor point, but you should avoid the .gzip ending for Parquet files. A .gzip / .gz ending indicates that the whole file is compressed with gzip and can be unpacked with gunzip. That is not the case for gzip-compressed Parquet files: the Parquet format compresses individual data segments and leaves the metadata uncompressed, which yields nearly the same compression ratio at a much higher compression speed. The compression codec is therefore an implementation detail that is not visible from the outside, and the file should simply use the .parquet extension.
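
    If you do want gzip compression inside the Parquet file, you can request it explicitly while keeping the .parquet extension. A small sketch using pyarrow's compression argument to write_table (it accepts codecs such as "gzip" or "snappy"; "snappy" is the default):

    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    table = feather.read_table("file.feather")
    # Compression is applied per column chunk inside the file,
    # so the extension stays .parquet regardless of the codec.
    pq.write_table(table, "path/file.parquet", compression="gzip")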