python, vega-lite, apache-arrow

Arrow file size is the same as CSV?


I am trying to save a dataframe in the .arrow format, mainly to get a smaller file than the CSV, so that I can use that file with vega-lite.

I am using Python:

import pandas
import pyarrow as pa

csv = "C:/Users/mimoune.djouallah/data.csv"
arrow = "C:/Users/mimoune.djouallah/file.arrow"

# Load the CSV and convert it to an Arrow table
dataset = pandas.read_csv(csv)
table = pa.Table.from_pandas(dataset)

# Write the table in the Arrow file format
writer = pa.RecordBatchFileWriter(arrow, table.schema)
writer.write(table)
writer.close()

I was expecting the Arrow file to be smaller than the CSV, but for now the Arrow file is slightly bigger.

I also tried exporting to Parquet, and the results are as expected:

original CSV: 4.4 MB
Arrow: 4.9 MB
Parquet: 1.6 MB
Power BI (just for reference): 1.7 MB


Solution

  • The Arrow format does not aim to optimise storage size; it aims to optimise read and write performance. In contrast to CSV, the data is stored in binary form, which removes the overhead of parsing it. Because performance is the priority, the data is, by default, neither compressed nor encoded.

    If you want to store the data efficiently but in a smaller file, have a look at Apache Parquet. It stores the data in a columnar layout similar to Arrow's, but applies encoding and compression on top to reduce the storage size.