I have a CSV file that is 170 kB in size, and when I convert it to a Parquet file, the result is 1.2 MB. The data consists of 12 string columns.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
csv_filename = "../files/test.csv"
parquet_filename = '../files/sample.parquet'
chunksize = 1
pqwriter = None
for i, df in enumerate(pd.read_csv(csv_filename, delimiter='_;_', chunksize=chunksize)):
    # df = df.astype(str)
    table = pa.Table.from_pandas(df=df)
    # for the first chunk of records
    if i == 0:
        # create a parquet writer object giving it an output file
        pqwriter = pq.ParquetWriter(parquet_filename, table.schema, compression='gzip', use_dictionary=False)
    pqwriter.write_table(table)

# close the parquet writer
if pqwriter:
    pqwriter.close()
df = pd.read_parquet(parquet_filename)
print(df.memory_usage(deep=True))
Update 1:
I tried with fastparquet and the resulting file was 933 kB:
import fastparquet

for i, df in enumerate(pd.read_csv(csv_filename, delimiter='_;_', chunksize=chunksize)):
    # create the file on the first chunk, append the following ones
    fastparquet.write(parquet_filename, df, compression='gzip', append=i > 0)
Update 2:
The chunksize parameter has an impact on the file size: the larger it is, the smaller the file. With chunksize equal to 30, the file size was 76 kB.
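A sketch of how this can be measured (the output paths and the list of chunk sizes are illustrative, not taken from the code above):

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_filename = "../files/test.csv"

# Write the same CSV with several chunk sizes and compare the resulting file sizes.
for chunksize in (1, 10, 30, 100):  # illustrative values
    parquet_filename = f"../files/sample_chunksize_{chunksize}.parquet"  # hypothetical output path
    writer = None
    for df in pd.read_csv(csv_filename, delimiter='_;_', engine='python', chunksize=chunksize):
        table = pa.Table.from_pandas(df)
        if writer is None:
            writer = pq.ParquetWriter(parquet_filename, table.schema,
                                      compression='gzip', use_dictionary=False)
        writer.write_table(table)
    if writer is not None:
        writer.close()
    print(chunksize, os.path.getsize(parquet_filename), "bytes")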
This mainly boils down to using an extremely small chunk size, which disables the columnar nature (and thus all the benefits) of the Parquet format. Chunks in a Parquet file are hard breaks: no optimization is applied across two of them.
Given that 170 kB is a really small amount of data for Parquet, you shouldn't be chunking at all. A reasonable chunk size is normally one where your data yields chunks of about 128 MiB; smaller chunks make sense in some cases, but for most use cases a single chunk, or chunks of 128 MiB, is the right choice.
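For data this small, a minimal sketch (assuming the same '_;_'-delimited CSV as in the question; the output path is hypothetical) is to read the whole file at once and write it as a single chunk:

import pandas as pd

csv_filename = "../files/test.csv"
parquet_filename = "../files/sample_single_chunk.parquet"  # hypothetical output path

# Read the whole CSV at once; the multi-character delimiter needs the Python parser engine.
df = pd.read_csv(csv_filename, delimiter='_;_', engine='python')

# Write everything as one chunk so the columnar encodings see all values of a column at once.
df.to_parquet(parquet_filename, engine='pyarrow', compression='gzip')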
Inside a chunk, Parquet applies various compression and encoding techniques to store the data column by column efficiently, both in CPU time and in size. These techniques become more effective the more data they can work on. Setting the chunk size to a single-digit value removes any benefit from them and also adds overhead to the file itself, since Parquet stores a header and some metadata, such as column statistics, per column chunk. With chunksize=1 this means each single row ends up stored 3-4 times in the file, not even accounting for the extra metadata headers.
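One way to see this overhead directly is to inspect the file metadata and count the row groups the chunks became (a sketch; the file paths refer to the files written above):

import pyarrow.parquet as pq

for path in ("../files/sample.parquet", "../files/sample_single_chunk.parquet"):
    meta = pq.ParquetFile(path).metadata
    # One row group per chunk: with chunksize=1 there is one per row,
    # each carrying its own header and per-column statistics.
    print(path, "row groups:", meta.num_row_groups, "rows:", meta.num_rows)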