I have tables with indicator columns that hold the values 0 or 1 (0 denoting false, 1 denoting true). I assumed that converting the column type from number to boolean would reduce the size of the written table. As a small experiment I wrote two tables, both containing an identical ID column, where one stores the indicator as numeric values and the other as boolean values:
import random

random.seed(30)
n = 1000000

# Identical IDs for both tables, plus the same 0/1 indicators as integers and as booleans
ids = [random.randint(0, n * 10) for i in range(0, n)]
indicators = [random.randint(0, 1) for i in range(0, n)]
bools = [bool(indicator) for indicator in indicators]

# One DataFrame keeps the numeric indicator, the other the boolean equivalent
data_indicators = zip(ids, indicators)
df_indicators = spark.createDataFrame(data_indicators, ["id", "indicator"])

data_bools = zip(ids, bools)
df_bools = spark.createDataFrame(data_bools, ["id", "bool"])
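I then wrote both DataFrames out as Delta tables, roughly along these lines (a sketch of the write step, which I haven't reproduced exactly here; the target paths are placeholders):

# Sketch of the write step; paths are placeholders
df_indicators.write.format("delta").mode("overwrite").save("/tmp/delta/indicators")
df_bools.write.format("delta").mode("overwrite").save("/tmp/delta/bools")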
However, both written tables (Delta format) turn out to be 5.1 MiB in size (the numeric column is stored as BigInt). How is this possible? I expected the table with the boolean column to be smaller.
One optimization technique in Parquet, among many others, is bit-packing. If you have a BIGINT column that holds only small numbers, Parquet only needs to store the N significant bits that cover the range; for the [0, 1] range a single bit per value is enough, so the integer column and the boolean column end up roughly the same size on disk. See this presentation.
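You can observe this outside of Spark with plain Parquet, which is what Delta stores underneath. A minimal sketch, assuming pyarrow is installed and using local file paths of my own choosing:

import random

import pyarrow as pa
import pyarrow.parquet as pq

random.seed(30)
n = 1_000_000
indicators = [random.randint(0, 1) for _ in range(n)]

# Write the same 0/1 data once as 64-bit integers and once as booleans
pq.write_table(pa.table({"indicator": indicators}), "indicator_int.parquet")
pq.write_table(pa.table({"indicator": [bool(i) for i in indicators]}), "indicator_bool.parquet")

for path in ("indicator_int.parquet", "indicator_bool.parquet"):
    col = pq.ParquetFile(path).metadata.row_group(0).column(0)
    # The column encodings include the RLE/bit-packed hybrid, and the
    # compressed sizes of the two files come out very close to each other
    print(path, col.encodings, col.total_compressed_size)

The printed encodings and compressed sizes show that the 0/1 integer column is packed down to roughly the same footprint as the boolean column, which is why the Delta tables end up the same size.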