Tags: compression, microsoft-fabric

Is Microsoft Fabric / Data Factory compressing any data it copies?


Looking for any data engineers working with Microsoft Fabric. I am seeing a large reduction in data size when copying from Hyperscale SQL to the Fabric Data Lake.

I know, why question a good thing?

I would generally expect normal Snappy compression when converting SQL to Parquet, but I am seeing compression when Fabric reads / ingests the data from SQL, before the Parquet sink. Is this because of the compression codec that Data Factory uses (Gzip, Deflate, BZip2)?
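
One thing I can check on my side is which codec actually ends up in the files the Copy activity writes, e.g. with pyarrow (the file name below is just a placeholder for a file landed in the Lakehouse):

```python
# Inspect the Parquet footer to see which codec was used and how much each
# row group compressed. The file name is a placeholder.
import pyarrow.parquet as pq

meta = pq.ParquetFile("part-00000.parquet").metadata
for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(0)   # first column chunk of each row group
    print(rg, col.compression, col.total_compressed_size, col.total_uncompressed_size)
```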

When copying a SQL table from Hyperscale, these are the kinds of sizes I get:

  • HYPERSCALE SQL TABLE SIZE: 3GB
  • FABRIC DATA COPY SIZE: 2.1GB
  • PARQUET FILE SIZE: 1.6GB

Why does the Fabric (Data Factory) copy come out smaller than the SQL table?


Solution

  • Delta uses Parquet to store your rows, and Parquet is a compressed, columnar file format. In Fabric, the Parquet compression is further enhanced by V-Ordering the rows in each file; see Delta Lake table optimization and V-Order, and the Spark sketch after this answer.

    Azure SQL Database supports table and index compression, as well as columnstore indexes (which also have v-ordering), but the default storage format for SQL Server is an uncompressed rowstore. This format is extremely fast to read and write, at the cost of taking more space. SQL Server also supports secondary indexes, which take additional space, while a Delta table has no secondary indexes. (The pyodbc sketch after this answer shows one way to check how your Hyperscale table is actually stored.)

    So it's not at all surprising that the data takes less space in OneLake than in Azure SQL Database.
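
If you want to see the V-Order effect yourself, here is a minimal sketch for a Fabric notebook. The table names are placeholders, and the config/property names come from the V-Order documentation linked above (they may differ slightly by runtime version):

```python
# Minimal sketch (Fabric notebook, where `spark` is the predefined SparkSession):
# write the same data with and without V-Order and compare the resulting file
# sizes in the Lakehouse.
df = spark.read.table("staging_from_hyperscale")   # hypothetical staging table

# V-Order on for this session (applied on top of the normal Parquet compression).
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
df.write.format("delta").mode("overwrite").saveAsTable("sales_vorder")

# Same data with V-Order off, for a size comparison.
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")
df.write.format("delta").mode("overwrite").saveAsTable("sales_plain")

# The behaviour can also be pinned per table via a Delta table property.
spark.sql(
    "ALTER TABLE sales_vorder "
    "SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')"
)
```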
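
On the Hyperscale side, one way to confirm how the source table is actually stored (compression setting, plus how much space secondary indexes add) is to query sys.partitions and sys.allocation_units. A sketch with pyodbc, where the connection string and table name are placeholders:

```python
# Check the compression setting and per-index size of the source table in
# Azure SQL Hyperscale. Connection string and table name are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net;Database=<your-db>;"
    "Authentication=ActiveDirectoryInteractive;"
)

sql = """
SELECT i.name AS index_name,
       i.type_desc,                      -- HEAP / CLUSTERED / NONCLUSTERED / ...
       p.data_compression_desc,          -- NONE means uncompressed rowstore
       SUM(a.used_pages) * 8 / 1024.0 AS used_mb
FROM sys.indexes i
JOIN sys.partitions p
  ON p.object_id = i.object_id AND p.index_id = i.index_id
JOIN sys.allocation_units a
  ON a.container_id = p.partition_id
WHERE i.object_id = OBJECT_ID('dbo.MyTable')
GROUP BY i.name, i.type_desc, p.data_compression_desc
ORDER BY used_mb DESC;
"""

for row in conn.cursor().execute(sql):
    print(row.index_name, row.type_desc, row.data_compression_desc, row.used_mb)
```

If the clustered index shows NONE for compression and the nonclustered indexes add a noticeable share of the used space, that accounts for most of the gap between the 3GB SQL table and the smaller OneLake copy.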