Search code examples
databaseparquetdata-warehouseazure-synapse

Azure Synapse, design questions of External tables or Internal tables


I'm designing a Dataware house in Azure Synapse using SQL Pool, but I'm facing some design questions.

Context: My plan is to load Partitioned Parquet files using Azure Data Lake Storage (ADLS), then, with SQL pool create External Tables to query those files.

My questions are:

  • Is it better in terms of performance to provide the solution just with the external tables? that is, with no create internal tables neither CTAS, BCP, or copy methods from the ADLS to storage in the database.
  • Is it possible to perform partitioning in external tables? is it enough to organize the parquet by folders named by date?
  • How does affect the user concurrency to the external tables and the internal tables? some experienced recommendations?.

Thanks for your time. Josh


Solution

  • Is it better in terms of performance to provide the solution just with the external tables?

    No. Internal Tables are distributed columnstores, with multiple levels of caching, and typically out-perform external parquet tables. Internal tables additionally support batch-mode scanning, columnstore ordering, segment elimination, partition elimination, materialized views, and resultset caching.

    Is it possible to perform partitioning in external tables?

    This is not currently possible in Dedicated SQL Pools, see Folder Partition Elimination

    How does affect the user concurrency to the external tables and the internal tables?

    Concurrency is a matter of query performance. The faster your queries perform, the faster sessions give up their concurrency slot. So anything that improves query performance improves the effective concurrency (the number of concurrent users you can support with reasonable query runtime).

    Serverless SQL Pools currently have more advanced capabilities for working with data stored as Parquet or Delta in the Data Lake.