Tags: pyspark, databricks, azure-databricks, azure-data-lake-gen2, external-tables

Count on an external table in Azure Data Lake Storage is taking too long


I have created an external table over Azure Data Lake Storage Gen2.

The container has about 200K JSON files.

The table over the JSON files is created with:

CREATE EXTERNAL TABLE IF NOT EXISTS dbo.table(   
    ComponentInfo STRUCT<ComponentHost: STRING, ComponentId: STRING, ComponentName: STRING, ComponentVersion: STRING, SubSystem: STRING>,
    CorrelationId STRING,   
    Event STRUCT<Category: STRING, EventName: STRING, MessageId: STRING, PublishTime: STRING, SubCategory: STRING>,   
    References STRUCT<CorrelationId: STRING>) 
    USING org.apache.spark.sql.json OPTIONS ('multiLine' = 'true') 
    LOCATION 'dbfs:/mnt/mnt'

Counting takes a very long time to run and is still at stage 62 with 754 tasks. Loading the top 200 rows is fine, but is there something wrong with the setup that needs to be addressed? Should the table be partitioned?

select count(*) from dbo.table

Solution

  • This is the classic "too many small files" problem on Databricks.

    https://www.databricks.com/session_na21/degrading-performance-you-might-be-suffering-from-the-small

    Combining the files into larger ones fixed it; the count then runs fine (a compaction sketch is below).
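
    Here is a minimal compaction sketch in PySpark. The original answer only says the files were combined; the output path dbfs:/mnt/mnt-compacted, the output format, and the target file count below are illustrative assumptions, not part of the original post.

    # Compaction sketch (assumed paths and file count, not from the
    # original answer). Reads the many small multi-line JSON files
    # once, then rewrites them as a handful of larger files so that
    # file listing and scanning become cheap.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .option("multiLine", "true")
          .json("dbfs:/mnt/mnt"))

    # coalesce(64) is an arbitrary target; tune it so each output
    # file lands roughly in the 128 MB - 1 GB range.
    (df.coalesce(64)
       .write
       .mode("overwrite")
       .parquet("dbfs:/mnt/mnt-compacted"))

    The external table can then be re-created over the compacted location (USING parquet, LOCATION 'dbfs:/mnt/mnt-compacted'), after which select count(*) only has to open a few dozen files instead of 200K.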