Tags: apache-spark, pyspark, databricks, parquet

Most efficient way to check the length of a parquet table in dbfs with pyspark?


I have a table on dbfs that I can read with pyspark, but I only need to know its length (number of rows). I know I could just read the file and call table.count(), but that would take some time.
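
For reference, the baseline I am describing looks roughly like this (the path is a placeholder, not my actual table location):

    # Baseline sketch: read the parquet table and count the rows.
    # "dbfs:/path/to/table" is a placeholder path.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    table = spark.read.parquet("dbfs:/path/to/table")
    print(table.count())  # launches a Spark job over the whole table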

Is there a better way to solve this?


Solution

  • I am afraid not.

    Since you are using dbfs, I assume you are using the Delta format with Databricks. So, in theory, you could check the metastore (see the sketch below), but:

    "The metastore is not the source of truth about the latest information of a Delta table."

    https://docs.delta.io/latest/delta-batch.html#control-data-location
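
    For illustration only, here is a rough sketch of what checking the metastore could look like (my_table is a placeholder name for a table registered in the metastore; the Statistics entry only exists if statistics were computed earlier, e.g. with ANALYZE TABLE, and as the quote above says it may not reflect the latest state of the Delta table):

        # Hedged sketch: read size / row-count statistics from the metastore.
        # "my_table" is a placeholder; the statistics appear only if they were
        # previously computed and can be stale for a Delta table.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        detail = spark.sql("DESCRIBE TABLE EXTENDED my_table")
        detail.filter("col_name = 'Statistics'").show(truncate=False)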