Tags: dataframe, databricks, azure-databricks

2.6 GB CSV file on a Databricks cluster: after 20 minutes I can't even get a count(1)


The file is a 2.6 GB CSV with 30 columns; I don't believe any column is wider than about 50 characters.

I spark.read this file with no errors.

I createOrReplaceTempView and select the top 1000 rows, also with no errors.

I then select count(1) from the temp view.

After 20 minutes I cancel the count(1) because I still don't have the row count.
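
For reference, a minimal sketch of the sequence above as it would run in a Databricks notebook; the file path, read options, and view name are my own placeholders, not from the post:

    # `spark` is the SparkSession preconfigured in a Databricks notebook.
    # The path and view name below are hypothetical placeholders.
    df = spark.read.csv("/mnt/data/big_file.csv", header=True)

    df.createOrReplaceTempView("tempView")
    spark.sql("SELECT * FROM tempView LIMIT 1000").show()  # returns quickly

    spark.sql("SELECT count(1) FROM tempView").show()      # hangs 20+ minutes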

At around the 5-minute mark I can see about 49 MB read and roughly 2.5 million records, but the Spark UI seems stuck at that point until I cancel.

I'm the only one on this production-grade cluster with 8 nodes and 256 GB of RAM.

What do you think I should go after? If I could at least get a count, I might feel like I can go after saving the data to Delta with partitions.


Solution

  • Try the following things (see the sketch after this list):

    1. Cache the data before registering the temp view.
    2. Repartition the data before registering the temp view.
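
A minimal sketch of both suggestions, under the same assumptions as above (the path, view name, and partition count are illustrative, not prescribed):

    # Repartition so the scan and parse work is spread across executors;
    # a single large (or unsplittable) CSV may otherwise be read by very
    # few tasks. The partition count here is an illustrative guess.
    df = spark.read.csv("/mnt/data/big_file.csv", header=True)
    df = df.repartition(64)

    # Cache and force materialization with an action, so the expensive
    # CSV parse happens once; later queries against the view reuse it.
    df.cache()
    print(df.count())

    df.createOrReplaceTempView("tempView")
    spark.sql("SELECT count(1) FROM tempView").show()  # should now return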