The file is a 2.6 GB CSV with 30 columns; I don't believe any column is wider than roughly 50 characters.
I spark.read this file with no errors.
I createOrReplaceTempView and select the top 1,000 rows, also with no errors.
I then run select count(1) against the temp view.
After 20 minutes I cancel the count(1) because I still don't have a row count.
At around the 5-minute mark the Spark UI shows about 49 MB read and roughly 2.5 million records, but it appears stuck there until I cancel.
I'm the only user on this production-grade cluster: 8 nodes and 256 GB of RAM.
What do you think I should look into? If I could at least get a count, I'd feel confident moving on to saving the data to Delta with partitions.
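For reference, here is a sketch of the workflow described above. The path, view name, and column names are placeholders, and the 30-string-column schema is an assumption; supplying an explicit schema avoids the extra full-file pass that `inferSchema=True` triggers, which is one common reason a count on a CSV of this size takes far longer than expected:

```python
# Sketch only: path, view name, and schema are placeholders, not from the original post.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Assumed schema: 30 string columns; replace names/types with the real ones.
schema = StructType([StructField(f"col{i}", StringType()) for i in range(30)])

df = (
    spark.read
    .option("header", "true")
    .option("multiLine", "false")  # multiLine=true forces single-threaded parsing of the file
    .schema(schema)                # explicit schema: no inference pass over the 2.6 GB file
    .csv("/path/to/file.csv")      # placeholder path
)

df.createOrReplaceTempView("tempView")
spark.sql("SELECT COUNT(1) FROM tempView").show()
```

If the count still stalls with an explicit schema, checking the Spark UI for a single long-running task (a sign the file is not being split across executors) would be a reasonable next step.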
Try the following things: