Tags: dataframe, databricks, azure-databricks

2.6 GB CSV file on a Databricks cluster: after 20 minutes I can't even get a count(1)


The file is a 2.6 GB CSV with 30 columns; I don't believe any column is wider than about 50 characters.

I spark.read this file with no errors.

I createOrReplaceTempView and select the top 1000 rows, also with no errors.

I then select count(1) from the temp view.

After 20 minutes I cancel the count(1) because I still don't have the row count.
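
For reference, a minimal sketch of the sequence above as it would run in a Databricks notebook; the file path, read options, and view name are my own placeholders, not from the post:

    # `spark` is the SparkSession preconfigured in a Databricks notebook.
    # The path and view name below are hypothetical placeholders.
    df = spark.read.csv("/mnt/data/big_file.csv", header=True)

    df.createOrReplaceTempView("tempView")
    spark.sql("SELECT * FROM tempView LIMIT 1000").show()  # returns quickly

    spark.sql("SELECT count(1) FROM tempView").show()      # hangs 20+ minutes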

At around the 5-minute mark I can see about 49 MB read and roughly 2.5 million records, but the Spark UI seems stuck at that point until I cancel.

I'm the only one on this production-grade cluster with 8 nodes and 256 GB of RAM.

What do you think I should go after? If I could at least get a count, I might feel like I can go after saving the data to Delta with partitions.


Solution

  • Try the following things (see the sketch after this list):

    1. Cache the data before registering the temp view.
    2. Repartition the data before registering the temp view.
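
A minimal sketch of both suggestions, under the same assumptions as above (the path, view name, and partition count are illustrative, not prescribed):

    # Repartition so the scan and parse work is spread across executors;
    # a single large (or unsplittable) CSV may otherwise be read by very
    # few tasks. The partition count here is an illustrative guess.
    df = spark.read.csv("/mnt/data/big_file.csv", header=True)
    df = df.repartition(64)

    # Cache and force materialization with an action, so the expensive
    # CSV parse happens once; later queries against the view reuse it.
    df.cache()
    print(df.count())

    df.createOrReplaceTempView("tempView")
    spark.sql("SELECT count(1) FROM tempView").show()  # should now return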