I'm seeing strange behavior in PySpark (Databricks): a SQL query in a new notebook returns results, but when I apply `LIMIT 100` to the SQL, or `limit(100)` to the DataFrame, I get no results (whereas I expect to see the first 100 rows).
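A minimal sketch of what I mean (in Databricks `spark` is already provided; `VIEW_XYZ` stands in for the real view):

```python
# Returns rows as expected:
spark.sql("SELECT * FROM VIEW_XYZ").show()

# Both of these return no rows, even though I expect the first 100:
spark.sql("SELECT * FROM VIEW_XYZ LIMIT 100").show()
spark.sql("SELECT * FROM VIEW_XYZ").limit(100).show()
```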
In a separate notebook that runs the same `SELECT * FROM VIEW_XYZ`, I apply `limit(100)` and then `cache()` the DataFrame, which may be a factor. They are totally separate notebooks, but the same SQL query; see the sketch below.
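Roughly, the other notebook does this (a sketch, not the exact code):

```python
# The other (hourly) notebook, approximately:
df = spark.sql("SELECT * FROM VIEW_XYZ").limit(100)
df.cache()   # mark the DataFrame for caching; caching is lazy
df.count()   # a subsequent action actually materializes the cache
```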
My hypothesis is that `cache()` has overzealously cached the query `SELECT * FROM VIEW_XYZ LIMIT 100` along with its results from almost two weeks ago. That's about how long this query has been returning "nothing" (my notebook runs hourly).
To test this hypothesis, I would like to "un-cache" the query/dataframe. How do I un-cache it?
EDIT: Adding the database name to the view (`dbname.viewname`) fixed it. I don't know why or how.
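In other words (with `dbname` standing in for the actual database):

```python
# Before: intermittently returned no rows
spark.sql("SELECT * FROM VIEW_XYZ LIMIT 100").show()

# After: works, with the view qualified by its database
spark.sql("SELECT * FROM dbname.VIEW_XYZ LIMIT 100").show()
```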
I don't think `cache` has anything to do with your problem.
To uncache everything you can use `spark.catalog.clearCache()`. Or try restarting the cluster: `cache` persists data on the cluster, so after a restart the cache will be empty, and you can verify that it's not a cache-related issue.
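Concretely, assuming `df` is the DataFrame you cached in the other notebook:

```python
# Uncache one specific DataFrame:
df.unpersist()

# Or drop every cached table/DataFrame on the cluster:
spark.catalog.clearCache()
```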
All your notebooks might be attached to the same cluster, which might be giving random errors because of an old runtime version. Try creating a new cluster with the latest runtime; old clusters with old runtimes do act weird sometimes. Notebooks with too much output and too many commands can also behave like this. Start using the new cluster and you might never see this problem again.
As general advice, use an LTS version of the runtime. You can read more about it here.