Tags: pyspark, apache-spark-sql, rdd, databricks

How to un-cache a dataframe?


I'm seeing strange behavior in PySpark (Databricks): a SQL query in a new notebook returns results, but when I apply LIMIT 100 to the SQL, or limit(100) to the DataFrame, I get no results (whereas I expect to see the first 100 rows).

A separate notebook runs the same SELECT * FROM VIEW_XYZ, applies limit(100), and then calls cache() on the DataFrame. Could that be a factor? They are totally separate notebooks, but the same SQL query.
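Roughly, that other notebook does something like this (VIEW_XYZ stands in for the actual view name; spark is the SparkSession that Databricks notebooks provide):

    df = spark.sql("SELECT * FROM VIEW_XYZ")  # same query as the failing notebook
    df = df.limit(100).cache()                # limit, then cache the DataFrame
    df.show()                                 # this notebook sees rows just fine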

My hypothesis is that cache() has overzealously cached the query SELECT * FROM VIEW_XYZ LIMIT 100 along with its results from almost two weeks ago. That's roughly how long this query has been returning nothing, as my notebook runs hourly.

To test this hypothesis, I would like to "un-cache" the query/dataframe. How do I un-cache it?

EDIT: Adding the database name to the view dbname.viewname fixed it. I don't know why or how.
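For reference, the fix amounted to fully qualifying the view name (dbname is a placeholder for the actual database):

    # Before: unqualified view name -- LIMIT returned no rows
    df = spark.sql("SELECT * FROM VIEW_XYZ LIMIT 100")

    # After: database-qualified view name -- works as expected
    df = spark.sql("SELECT * FROM dbname.VIEW_XYZ LIMIT 100")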



Solution

  • I don't think the cache has anything to do with your problem.

    To uncache everything, you can use spark.catalog.clearCache(). Alternatively, try restarting the cluster: the cache lives on the cluster, so after a restart it will be empty, and you can verify that it's not a cache-related issue (see the sketch below).

    All your notebooks might be attached to the same cluster, which could be producing random errors because of an old runtime version.

    Try creating a new cluster with the latest runtime; old clusters on old runtimes do act weird sometimes, as do notebooks with too much output and too many commands. Start using the new cluster and you may never see this problem again.

    As general advice, use an LTS version of the runtime. You can read more about it here.
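    A minimal sketch of the un-caching options mentioned above, assuming the standard spark session available in a Databricks notebook (dbname.VIEW_XYZ is a placeholder):

        # Drop every cached table/DataFrame on the cluster
        spark.catalog.clearCache()

        # Or un-persist just the one DataFrame that was cached
        df = spark.sql("SELECT * FROM dbname.VIEW_XYZ").limit(100).cache()
        df.unpersist(blocking=True)  # blocking=True waits until the blocks are freed

        # You can also check whether a given table/view is currently cached
        print(spark.catalog.isCached("dbname.VIEW_XYZ"))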