Similar to this code snippet that lists the memory usage of objects in the local R environment, is there a command to see the memory usage of the DataFrames available in a Spark connection? E.g. something similar to src_tbls(sc), which currently only lists all DataFrames but not their memory usage.
First of all, you have to remember that data structures used in Spark are lazy by default. Unless they are cached, there is no data-related storage overhead. The cache itself is ephemeral - depending on the StorageLevel, data can be evicted, lost as a result of a failure, or dropped when a node is decommissioned.
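In sparklyr, for example, a table only occupies executor memory once it is cached explicitly. A minimal sketch, using tbl_cache / tbl_uncache and mtcars purely for illustration:

library(sparklyr)

sc <- spark_connect(master = "local")

# memory = FALSE registers the table without caching it,
# so no storage overhead is incurred yet.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", memory = FALSE)

# Only an explicit cache makes the table occupy executor memory ...
tbl_cache(sc, "mtcars")

# ... and it can be dropped again at any time.
tbl_uncache(sc, "mtcars")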
You also have to remember that Spark SQL uses compressed columnar storage, so memory usage can be affected by the distribution of the data.
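The columnar cache can be tuned when creating the connection; a sketch using the standard spark.sql.inMemoryColumnarStorage.* options (the values shown are just the Spark defaults):

config <- spark_config()
# Compress the in-memory columnar cache (on by default).
config$spark.sql.inMemoryColumnarStorage.compressed <- "true"
# Number of rows per column batch; larger batches compress better
# but need more memory while the cache is being built.
config$spark.sql.inMemoryColumnarStorage.batchSize <- 10000
sc <- spark_connect(master = "local", config = config)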
If you're interested in total memory usage as seen by the operating system, you should use a proper monitoring solution instead, like Ganglia or Munin.
That being said, one can access information about the current storage status using the SparkContext:
sc <- spark_connect(...)

# Returns one RDDInfo entry per cached RDD, including the RDDs
# backing cached Spark SQL tables.
sc %>%
  spark_context() %>%
  invoke("getRDDStorageInfo")
or by querying the Spark UI:
# uiWebUrl returns an Option[String], hence the additional "get".
url <- sc %>% spark_context() %>% invoke("uiWebUrl") %>% invoke("get")

browseURL(paste(url, "storage", sep = "/"))
or the REST API:
app_id <- sc %>% spark_context() %>% invoke("applicationId")

httr::GET(paste(
  url, "api", "v1", "applications", app_id, "storage", "rdd", sep = "/"
))
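The response is JSON, so the per-RDD memory figures can be read directly. A sketch assuming the url and app_id objects from above; memoryUsed and diskUsed are in bytes:

resp <- httr::GET(paste(
  url, "api", "v1", "applications", app_id, "storage", "rdd", sep = "/"
))

# Parse the JSON body; one entry per cached RDD, with fields such as
# id, name, numCachedPartitions, memoryUsed and diskUsed.
jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))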