r, apache-spark, sparklyr

Equivalent of "str()" (describes a data frame) for a Spark table using sparklyr


My question boils down to: what is the sparklyr equivalent of R's str() command?

I am opening a large table (from a file), call it my_table, in Spark from R, using the sparklyr package.

How can I describe the table: column names and types, a few example values, etc.?

Apologies in advance for what must be a very basic question, but I did search for it and checked RStudio's sparklyr cheatsheet without finding the answer.


Solution

  • Let's use the mtcars dataset and copy it to a local Spark instance as an example:

    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    tbl_cars <- dplyr::copy_to(sc, mtcars, "mtcars")
    

    You now have several options. Here are two of them, each slightly different; choose based on your needs:

    1. Collect the first row into R (where it becomes a standard R data frame) and call str() on it:

    str(tbl_cars %>% head(1) %>% collect())
    

    2. Invoke the schema method on the underlying Spark DataFrame and look at the result:

    spark_dataframe(tbl_cars) %>% invoke("schema")
    

    This will give something like:

    StructType(StructField(mpg,DoubleType,true), StructField(cyl,DoubleType,true), StructField(disp,DoubleType,true), StructField(hp,DoubleType,true), StructField(drat,DoubleType,true), StructField(wt,DoubleType,true), StructField(qsec,DoubleType,true), StructField(vs,DoubleType,true), StructField(am,DoubleType,true), StructField(gear,DoubleType,true), StructField(carb,DoubleType,true))
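
If the raw StructType output is hard to read, a possible alternative (not part of the original answer) is sparklyr's sdf_schema(), which returns the same name and type information as an R list, together with dplyr's glimpse() for a few sample values. The sketch below assumes a local Spark installation is available and reuses the mtcars example; the variable names schema and schema_df are only illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")
tbl_cars <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# sdf_schema() returns a named list with one entry per column,
# each holding the column name and its Spark type.
schema <- sdf_schema(tbl_cars)

# Flatten the list into a small local data frame for easier reading.
schema_df <- data.frame(
  column = vapply(schema, function(field) field$name, character(1)),
  type = vapply(schema, function(field) field$type, character(1)),
  row.names = NULL
)
print(schema_df)

# glimpse() prints each column with its type and the first few values.
glimpse(tbl_cars)

# Number of rows and columns, computed on the Spark side.
print(sdf_dim(tbl_cars))

spark_disconnect(sc)
```

Unlike collecting a row and calling str(), sdf_schema() reports Spark types (such as DoubleType) rather than the R types the data would have after collection, so which one is more useful depends on whether you care about the Spark side or the R side.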