I want to start working with SparkR. I followed tutorials, but I get the error below:
library(SparkR)
Sys.setenv(SPARK_HOME="/Users/myuserhome/dev/spark-2.2.0-bin-hadoop2.7")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
spark <- sparkR.session(appName = "mysparkr", Sys.getenv("SPARK_HOME"), master = "local[*]")
csvPath <- "file:///Users/myuserhome/dev/spark-data/donation"
mySparkDF <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "?")
mySparkDF.show()
But I get:
Error in mySparkDF.show() : could not find function "mySparkDF.show"
Not sure what I'm doing wrong. In addition, I don't have code completion for the Spark functions like read.df(...).
Also, if I try
show(describe(mySparkDF))
or
show(summary(mySparkDF))
I get the schema metadata in the result rather than the expected describe output:
SparkDataFrame[summary:string, id_1:string, id_2:string, cmp_fname_c1:string, cmp_fname_c2:string, cmp_lname_c1:string, cmp_lname_c2:string, cmp_sex:string, cmp_bd:string, cmp_bm:string, cmp_by:string, cmp_plz:string]
Is there anything I'm doing wrong?
show is not used in such a way in SparkR, nor does it serve the same purpose as the same-named command in PySpark. The mySparkDF.show() error comes from the fact that R has no object.method() call syntax: R parses mySparkDF.show as the name of a single function, which does not exist. In SparkR you call functions on the data frame instead, and to display rows you should use either head or showDF:
df <- as.DataFrame(faithful)
show(df)
# result:
SparkDataFrame[eruptions:double, waiting:double]
head(df)
# result:
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
showDF(df)
# result:
+---------+-------+
|eruptions|waiting|
+---------+-------+
| 3.6| 79.0|
| 1.8| 54.0|
| 3.333| 74.0|
| 2.283| 62.0|
| 4.533| 85.0|
| 2.883| 55.0|
| 4.7| 88.0|
| 3.6| 85.0|
| 1.95| 51.0|
| 4.35| 85.0|
| 1.833| 54.0|
| 3.917| 84.0|
| 4.2| 78.0|
| 1.75| 47.0|
| 4.7| 83.0|
| 2.167| 52.0|
| 1.75| 62.0|
| 4.8| 84.0|
| 1.6| 52.0|
| 4.25| 79.0|
+---------+-------+
only showing top 20 rows
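
The same pattern answers the describe/summary part of the question: describe (and, in SparkR 2.2, summary) returns a new SparkDataFrame holding the statistics, so calling show on that result again prints only its schema. Apply showDF or head (or collect) to it to see the actual numbers. A minimal sketch, reusing the faithful data frame from above (the stats name is just illustrative; your donation data frame works the same way):

df <- as.DataFrame(faithful)
stats <- describe(df)   # a SparkDataFrame with rows count, mean, stddev, min, max
showDF(stats)           # prints the statistics as a table
collect(stats)          # or pull them into a local R data.frame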