Tags: apache-spark, centos, redhat, sparkr

SparkR says it can't find function read.df


Just what the title says. Every time I fire up the SparkR shell on the RedHat machine I'm using and try to use the function read.df(), it says that it could not find that function. I'm using SparkR 2.0, if that helps.

To be more specific, here's what I tried to type:

data <- read.df(sqlContext, "/path/to/the/file", "parquet")

Edit: To clarify, here is the exact situation:

> data <- df.read("valid/path/to/parquet/file", "parquet")
Error: could not find function "df.read"


Solution

  • I figured out what the problem was, and I'm posting it in case anyone else runs into a similar issue. Basically I opened the R shell and ran install.packages("devtools"). That allowed me to install the SparkR package directly from GitHub, like this: devtools::install_github("apache/spark", subdir = "R/pkg"). That worked. There were some other little details too, like using R's setRepositories() function to make sure all repositories were enabled so devtools could be downloaded.
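
    For reference, here is a sketch of that install sequence from a fresh R shell. It assumes the SparkR sources still live under R/pkg in the apache/spark GitHub repo; adjust the repo path if the layout has changed.

    setRepositories()                # interactively enable all repositories so devtools can be found
    install.packages("devtools")     # devtools provides install_github()
    devtools::install_github("apache/spark", subdir = "R/pkg")   # install SparkR from the Spark source tree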

    I had done all of that before, though. The real problem was threefold:

    1. I had mistyped the function name. There is a lot of conflicting documentation about it across versions (something I've noticed is a bit of a trend with Spark-related endeavors; check the version before trusting any documentation!). The correct syntax is read.df("/path/to/file", "parquet"), where "parquet" can be "json" or whatever file type you're reading in.

    2. I needed to attach the SparkR package after I opened the R shell! I'm really new to R and SparkR (and honestly to 99% of what I'm trying to do), so I didn't know that R doesn't automatically load every installed package at the start of a session. Actually, it makes a lot of sense that it doesn't. So I had to type require("SparkR") at the shell prompt before I could read in any dataframes. (Note that the S in "SparkR" is capitalized; I think this can cause confusion, since in much of the googling, research, and API-combing I did to arrive at this solution, the s in SparkR was lowercase.)

    3. I hadn't initialized a SparkSession. (Duh!) Once the SparkR package is attached, this is the mandatory next step, or else you won't be able to do anything Spark-related. A session is initialized by typing sparkR.session() at the R prompt. Note that, for some reason, the s in sparkR is lowercase here! That inconsistency is really confusing, and I hope a future update fixes it. (Steps 2 and 3 are shown together in the sketch after this list.)
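
    Putting steps 2 and 3 together, a minimal sketch of getting a working session from a plain R shell (the master and appName values here are just illustrative assumptions; sparkR.session() with no arguments works too):

    library(SparkR)                                        # attach the package -- capital S here
    sparkR.session(master = "local[*]", appName = "demo")  # start the session -- lowercase s here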

    Now I'm able to read in any dataframes I want using the following syntax:

    data <- read.df("/valid/path/to/parquet/file", "parquet")
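
    The same call works for other formats too; for example, reading a JSON file (hypothetical path):

    data <- read.df("/valid/path/to/json/file", "json")
    head(data)   # show the first rows of the SparkDataFrame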