Tags: r, csv, hadoop, hdfs, sparkr

Importing a CSV file into RStudio from HDFS using SparkR


I have stored a .csv file on Hadoop HDFS:

hadoop dfs -ls /afs
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

17/01/12 15:15:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 item
-rw-r--r--   2 hduser supergroup  203572404 2017-01-10 12:04 /afs/Accounts.csv

I want to import this file into RStudio using SparkR.

I tried the following commands:

sc<-sparkR.session(master = "spark://MasterNode:7077",appName = "SparkR",sparkHome = "/opt/spark")
sContext<- sparkRSQL.init(sc)
library(data.table)
library(dplyr)

df<- read.df(sContext, "hdfs://MasterNode:54310/afs/Accounts.csv")

The following error occurred:

> df<- read.df(sContext, "hdfs://MasterNode:54310/afs/Accounts.csv")
Error in handleErrors(returnStatus, conn) : 
  No status is returned. Java SparkR backend might have failed.
In addition: Warning message:
In writeBin(requestMessage, conn) : problem writing to connection

Please help me import the Accounts.csv file into RStudio using SparkR.


Solution

  • You can use the fread function from the data.table package to read from HDFS. You have to specify the path of the hdfs executable on your system (you can find it with which hdfs). For instance, assuming the path to hdfs is /usr/bin/hdfs, you can try something like this:

    your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv")
    
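    Note: newer data.table releases prefer that shell commands be passed via the cmd argument; giving the command as the first argument still works but may print a warning. A minimal sketch of the same read, again assuming the hdfs binary is at /usr/bin/hdfs:

    library(data.table)
    # fread runs the given command in a shell and parses its standard output
    your_table <- fread(cmd = "/usr/bin/hdfs dfs -text /afs/Accounts.csv")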

    If your "Accounts.csv" is actually a directory, you can use a wildcard as well: /afs/Accounts.csv/*. You can also specify the column classes. For instance:

    your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv", fill = TRUE, header = TRUE,
                        colClasses = c("numeric", "character", ...))
    
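    If you still want a SparkDataFrame in SparkR afterwards (as in the question), one option is to convert the locally loaded table; this is only sensible if the data fits in the driver's memory. A rough sketch using SparkR 2.x, where your_table is the data.table read above and the session settings are taken from the question:

    library(SparkR)
    # start (or reuse) the SparkR session from the question
    sparkR.session(master = "spark://MasterNode:7077", appName = "SparkR", sparkHome = "/opt/spark")
    # convert the local table into a SparkDataFrame
    accounts_sdf <- as.DataFrame(as.data.frame(your_table))
    head(accounts_sdf)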

    I hope this helps.