I have stored a .csv file on Hadoop HDFS:
hadoop dfs -ls /afs
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
17/01/12 15:15:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 item
-rw-r--r-- 2 hduser supergroup 203572404 2017-01-10 12:04 /afs/Accounts.csv
I want to import this file into RStudio using SparkR.
I tried the following commands:
sc <- sparkR.session(master = "spark://MasterNode:7077", appName = "SparkR", sparkHome = "/opt/spark")
sContext <- sparkRSQL.init(sc)
library(data.table)
library(dplyr)
df <- read.df(sContext, "hdfs://MasterNode:54310/afs/Accounts.csv")
The following error occurred:
> df<- read.df(sContext, "hdfs://MasterNode:54310/afs/Accounts.csv")
Error in handleErrors(returnStatus, conn) :
No status is returned. Java SparkR backend might have failed.
In addition: Warning message:
In writeBin(requestMessage, conn) : problem writing to connection
Please help me import the Accounts.csv file into RStudio using SparkR.
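One possible source of the backend failure: the question mixes the Spark 2.x entry point (sparkR.session) with the Spark 1.x one (sparkRSQL.init), and in Spark 2.x read.df no longer takes a context as its first argument; the path comes first. A minimal sketch of the 2.x style, reusing the hostnames, ports, and paths from the question (untested here, since it needs a running cluster):

library(SparkR)

# sparkR.session() returns a SparkSession; no separate SQL context is needed.
sparkR.session(master = "spark://MasterNode:7077",
               appName = "SparkR",
               sparkHome = "/opt/spark")

# Path first, then the data source; CSV options are passed as strings.
df <- read.df("hdfs://MasterNode:54310/afs/Accounts.csv",
              source = "csv", header = "true", inferSchema = "true")

head(df)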
You can use the fread function of the data.table library to read from HDFS by piping the output of the hdfs executable. You'd have to specify the full path of the hdfs executable on your system. For instance, assuming that the path to hdfs is /usr/bin/hdfs, you can try something like this:
your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv")
(In recent versions of data.table, pass the command explicitly via the cmd argument: fread(cmd = "/usr/bin/hdfs dfs -text /afs/Accounts.csv").)
If your "Accounts.csv" is a directory (for example, output written by Spark in multiple part files), you can use a wildcard as well: /afs/Accounts.csv/*
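For instance, still assuming the hdfs executable lives at /usr/bin/hdfs, reading every part file under that directory would look like:

your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv/*")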
You can also specify the column classes. For instance:
your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv", fill = TRUE, header = TRUE,
colClasses = c("numeric", "character", ...))
I hope this helps.