rhadoop

Error when running wordcount R example code on Hadoop


R wordcount example code:

library(rmr2) 
# map: split each input line on whitespace and emit a (word, 1) pair per word
map <- function(k, lines) {
    words.list <- strsplit(lines, '\\s')
    words <- unlist(words.list)
    return(keyval(words, 1))
}
# reduce: sum the counts emitted for each word
reduce <- function(word, counts) {
    keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
    mapreduce(input = input, output = output, input.format = "text",
              map = map, reduce = reduce)
}
# clear any previous output, set up HDFS paths, and run the job
system("/opt/hadoop/hadoop-2.5.1/bin/hadoop fs -rm -r /wordcount/out")
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
wordcount(hdfs.data, hdfs.out)
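As an aside, the map and reduce functions can be sanity-checked without touching the cluster: rmr2 ships a "local" backend that runs the same job in-process. A minimal sketch (the sample lines are made up, and it calls mapreduce() directly because the wordcount() wrapper above forces text input):

# Run the same map/reduce in-process; the "local" backend uses plain files, not HDFS.
rmr.options(backend = "local")
sample.in <- to.dfs(keyval(NULL, c("hello hadoop", "hello r")))  # made-up input
result <- from.dfs(mapreduce(input = sample.in, map = map, reduce = reduce))
print(result)
rmr.options(backend = "hadoop")  # switch back to the real cluster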

When I execute the last statement of the R code, the wordcount(hdfs.data, hdfs.out) call, it gives the following error messages.

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

After the error, it displays:

INFO mapreduce.Job:  map 100% reduce 100%

and

ERROR streaming.StreamJob: Job not Successful! Streaming Command Failed!

The output folder is created in HDFS, but no result is generated. Any idea what might be causing the problem?

Update 1:

I found an error log provided by Hadoop for this specific job at localhost:8042:

Dec 11, 2014 3:26:38 PM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
Dec 11, 2014 3:26:40 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
Dec 11, 2014 3:26:40 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Dec 11, 2014 3:26:40 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
Dec 11, 2014 3:26:40 PM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Dec 11, 2014 3:26:40 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
Dec 11, 2014 3:26:43 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
Dec 11, 2014 3:26:45 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Does anyone know what the issue is?

Update 2:

I found extra logging information at $HADOOP_HOME/logs/userlogs/[application_id]/[container_id]/stderr:

...
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
  error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Warning in FUN(c("base", "methods", "datasets", "utils", "grDevices", "graphics",  :
can't load rhdfs
Loading required package: rmr2
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
there is no package called ‘stringr’
...
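The first error means rhdfs could not see the HADOOP_CMD environment variable in the R process spawned by Hadoop streaming; the second means the stringr package is not visible to that process either. For reference, a minimal sketch of the environment rhdfs and rmr2 expect before they are loaded (HADOOP_STREAMING is the usual companion variable for rmr2; the jar path below is an assumption based on the Hadoop 2.5.1 layout used above):

# rhdfs refuses to load without HADOOP_CMD; rmr2 also reads HADOOP_STREAMING.
# Paths follow the /opt/hadoop/hadoop-2.5.1 install above; adjust to your layout.
Sys.setenv(HADOOP_CMD = "/opt/hadoop/hadoop-2.5.1/bin/hadoop")
Sys.setenv(HADOOP_STREAMING =
    "/opt/hadoop/hadoop-2.5.1/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar")
library(rhdfs)
hdfs.init()   # initialise rhdfs' connection to HDFS
library(rmr2)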

Solution

  • After taking a deeper look into the error logs, it turned out I had installed the R libraries at the user level, when they should have been installed at the system level, where the Hadoop task processes (which run as other users) can load them. Details on how to install an R library at the system level can be found in this thread. (The devtools package may come in handy, and remember to run R under sudo; alternatively, use sudo R CMD INSTALL [package_name]. A sketch of the install is given after this list.)

    You can double-check where a package is installed with system.file(package="[package_name]"), though this only reports the package's location in the first library path that contains it, so a stale user-level copy can shadow the system-level one. I therefore highly recommend removing previously installed user-level libraries (see the check after this list).

    Run the job a few more times and double-check the stderr log to make sure the packages are installed correctly in the R system library. That log is helpful, but no one had pointed out its actual location before :-(
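To make that check concrete, a short sketch (the package names come from the stderr log above; the user-library path is the common per-user default on Linux and is an assumption, so adjust it to yours):

# Where does R resolve each package from? "" means it is not installed.
for (p in c("rmr2", "rhdfs", "stringr")) {
    cat(p, "->", system.file(package = p), "\n")
}
print(.libPaths())  # library search path; user libraries shadow system ones

# Remove the user-level copy so only the system-level install remains.
remove.packages("stringr", lib = "~/R/x86_64-pc-linux-gnu-library/3.1")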
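And the system-level install itself, as referenced in the first bullet: a sketch, assuming it is repeated on every node that runs tasks. The CRAN package list is the missing package from the log plus rmr2's usual dependencies, and the tarball names are placeholders:

# Start R as root so packages land in the system library
# (e.g. /usr/lib/R/library) rather than a per-user one:
#   sudo R
install.packages(c("stringr", "functional", "plyr", "reshape2",
                   "Rcpp", "RJSONIO", "digest", "caTools"),
                 repos = "https://cran.r-project.org")
# rmr2 and rhdfs come from the RHadoop distribution, not CRAN;
# install them from a shell with:
#   sudo R CMD INSTALL rmr2_[version].tar.gz
#   sudo R CMD INSTALL rhdfs_[version].tar.gz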