Sorting Data using RHadoop


I'm fairly new to Hadoop and RHadoop. I'm trying to sort data in a MapReduce job using RHadoop, but the data does not come out sorted. The code is given below. Can anybody please help me find where I'm making the mistake? I'm working on this problem because I want to understand how to define the key and value variables.

small.ints=runif(100,10.0,20.0)
data<-sample(1:100,100,replace=F)
data1<-data.frame(data,small.ints)
hdfs.input = to.dfs(data1)
# Mapper
mapper <- function(k,v) {
  key <- data
  value <-small.ints
  keyval(key,value)
}

#Reducer

reducer <- function(k,v) {
  key <- k  
  value <- v
  keyval(key,arrange(v))
}
#mapreduce program
out<-mapreduce(
  input = hdfs.input,
  map = mapper,reduce=reducer)

Thanks a lot!


Solution

  • It's not clear from your question what exactly you want sorted. From your code, it appears that you are trying to sort the values ('small.ints') within each key.

    The reducer operates on one key at a time, receiving all the values for that key. In your case there are 100 rows of keys and values, and every key is unique: since data = sample(1:100, 100, replace = F), 'data' is essentially 1:100 in random order.

    That means each key has only one value. It does not matter which way you sort a single value, the result is always the same: 12 = sort(12) = sort(12, decreasing = TRUE).

    If you would like to have the data set sorted by 'data', then I think the mapper should be:

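    library(rmr2)  # mapreduce(), keyval(), to.dfs()
    library(plyr)  # arrange(); assumed to come from plyr (dplyr also provides one)

    mapper <- function(k, v) {
      # input: key = NULL, value = data frame with columns (data, small.ints)
      # arrange() sorts the rows passed to this map call by the 'data' column
      keyval(k, arrange(v, data))
    }

    # mapreduce program (map-only job, so no reducer)
    out <- mapreduce(
      input = hdfs.input,
      map = mapper,
      reduce = NULL)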
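
The reasoning above can be checked locally, without Hadoop. The following is a minimal sketch in base R only. The fixed seed is an assumption added for reproducibility, and split() is used here as a stand-in for the per-key grouping that happens before the reducer runs; it is not how rmr2 groups data internally.

```r
# Local, Hadoop-free check of the one-value-per-key argument (base R only).
set.seed(42)  # assumption: fixed seed, not part of the original question

small.ints <- runif(100, 10.0, 20.0)
data <- sample(1:100, 100, replace = FALSE)
data1 <- data.frame(data, small.ints)

# Group the values by key, roughly what the reducer would see per key.
groups <- split(data1$small.ints, data1$data)

# Every key is unique, so every group holds exactly one value.
stopifnot(all(lengths(groups) == 1))

# Sorting a single value is a no-op in either direction.
g <- groups[[1]]
stopifnot(identical(sort(g), sort(g, decreasing = TRUE)))

# Sorting the whole data frame by 'data' is what actually reorders rows.
sorted <- data1[order(data1$data), ]
stopifnot(!is.unsorted(sorted$data))
head(sorted)
```

If the stopifnot() calls pass silently, each key maps to a single value, which is why sorting inside the reducer had no visible effect, while ordering by the 'data' column does reorder the rows.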