Explanations for the two questions I asked are at the end of this post.
I am trying to get a simple Wordcount program to run so I can play around and see what does what.
I currently have an implementation that seems to run fine to the very end. Then, after the last line in Main() (which is just a println saying so), I get output that looks like a summary of the Hadoop job, along with a single exception.
In my Mapper and Reducer functions I also have a line that simply prints arbitrary text to the screen, just so I know the line is hit, but at no point during the run do I see either of those lines execute. I believe this is related to the IOException shown in the output below.
I have 2 questions:

1. Why do the print lines in my Mapper and Reducer never execute, even though I register those classes with setMapperClass(), setCombinerClass() and setReducerClass()?
2. Why does the job output (including the exception) only show up after the last line of Main() has already run?

I've saved the output of running the job to a file:
Enter the Code to run the particular program.
Wordcount = 000:
Assignment 1 = 001:
Assignment 2 = 002:
000
/usr/dan/wordcount/
/usr/dan/wordcount/result.txt
May 04, 2014 2:22:28 PM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
May 04, 2014 2:22:29 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 2
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
May 04, 2014 2:22:29 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 2
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.MapTask <init>
INFO: io.sort.mb = 100
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.MapTask <init>
INFO: data buffer = 79691776/99614720
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.MapTask <init>
INFO: record buffer = 262144/327680
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.LocalJobRunner run
WARNING: job_local_0001
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:541)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
May 04, 2014 2:22:30 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 0% reduce 0%
May 04, 2014 2:22:30 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
May 04, 2014 2:22:30 PM org.apache.hadoop.mapred.JobClient log
INFO: Counters: 0
Not Fail!
I hit the end of wordcount!
I hit the end of Main()
The way I have the application set up, a main class dispatches, based on user input, to the respective class (a rough sketch of it follows). In case it helps, I will post only the class I'm working on at the moment; if you need to see more, just ask.
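The dispatcher itself looks roughly like this (a simplified, illustrative sketch: only the structure is real, and the details are reconstructed from the menu output above):

package hadoop;

import java.util.Scanner;

public class Main {
    public static void main(String[] args) throws Exception {
        // Print the menu shown in the output above.
        System.out.println("Enter the Code to run the particular program.");
        System.out.println("Wordcount = 000:");
        System.out.println("Assignment 1 = 001:");
        System.out.println("Assignment 2 = 002:");
        Scanner in = new Scanner(System.in);
        String code = in.nextLine();
        // Send the flow to the class the user picked.
        if (code.equals("000")) {
            new Wordcount().wordcount(args);
        }
        // ... 001 and 002 dispatch to the assignment classes in the same way.
        System.out.println("I hit the end of Main()");
    }
}

And here is the class I'm currently working on: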
package hadoop;

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 *
 * @author Dans Laptop
 */
public class Wordcount {

    public static class TokenizerMapper
            extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, org.apache.hadoop.mapreduce.Reducer.Context context
                ) throws IOException, InterruptedException {
            System.out.println("mapper!");
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                org.apache.hadoop.mapreduce.Reducer.Context context
                ) throws IOException, InterruptedException {
            System.out.println("Reducer!");
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public void wordcount(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        System.out.println(args[0]); // Prints arg 1
        System.out.println(args[1]); // Prints arg 2
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(Wordcount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        try {
            job.waitForCompletion(true);
            System.out.println("Not Fail!");
        } catch (Exception e) {
            System.out.println(e.getLocalizedMessage());
            System.out.println(e.getMessage());
            System.out.println(Arrays.toString(e.getStackTrace()));
            System.out.println(e.toString());
            System.out.println("Failed!");
        }
        System.out.println("I hit the end of wordcount!"); // Proves I hit the end of wordcount.
    }
}
The command being used to run the jar is (from the location /usr/dan):
hadoop -jar ./hadoop.jar /usr/dan/wordcount/ /usr/dan/wordcount/result.txt
Note: I expect the program to look at all of the files in /usr/dan/wordcount and then create a file /usr/dan/wordcount/result.txt that lists each word and the number of times it occurs. I am not getting that behavior yet, but I want to figure out these 2 questions first so I can troubleshoot the rest of the way.
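Two side notes I've picked up while reading around, in case they matter here (I haven't verified either against my setup yet): the documented form of the command is hadoop jar <jarfile> [mainClass] <args>, with no dash before jar; and FileOutputFormat.setOutputPath() names an output directory rather than a file, one that must not already exist, with the results written as part files inside it. Roughly:

// The second argument is a directory, not a file. The job fails if it
// already exists, and the results appear as part files inside it, e.g.
// /usr/dan/wordcount/out/part-r-00000. ("out" is just an illustrative name.)
FileOutputFormat.setOutputPath(job, new Path("/usr/dan/wordcount/out"));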
Response to @Alexey:
I did not realize it was not possible to print to the console directly during MapReduce job execution in Hadoop; I had just assumed those lines were not being executed. Now I know where to look for any output produced during a job. However, following the directions in the question you linked did not show me any jobs to look at, though maybe that is because I have not had any job run to completion yet.
I've switched from job.submit(); to job.waitForCompletion(true);, but I am still getting the output after it. I don't know whether this indicates something else is wrong, but I figured I'd document it.
I've added the lines you suggested (do these set the output types of the Map class only?):
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
and kept the lines (do these set the output types of both the Map and Reduce classes?):
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
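From what I've read since, the relationship between the two pairs seems to be (please correct me if this is wrong):

// Final output types of the job, i.e. what the Reducer emits. Absent the
// setMapOutput* calls, these also serve as the defaults for the map
// output types.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Intermediate types emitted by the Mapper; strictly only needed when
// they differ from the final output types (here they do not differ).
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);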
I am still getting the same exception. From reading about it online, it seems the error has to do with the output types of the Map class not matching the input types of the Reduce class, yet it seems pretty explicit in my code that those two match. The only thing that has me confused is where it is getting LongWritable from, since I do not have that anywhere in my code. I also looked at the question Hadoop type mismatch in key from map expected value Text received value LongWritable, which describes the same situation, but the solution there was to specify the Mapper and Reducer classes, which I am already doing. The other thing I notice is that the input key of my Mapper class is of type Object; could that have any significance? The error says the mismatch is in the key FROM the map, though.
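One thing I'm now eyeing, for what it's worth: the third parameter of my map method is declared as org.apache.hadoop.mapreduce.Reducer.Context, so the method may be an overload rather than an override of Mapper.map. If it does not actually override, Hadoop runs the inherited identity map, which passes the input key straight through, and with the default TextInputFormat that key is the LongWritable byte offset of each line, which would explain the mismatch exactly. A minimal sketch of how I believe the inner class is supposed to look (the @Override annotation makes the compiler verify the override, and Context here is the Mapper's own inner type):

public static class TokenizerMapper
        extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override // Fails to compile unless this genuinely overrides Mapper.map.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        System.out.println("mapper!");
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}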
I've also gone ahead and updated my code and results above.
Thank you for your help so far; I've already picked up a lot of information from your response.
EXPLANATIONS
Your first question looks similar to this one: How to print on console during MapReduce job execution in hadoop.
The line job.submit(); tells Hadoop to run the job but not to wait for its completion. I think you may want to replace it with job.waitForCompletion(true); so that there is no output after "I hit the end of wordcount!".
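For reference, the stock WordCount example also propagates the job's result as the process exit code:

// Block until the job finishes (printing progress), then exit with its status.
System.exit(job.waitForCompletion(true) ? 0 : 1);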
To get rid of the exception, you should specify the output key and value classes of your mapper:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);