Tags: hadoop, mapper, sequencefile

How does the Mapper class identify a SequenceFile as the input file in Hadoop?


In one of my MapReduce jobs, I subclass BytesWritable as KeyBytesWritable and ByteWritable as ValueBytesWritable, and then write the output using SequenceFileOutputFormat.
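
For reference, a minimal sketch of what such a subclass might look like (the class name comes from the question; the body is assumed). Note that any custom Writable stored in a SequenceFile needs a public no-argument constructor, because the framework instantiates it reflectively:

import org.apache.hadoop.io.BytesWritable;

// Assumed shape: a thin subclass that inherits its serialization from BytesWritable.
public class KeyBytesWritable extends BytesWritable {
    public KeyBytesWritable() { super(); }            // no-arg constructor, required for reflective instantiation
    public KeyBytesWritable(byte[] bytes) { super(bytes); }
}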

My question is: when I start the next MapReduce job, I want to use this SequenceFile as the input file. How should I configure the job, and how can the Mapper class identify the key and value types that I overrode before?

I understand that I could use SequenceFile.Reader to read the keys and values:

Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    // process key and value here
}
reader.close();

But I don't know how to use this Reader to pass the keys and values into the Mapper class as parameters. How can I set the job's input format to SequenceFileInputFormat and then let the Mapper receive the keys and values?

Thanks


Solution

  • You do not need to read the sequence file manually. Just set the job's input format class to SequenceFileInputFormat:

    job.setInputFormatClass(SequenceFileInputFormat.class);
    

    and set the input path to the directory containing your sequence files:

    FileInputFormat.setInputPaths(job, new Path("<path to the dir containing your sequence files>"));
    

    You will need to make sure the (key, value) type parameters of your Mapper class match the (key, value) types stored inside your sequence file; a full driver sketch follows below.
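
    For concreteness, here is a minimal driver and Mapper sketch for the follow-up job. It assumes Hadoop 2.x with the new mapreduce API; KeyBytesWritable and ValueBytesWritable are the question's own classes and are assumed to be on the classpath, and the output types and paths are illustrative placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SecondJobDriver {

        // The Mapper's input types must match the (key, value) classes that the
        // first job wrote into the sequence file.
        public static class SequenceFileMapper
                extends Mapper<KeyBytesWritable, ValueBytesWritable, Text, Text> {

            @Override
            protected void map(KeyBytesWritable key, ValueBytesWritable value, Context context)
                    throws java.io.IOException, InterruptedException {
                // The framework has already deserialized key and value;
                // process them and emit whatever the second job needs.
                context.write(new Text(key.toString()), new Text(value.toString()));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "read sequence file");
            job.setJarByClass(SecondJobDriver.class);

            // Tell the framework the input is a sequence file; it reads the
            // key/value class names from the file header and instantiates them.
            job.setInputFormatClass(SequenceFileInputFormat.class);

            job.setMapperClass(SequenceFileMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }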