
How can I get the file path of a data shard in the Mapper of a MapReduce job?

I have a mapreduce job, where the file input path is: /basedirectory/*/*.txt

Inside the basedirectory, I have different subfolders (CaseA, CaseB, etc.), each of which contains HDFS text files.

In the map phase of the job, I want to find out where exactly the data shard came from (e.g. CaseA). How can I achieve that?

I've done something similar for MapReduce jobs with more than one input HBase table, where I use context.getInputSplit().getTableName() to find the actual table name, but I'm not sure what to do for HDFS input files.


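  • You can get the input split with context.getInputSplit() (where context is the Mapper.Context passed to map()). For file-based input formats such as TextInputFormat, the split is normally a FileSplit, so cast it to FileSplit and call getPath() to get the file path. Calling getParent().getName() on that path then gives the containing subfolder (e.g. CaseA).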
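
As a rough sketch of the last step, the snippet below extracts the subfolder name (e.g. CaseA) from a shard's file path. It uses only the JDK so it can be run without a cluster. The class name ShardOrigin, the helper parentDirName, and the example namenode address are made up for illustration. Inside a real mapper the path string would presumably come from the split, as shown in the comments, assuming a file-based input format whose split is a FileSplit.

```java
import java.net.URI;

public class ShardOrigin {

    // In a Hadoop mapper, the path would typically be obtained like this
    // (assumes the split is an org.apache.hadoop.mapreduce.lib.input.FileSplit):
    //
    //   Path p = ((FileSplit) context.getInputSplit()).getPath();
    //   String caseName = p.getParent().getName();
    //
    // The helper below does the same parent-directory lookup on a plain
    // string, so the logic can be tried outside Hadoop.
    static String parentDirName(String filePath) {
        // Drop the scheme and authority (e.g. hdfs://namenode:8020), if any.
        String path = URI.create(filePath).getPath();
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash <= 0) {
            return ""; // file sits directly under the root, no named parent
        }
        String parent = path.substring(0, lastSlash);
        return parent.substring(parent.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Hypothetical shard location matching /basedirectory/*/*.txt
        System.out.println(
            parentDirName("hdfs://namenode:8020/basedirectory/CaseA/part1.txt"));
        // prints "CaseA"
    }
}
```

Doing the lookup once in setup() rather than on every map() call is likely enough, since a mapper task normally processes a single split.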