Tags: hadoop, caching, mapreduce, hadoop2

Write data to local disk in each datanode


I want to store some values from a map task on the local disk of each data node. For example:

public void map (...) {
   //Process
   List<Object> cache = new ArrayList<Object>();
   //Add value to cache
   //Serialize cache to local file in this data node
}

How can I store this cache object on the local disk of each data node? If I serialize the cache inside the map function as shown above, performance will be terrible because of the I/O involved.

In other words, is there a way to wait until the map task on a data node has finished and only then write the cache to local disk? Or does Hadoop have a feature that solves this?


Solution

  • See the example below. The created file will be somewhere under the directories that the NodeManager uses for containers. These are set by the configuration property yarn.nodemanager.local-dirs in yarn-site.xml, or by the default inherited from yarn-default.xml, which is under /tmp.

    Please see @Chris Nauroth's answer, which says that this is only for debugging purposes and is not recommended as a permanent production configuration. That answer describes clearly why it is not recommended.

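    // Needs java.io.File and java.io.IOException imported in the enclosing Mapper class.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // do some hadoop stuff, like counting words

        // Relative path, so the file is created in the task's working directory.
        String path = "newFile.txt";
        try {
            File f = new File(path);
            // createNewFile() returns false if the file already exists.
            f.createNewFile();
        } catch (IOException e) {
            System.out.println("Message easy to look up in the logs.");
            System.err.println("Error easy to look up in the logs.");
            e.printStackTrace();
            throw e;
        }
    }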
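
On the question of waiting until the map task has finished: Hadoop's Mapper class has a cleanup(Context) method that the framework calls once per task, after the last map() call. One possible approach is to keep the cache in a field, add to it in map(), and serialize it once in cleanup(). The sketch below shows only that pattern in plain Java so that it runs without Hadoop on the classpath. The class name CachingMapperSketch, the readBack helper, and the simplified method signatures are assumptions made for illustration; in a real job the two methods would be overrides of Mapper.map and Mapper.cleanup, and the cache must fit in the task's memory.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a Mapper subclass (hypothetical name, not a Hadoop class).
class CachingMapperSketch {
    // One cache per task instance, shared by every map() call.
    private final ArrayList<String> cache = new ArrayList<>();

    // Stand-in for Mapper.map(): only touches memory, no disk I/O per record.
    void map(String value) {
        cache.add(value);
    }

    // Stand-in for Mapper.cleanup(): a single write once all records are processed.
    void cleanup(File target) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(target))) {
            out.writeObject(cache);
        }
    }

    // Hypothetical helper that reads the serialized cache back, e.g. for debugging.
    @SuppressWarnings("unchecked")
    static List<String> readBack(File source)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(source))) {
            return (List<String>) in.readObject();
        }
    }
}
```

With this layout the disk is touched once per map task rather than once per record, which is the cost the question is worried about.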