Tags: mapreduce, hdfs, cloudera, distributed-cache

Reading HAR file from DistributedCache in mapreduce


I've written an Oozie workflow which creates a HAR archive and then runs an MR job which needs to read data from this archive.

1. The archive is created.
2. When the job runs, the mapper does see the archive in the distributed cache.
3. ??? How can I read this archive? What is the API to read data from it line by line (my HAR is a batch of multiple newline-separated text files)?

NB: It works perfectly when I work with regular files (not a HAR archive) stored in the DistributedCache. The problem only appears when I try to read data from the HAR.

Here is a code snippet:

    InputStream inputStream;
    String cachedDatafileName = System.getProperty(DIST_CACHE_FILE_NAME);
    LOG.info(String.format("Looking for[%s]=[%s] in DistributedCache",DIST_CACHE_FILE_NAME, cachedDatafileName));

    // Find the cached archive whose URI ends with the expected file name.
    URI[] uris = DistributedCache.getCacheArchives(getContext().getConfiguration());
    URI uriToCachedDatafile = null;
    for(URI uri : uris){
        if(uri.toString().endsWith(cachedDatafileName)){
            uriToCachedDatafile = uri;
            break;
        }
    }
    if(uriToCachedDatafile == null){
        throw new RuntimeConfigurationException(String.format("Looking for[%s]=[%s] in DistributedCache failed. There is no such file",
                DIST_CACHE_FILE_NAME, cachedDatafileName));
    }

    Path pathToFile = new Path(uriToCachedDatafile);
    LOG.info(String.format("[%s] has been found. Uri is: [%s]. The path is:[%s]",cachedDatafileName, uriToCachedDatafile, pathToFile));

    FileSystem fileSystem = pathToFile.getFileSystem(getContext().getConfiguration());
    HarFileSystem harFileSystem = new HarFileSystem(fileSystem); // wraps fileSystem, but initialize() is never called
    inputStream = harFileSystem.open(pathToFile); //NULL POINTER EXCEPTION IS HERE!
    return inputStream;
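
The NullPointerException is most likely because new HarFileSystem(fileSystem) only wraps the underlying filesystem; initialize() is never called on it, so the archive index is never loaded before open(). A minimal sketch of the alternative used in the solution below, which lets Path.getFileSystem() resolve and initialize the har filesystem (it assumes the archive sits on the default filesystem and that the inner path /stf/db_bts_stf.txt is known):

    // Sketch: qualify the path with the har scheme instead of constructing
    // HarFileSystem by hand; getFileSystem() then returns an initialized HarFileSystem.
    // Assumes the archive is on the default filesystem and the inner file path is known.
    Path insideHar = new Path("har://" + uriToCachedDatafile.getPath() + "/stf/db_bts_stf.txt");
    FileSystem harFs = insideHar.getFileSystem(getContext().getConfiguration());
    InputStream in = harFs.open(insideHar);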

Solution

  • protected InputStream getInputStreamToDistCacheFile() throws IOException{
            InputStream inputStream;
            String cachedDatafileName = System.getProperty(DIST_CACHE_FILE_NAME);
            LOG.info(String.format("Looking for[%s]=[%s] in DistributedCache",DIST_CACHE_FILE_NAME, cachedDatafileName));
    
            URI[] uris = DistributedCache.getCacheArchives(getContext().getConfiguration());
            URI uriToCachedDatafile = null;
            for(URI uri : uris){
                if(uri.toString().endsWith(cachedDatafileName)){
                    uriToCachedDatafile = uri;
                    break;
                }
            }
            if(uriToCachedDatafile == null){
                throw new RuntimeConfigurationException(String.format("Looking for[%s]=[%s] in DistributedCache failed. There is no such file",
                        DIST_CACHE_FILE_NAME, cachedDatafileName));
            }
    
            //Path pathToFile = new Path(uriToCachedDatafile +"/stf/db_bts_stf.txt");
            // Build a har:// qualified path to the known file inside the archive.
            // The absolute local path below is specific to this test environment.
            Path pathToFile = new Path("har:///"+"home/ssa/devel/megalabs/kyc-solution/kyc-mrjob/target/test-classes/GSMCellSubscriberHomeIntersectionJobDescriptionClusterMRTest/in/gsm_cell_location_stf.har" +"/stf/db_bts_stf.txt");
    
            LOG.info(String.format("[%s] has been found. Uri is: [%s]. The path is:[%s]",cachedDatafileName, uriToCachedDatafile, pathToFile));
            // getFileSystem() resolves the har:// scheme and returns an initialized HarFileSystem.
            FileSystem harFileSystem = pathToFile.getFileSystem(context.getConfiguration());
            FSDataInputStream fin = harFileSystem.open(pathToFile);
            LOG.info("fin: " + fin);
    //        FileSystem fileSystem =  pathToFile.getFileSystem(getContext().getConfiguration());
    //        HarFileSystem harFileSystem = new HarFileSystem(fileSystem);
    //        harFileSystem.exists(new Path("har://home/ssa/devel/mycompany/my-solution/my-mrjob/target/test-classes/HomeJobDescriptionClusterMRTest/in/locations.har"));
    //        LOG.info("harFileSystem.exists(pathToFile):"+ harFileSystem.exists(pathToFile));
    //        harFileSystem.initialize(uriToCachedDatafile, context.getConfiguration());
    
            // List the entries of the archive to verify that the har:// path resolves.
            FileStatus[] statuses = harFileSystem.listStatus(new Path("har:///"+"home/ssa/devel/mycompany/my-solution/my-mrjob/target/test-classes/HomeJobDescriptionClusterMRTest/in/locations.har"));
            for(FileStatus fileStatus : statuses){
                LOG.info("fileStatus isDir"+fileStatus.isDirectory() +" len:" + fileStatus.getLen());
            }
    
    //        String tmpPathToFile = "har:///"+pathToFile.toString(); //+"/stf/db_bts_stf.txt";
    //        Path tmpPath = new Path(tmpPathToFile);
    //        LOG.info("KILL ME PATH TO FILE IN ARCHIVE: " +tmpPath);
    //        inputStream = harFileSystem.open(tmpPath);
    //        return inputStream;
            return fin;
        }
    

    As you can see, it's terrible. You have to manually read the index file stored inside the archive and reconstruct the paths from its metadata. If you know the exact name of a file stored in the archive (as in my example), you can construct the path by hand.

    It's not convenient. I expected something like Zip -> ZipEntry, where you can iterate over the entries of an archive without knowing its structure.
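
    That said, listing a har:// path with FileSystem.listStatus() gives a rough equivalent of iterating over ZipEntry objects, and wrapping the opened streams in a BufferedReader covers the line-by-line reading asked about above. A minimal sketch, assuming the archive sits on the default filesystem (the class and method names and the use of System.out are placeholders of mine, not part of the original answer):

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.InputStreamReader;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Sketch: enumerate the top-level entries of a HAR and read each text file line by line.
        // "har://" + <absolute path to the .har> assumes the archive is on the default filesystem.
        public class HarEntryReader {
            public static void readHarEntries(String absolutePathToHar, Configuration conf) throws IOException {
                Path harRoot = new Path("har://" + absolutePathToHar);
                FileSystem harFs = harRoot.getFileSystem(conf);
                for (FileStatus status : harFs.listStatus(harRoot)) {
                    if (status.isDirectory()) {
                        continue; // recurse here if the archive contains sub-directories
                    }
                    try (BufferedReader reader = new BufferedReader(
                            new InputStreamReader(harFs.open(status.getPath())))) {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            System.out.println(line); // process each line of the archived text file
                        }
                    }
                }
            }
        }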