Tags: hadoop, mapreduce, geoip, distributed-cache

Accessing Maxmind Geo API in Hadoop using Distributed Cache


I am writing a MapReduce job to analyze web logs. The job maps IP addresses to geo locations, and I am using the Maxmind Geo API (https://github.com/maxmind/geoip-api-java) for that purpose. The API's LookupService needs a database file with IP-to-location mappings, which I am trying to distribute to the tasks through the distributed cache. I tried doing this in two different ways.

Case 1:

Run the job passing the file from HDFS, but it always fails with a "FILE NOT FOUND" error:

sudo -u hdfs hadoop jar \
 WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
/user/hdfs/GeoLiteCity.dat 

OR

sudo -u hdfs hadoop jar \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
GeoLocationDatasetDriver /user/hdfs/input /user/hdfs/out_put \
hdfs://sandbox.hortonworks.com:8020/user/hdfs/GeoLiteCity.dat

Driver Class Code:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.addCacheFile(new Path(args[2]).toUri()); 

Mapper Class Code:

public void setup(Context context) throws IOException
{
URI[] uriList = context.getCacheFiles();
Path database_path = new Path(uriList[0].toString());
LookupService cl = new LookupService(database_path.toString(),
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

Case 2:

Run the job passing the file from the local file system through the -files option. Error: NullPointerException at the line LookupService cl = new LookupService(database_path):

sudo -u hdfs hadoop jar  \
WebLogProcessing-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
com.prithvi.mapreduce.logprocessing.ipgeo.GeoLocationDatasetDriver \
-files /tmp/jobs/GeoLiteCity.dat /user/hdfs/input /user/hdfs/out_put \
GeoLiteCity.dat

Driver Code:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
String dbfile = args[2];
conf.set("maxmind.geo.database.file", dbfile);

Mapper Code:

public void setup(Context context) throws IOException
{
  Configuration conf = context.getConfiguration();
  String database_path = conf.get("maxmind.geo.database.file");
  LookupService cl = new LookupService(database_path,
            LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
}

I need this database file on all my task trackers for the job to work. Can anyone suggest the right way to do this?


Solution

  • Try doing this:

    From the driver, specify where the file is in HDFS using the Job object, like so:

    job.addCacheFile(new URI("hdfs://localhost:8020/GeoLite2-City.mmdb#GeoLite2-City.mmdb"));
    

    where # denotes an alias name (a symbolic link) that Hadoop creates in each task's working directory

    After that you can access the file from the Mapper in its setup() method:

    @Override
    protected void setup(Context context) {
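      // the alias from addCacheFile appears as a symlink in the task's working directory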
      File file = new File("GeoLite2-City.mmdb");
    }
    

    Here is an example:
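    A minimal end-to-end sketch, assuming the question's GeoLiteCity.dat database and the legacy com.maxmind.geoip.LookupService API; the class name GeoLocationJob and the assumption that the IP address is the first whitespace-separated field of each log line are illustrative, not taken from the original post:

    import java.io.File;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import com.maxmind.geoip.Location;
    import com.maxmind.geoip.LookupService;

    public class GeoLocationJob {

      public static class GeoMapper extends Mapper<LongWritable, Text, Text, Text> {

        private LookupService lookup;

        @Override
        protected void setup(Context context) throws IOException {
          // The file was cached as "...#GeoLiteCity.dat", so Hadoop symlinks it
          // into the task's working directory and it can be opened by bare name.
          lookup = new LookupService(new File("GeoLiteCity.dat"),
              LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Assumes the IP address is the first whitespace-separated field
          // of each log line (as in common/combined log formats).
          String ip = value.toString().split("\\s+")[0];
          Location loc = lookup.getLocation(ip);
          if (loc != null) {
            context.write(new Text(ip), new Text(loc.countryName + "\t" + loc.city));
          }
        }

        @Override
        protected void cleanup(Context context) {
          lookup.close();
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "geo-location");
        job.setJarByClass(GeoLocationJob.class);
        job.setMapperClass(GeoMapper.class);
        job.setNumReduceTasks(0);           // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // args[2] is the full HDFS URI of the database, e.g.
        // hdfs://sandbox.hortonworks.com:8020/user/hdfs/GeoLiteCity.dat
        // The "#GeoLiteCity.dat" fragment names the symlink in each task dir.
        job.addCacheFile(new URI(args[2] + "#GeoLiteCity.dat"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

    With the alias in place, the mapper needs neither an absolute path nor a Configuration property; the relative name resolves via the symlink in every task's working directory.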