Search code examples
azureanalyticsazure-data-lakemaxmind

Map IP to Location with Azure data lake analytics


I have a lot of avro files with IPs (weblogs) stored in Azure Blobs. I would like to map the IP to location. How to do that with Azure Data Lake Analytics (ADLA)?

Right now I have a spark job that uses the Maxmind IP database with a java library that reads a 113MB big .mmdb file with all the ip-location data to do this lookup. I am now investigating if its possible to move this job over to ADLA

Maxmind provies a c# library as well, so that part is no problem. However, its not obvious to me how to deal with this big mmdb file that needs to be read and then used for the lookups. Obviously it would not be fast to read the file for each IP lookup. How can this (and similar cases) be handled with ADLA, or is ADLA unsuitable for this kinds of operations?

If I had a normal program running I would do the lookup like this:

using (var reader = new Reader("GeoIP2-City.mmdb"))
{
    foreach(var ip in ips)
    {
        var data = reader.Find<Dictionary<string, object>>(ip);
        ...
    }
}

The maxmind db is available here: https://dev.maxmind.com/geoip/geoip2/downloadable/ (note that I have bought the database Im currently using) and the c# library to read it here: https://github.com/maxmind/MaxMind-DB-Reader-dotnet


Solution

  • You can use U-SQL's DEPLOY RESOURCE statement and a UDO for this purpose.

    First, the file must be uploaded to you datalake store. Then use DEPLOY RESOURCE to tell the U-SQL system to copy that file to each vertex where the script runs. Then your script uses C# code to read the file.

    DEPLOY RESOURCE "/helloworld.txt";
    
    @departments =
      SELECT * 
      FROM (VALUES
          (31, "Sales"),
          (33, "Engineering"),
          (34, "Clerical"),
          (35, "Marketing")
        ) AS D( DepID, DepName );
    
    
    @departments =
         PROCESS @departments
         PRODUCE DepID int,
                 DepName string,
                 HelloWorld string
         USING new Demo.HelloWorldProcessor();
    
    OUTPUT @departments 
        TO "/departments.tsv"
        USING Outputters.Tsv();
    

    And here is the U-SQL Processor UDO.

    using Microsoft.Analytics.Interfaces;
    using Microsoft.Analytics.Types.Sql;
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    
    namespace Demo
    {
        [SqlUserDefinedProcessor]
        public class HelloWorldProcessor : IProcessor
        {
            private string hw;
    
            public HelloWorldProcessor()
            {
                this.hw = System.IO.File.ReadAllText("helloworld.txt");
            }
    
            public override IRow Process(IRow input, IUpdatableRow output)
            {
                output.Set<int>("DepID", input.Get<int>("DepID"));
                output.Set<string>("DepName", input.Get<string>("DepName"));
                output.Set<string>("HelloWorld", hw);
                return output.AsReadOnly();
            }
        }
    }