hadoop hive elastic-map-reduce emr

LeaseExpiredException with custom UDF in Hive


I have a Hive UDF that is supposed to extract the device from a user-agent (UA) string. It uses the ua-parser library: https://github.com/tobie/ua-parser

The UDF is rather simple:

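import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import ua_parser.Client;
import ua_parser.Parser;

public class DeviceTypeExtractTest extends UDF {
  // Reused across rows to avoid allocating a new Text per call.
  private final Text result = new Text();

  // Shared parser, built once per JVM when the class loads.
  private static final Parser uaParser;
  static {
    try {
      uaParser = new Parser();
    } catch (IOException e) {
      // Keep the original exception as the cause so the failure is traceable.
      throw new RuntimeException("Could not instantiate User-Agent parser.", e);
    }
  }

  public Text evaluate(Text uaField) {
    if (uaField == null) {
      return null;
    }

    try {
      Client client = uaParser.parse(uaField.toString());
      result.set(client.device.family);
      return result;
    } catch (Exception e) {
      // Any failure while parsing yields NULL for this row.
      return null;
    }
  }
}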

And it works just fine when run on a small dataset.

create table categories(cat string);
insert overwrite table categories select DEVICE_TYPE_EXTRACT(user_agent) from raw_logs;

However, when testing this on a larger dataset of over 10 million rows, I get this LeaseExpiredException on every attempt: http://pastebin.com/yK6Qmx6r

My map and reduce tasks also remain stuck at 0% for hours. If I take out this UDF and use built-in Hive UDFs for testing instead, this does not happen.

I am running this on an Amazon EMR cluster with AMI version 2.4.5 (Hive 0.11.0.2 and Hadoop 1.0.3).

I tried improving cluster performance by deploying better hardware, but I get the same problem with every hardware configuration.

Any ideas?


Solution

  • Okay, scratch that. It seems that after upgrading my instance the job did start making progress; I was just not waiting long enough for the map phase to get going. And the LeaseExpiredException was actually thrown because of little ol' me killing the processes.

    Still, the parsing takes an immense amount of time, and I would love some suggestions for further optimizing this UDF.
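
    One possible direction, offered as a sketch rather than a tested fix: if the logs contain many repeated user-agent strings (an assumption, not something measured here), the UDF could remember the result for each distinct string and skip the regex-heavy parse on repeats. The class below is a hypothetical helper (CachedLookup and its Lookup interface are names made up for illustration, not part of Hive or ua-parser). It wraps any slow String-to-String function in a small LRU cache built on LinkedHashMap, and avoids Java 8 features in case the cluster runs an older JVM.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: memoizes a slow String -> String lookup in an LRU cache. */
class CachedLookup {

    /** The expensive operation to wrap, e.g. parsing a user-agent string. */
    interface Lookup {
        String apply(String key);
    }

    private final Lookup slowLookup;
    private final Map<String, String> cache;

    CachedLookup(Lookup slowLookup, final int maxEntries) {
        this.slowLookup = slowLookup;
        // accessOrder = true keeps entries ordered from least to most recently used,
        // so removeEldestEntry evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value for key, computing and storing it on a miss. */
    String get(String key) {
        String value = cache.get(key);
        // containsKey distinguishes "not cached" from "cached null result".
        if (value == null && !cache.containsKey(key)) {
            value = slowLookup.apply(key);
            cache.put(key, value);
        }
        return value;
    }
}
```

    In the UDF, the idea would be to hold one CachedLookup as an instance field, give it a Lookup whose apply returns uaParser.parse(key).device.family, and call get(uaField.toString()) from evaluate. The cache is not thread-safe; this assumes each UDF instance is called from a single thread. Whether it helps at all depends on how often the same UA string recurs within one task's input.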