Hive UDF's treatment of URLs


I've created a Hive UDF that parses a URL. The URL contains query parameters. When I parse the input in my UDF, however, characters like '=' and '&' are converted to gibberish.

Initially, I relied on Text's toString() method to convert the Hive Text to a Java String, but that turned the above characters into gibberish. I then tried decoding the Text's bytes with new String(..., StandardCharsets.UTF_8). This worked at first, but then it started producing gibberish as well.

My method is shown below. Any ideas on what I might not be doing right?

public Text evaluate(final Text requestInput, final Text referrerInput) {
    if (requestInput == null || referrerInput == null)
        return null;

    // Text.getBytes() returns the backing array, which is only valid up to getLength(),
    // so the decode is bounded explicitly. The '=' and '&' in URL strings still come out as gibberish.
    final String request = new String(requestInput.getBytes(), 0, requestInput.getLength(), StandardCharsets.UTF_8);
    final String referrer = new String(referrerInput.getBytes(), 0, referrerInput.getLength(), StandardCharsets.UTF_8);

    // ... (rest of the method not shown in the original post)
}

When I run HQL in Hive:

SELECT get_json_object(json, '$.base.request_url') FROM events

I get this:

GET /api/get_info?id=1465473313746 HTTP/1.1

In my UDF, the toString() method (no additional processing) produces the following output:

GET /api/get_info?id\u003d1465473313746 HTTP/1.1


Solution

  • I learned that the = and & were arriving in the UDF as Unicode escape sequences (\u003d and \u0026) rather than as the literal characters. Why this was happening is still unclear to me. With the StringEscapeUtils utility from Apache Commons, the fix was straightforward:

    StringEscapeUtils.unescapeJava(requestInput.toString())

    solved the issue.
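
    If pulling in Apache Commons is not an option, the same idea can be sketched with the JDK alone. The class and method names below (UnicodeUnescape, unescapeUnicode) are made up for illustration, and the sketch only decodes \uXXXX sequences. It does not handle the other Java escapes that unescapeJava covers (\n, \t, \\ and so on), so treat it as a rough stand-in, not a replacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive or Apache Commons.
final class UnicodeUnescape {

    // A literal backslash, the letter 'u', and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every \uXXXX sequence in the input with the character it encodes.
    static String unescapeUnicode(String input) {
        if (input == null) {
            return null;
        }
        Matcher matcher = UNICODE_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            char decoded = (char) Integer.parseInt(matcher.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the decoded text literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The value as seen inside the UDF (backslash doubled for the Java literal).
        String raw = "GET /api/get_info?id\\u003d1465473313746 HTTP/1.1";
        System.out.println(unescapeUnicode(raw));
        // prints: GET /api/get_info?id=1465473313746 HTTP/1.1
    }
}
```

    Inside the UDF this would be called on requestInput.toString(), the same way as the unescapeJava call above.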