I've created a Hive UDF that parses a URL. The URL contains query parameters. When I parse the input in my UDF, however, characters like '=' and '&' are converted to gibberish.
Initially, I relied on Text's toString() method to convert the Hive Text to a Java String, but the characters above came out as gibberish. I then tried new String(requestInput.getBytes(), StandardCharsets.UTF_8) for the conversion. This worked at first, but then it started producing gibberish as well.
My method is shown below. Any ideas on what I might not be doing right?
public Text evaluate(final Text requestInput, final Text referrerInput) {
    if (requestInput == null || referrerInput == null)
        return null;
    final String request = new String(requestInput.getBytes(), StandardCharsets.UTF_8); // converts '=' and '&' in URL strings to gibberish
    final String referrer = new String(referrerInput.getBytes(), StandardCharsets.UTF_8); // converts '=' and '&' in URL strings to gibberish
}
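A side note on the conversion itself, separate from the '=' and '&' problem: Hadoop's documentation for Text.getBytes() says that only the data up to getLength() is valid, so decoding the whole array can pick up stale bytes when the object is reused for a shorter value. The sketch below imitates that with a plain byte array rather than a real Text; the class BackingArrayDemo and its methods are hypothetical names made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a reusable byte container (a backing array plus
// a valid length). It is NOT Hadoop's Text, only an imitation of the pitfall.
final class BackingArrayDemo {
    private BackingArrayDemo() {}

    // Copies the UTF-8 bytes of value over the start of backing and returns
    // the new valid length. Assumes value fits into the existing array.
    static int overwrite(byte[] backing, String value) {
        byte[] src = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(src, 0, backing, 0, src.length);
        return src.length;
    }

    // Decodes the entire array, including bytes left over from an earlier value.
    static String decodeWholeArray(byte[] backing) {
        return new String(backing, StandardCharsets.UTF_8);
    }

    // Decodes only the first length bytes, which is the valid part.
    static String decodeValidPrefix(byte[] backing, int length) {
        return new String(backing, 0, length, StandardCharsets.UTF_8);
    }
}
```

With a real Text, the equivalent of decodeValidPrefix would be new String(text.getBytes(), 0, text.getLength(), StandardCharsets.UTF_8). Text.toString() already restricts itself to the valid length.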
When I run HQL in Hive:
SELECT get_json_object(json, '$.base.request_url') FROM events
I get this:
GET /api/get_info?id=1465473313746 HTTP/1.1
In my UDF, the toString() method (no additional processing) produces the following output:
GET /api/get_info?id\u003d1465473313746 HTTP/1.1
I learned that the = and & characters were being converted to their Unicode escape sequences (\u003d for =, \u0026 for &). Why this was happening is still unclear to me. The Apache Commons StringEscapeUtils utility made the fix simple: calling StringEscapeUtils.unescapeJava(requestInput.toString()) solved the issue.
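To show what the unescaping step does to the string above, here is a minimal dependency-free sketch that decodes only the four-hex-digit Unicode escapes. The class UnicodeUnescape and the method unescapeUnicode are hypothetical names for this illustration. This is not the Apache Commons implementation, which also handles other Java escapes such as newline, tab and escaped backslashes.

```java
// Illustrative helper, not the Apache Commons code: decodes escapes made of a
// backslash, the letter u and exactly four hex digits. All other text,
// including malformed or truncated escapes, is copied through unchanged.
final class UnicodeUnescape {
    private UnicodeUnescape() {}

    static String unescapeUnicode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // An escape needs six characters: backslash, 'u', four hex digits.
            if (input.startsWith("\\u", i) && i + 6 <= input.length()) {
                String hex = input.substring(i + 2, i + 6);
                if (isHex(hex)) {
                    out.append((char) Integer.parseInt(hex, 16));
                    i += 6;
                    continue;
                }
            }
            out.append(input.charAt(i));
            i++;
        }
        return out.toString();
    }

    // True when every character of s is a hexadecimal digit.
    private static boolean isHex(String s) {
        for (int k = 0; k < s.length(); k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Inside the UDF this would be called as UnicodeUnescape.unescapeUnicode(requestInput.toString()). In practice the library call is the safer choice, since this sketch does not treat an escaped backslash in front of the u specially.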