Here is a Hive UDF class written in Java:
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
    private Text result = new Text();

    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }
}
Hive actually supports Java primitives in UDFs (and a few other types, such as java.util.List and java.util.Map), so a signature like:
public String evaluate(String str)
would work equally well. However, by using Text we can take advantage of object reuse, which can bring efficiency savings, so this is preferred in general.
Can someone explain why Text is preferred? How does using Text let us take advantage of object reuse? Suppose we execute the following command in Hive:
hive> SELECT strip(' bee ') FROM dummy;
If we then execute another command that uses the strip function, a new Strip object is created, right? So the object cannot be reused, right?
A Text instance is mutable, so you can reuse it by calling one of its set() methods instead of allocating a new object. For example:
Text t = new Text("hadoop");
t.set("pig");
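To connect this back to the Strip class: the reuse in question happens across the rows of a single query, not across separate queries. Hive calls evaluate() once per row on the same UDF instance, so the single result field is overwritten each time rather than a new object being allocated for every row. The following is only a minimal sketch of that pattern without any Hadoop or Hive dependency. MutableText, StripSketch and ReuseDemo are hypothetical names made up for illustration: MutableText stands in for org.apache.hadoop.io.Text, and trim() stands in for StringUtils.strip().

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.hadoop.io.Text: a mutable string holder.
final class MutableText {
    private String value = "";

    void set(String v) {
        value = v;
    }

    @Override
    public String toString() {
        return value;
    }
}

// Same shape as the Strip UDF: one result object per UDF instance.
final class StripSketch {
    private final MutableText result = new MutableText();

    MutableText evaluate(String str) {
        if (str == null) {
            return null;
        }
        // trim() is used here in place of StringUtils.strip() to avoid a dependency.
        result.set(str.trim());
        return result; // the same object is returned on every call
    }
}

final class ReuseDemo {
    // Counts how many distinct result objects one UDF instance returns for the given rows.
    static int distinctResults(String[] rows) {
        StripSketch udf = new StripSketch();
        Map<Object, Boolean> seen = new IdentityHashMap<>();
        for (String row : rows) {
            MutableText out = udf.evaluate(row);
            if (out != null) {
                seen.put(out, Boolean.TRUE); // keyed by identity, not by content
            }
        }
        return seen.size();
    }
}
```

Calling ReuseDemo.distinctResults with any number of non-null rows yields 1, because every call returns the same holder. With a String return type, each row would produce a fresh String, since String is immutable. The trade-off is that the caller has to consume or copy the returned value before the next call to evaluate(), because the next call overwrites it.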