I am trying to create a UDF in Hive. This UDF has to auto-increment a Hive table column called id.

The following is the Java code to create the UDF.
package myudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

// Non-deterministic (returns a new value on every call) and stateful
// (state is kept between calls on the same UDF instance).
@UDFType(deterministic = false, stateful = true)
public class autoincrement extends UDF {
    private int lastValue = 0; // counter local to this UDF instance

    public int evaluate() {
        lastValue++;
        return lastValue;
    }
}
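As far as I understand, each Hive job instantiates its own copy of the UDF (in its own task JVM), so lastValue starts from 0 again on every INSERT. A minimal plain-Python sketch (not Hive code) of that behavior:

```python
class AutoIncrementUDF:
    """Mimics the Java UDF: the counter is local to one UDF instance."""

    def __init__(self):
        self.last_value = 0

    def evaluate(self):
        self.last_value += 1
        return self.last_value


# First INSERT: Hive creates a fresh UDF instance for the job.
job1_udf = AutoIncrementUDF()
id_tim = job1_udf.evaluate()   # 1

# Second INSERT: a new job, hence a new instance with a fresh counter.
job2_udf = AutoIncrementUDF()
id_john = job2_udf.evaluate()  # 1 again, not 2

print(id_tim, id_john)
```

This matches the output observed below: both rows get id 1 because each insert sees a brand-new counter.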
Now I am able to create a jar file and add it to Hive like below:
add jar /home/cloudera/Desktop/increment.jar;
Then create a temporary function:
create temporary function inc as 'myudf.autoincrement';
Create the table like below:
Create table abc (id int, name string);
Insert values:
INSERT into TABLE abc SELECT inc() as id, 'Tim';
Run a select statement:
select * from abc;
Output:
1 Tim
Insert values:
INSERT into TABLE abc SELECT inc() as id, 'John';
Run a select statement:
select * from abc;
Output:
1 Tim
1 John
But what I was expecting when I inserted values the second time was:
1 Tim
2 John
How do I get the expected output? What should I change in the Java code to get the desired result?
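One workaround I am considering (my own idea, not verified on a cluster) is to avoid UDF state entirely and derive the next id from the data already in the table, i.e. something like selecting max(id) + 1 from the table in the INSERT. A plain-Python sketch of the idea, using a list as a stand-in for the Hive table:

```python
# Sketch of the "next id = current max + 1" idea. In Hive this would be
# roughly (illustrative only, not tested):
#   INSERT INTO TABLE abc SELECT coalesce(max(id), 0) + 1, 'John' FROM abc;

table = []  # each row is (id, name)

def insert(name):
    # Next id is one more than the largest id already stored (0 if empty).
    next_id = max((row_id for row_id, _ in table), default=0) + 1
    table.append((next_id, name))

insert("Tim")
insert("John")
print(table)  # [(1, 'Tim'), (2, 'John')]
```

Because the id is computed from the table contents rather than from in-memory state, it survives across separate insert statements.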
And can I use the same function in Spark as well?
In Spark, when I do
sqlContext.sql("show functions")
it shows the list of all functions available in Hive.
But when I do
sqlContext.sql("INSERT into TABLE abc SELECT inc() as id, 'Jim'")
I get the below error:
pyspark.sql.utils.AnalysisException: u'undefined function inc; line 1 pos 29'
How do I create the same UDF in PySpark and get the desired output?
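My guess (an assumption, not verified on a cluster) is that the temporary function is session-scoped: it was registered in the Hive CLI session, not in the Spark session, so the jar has to be added and the function registered again inside the same Spark session before the INSERT. A sketch of the statements I would try, shown here only as plain strings:

```python
# Hypothetical sequence for the Spark session; in a real HiveContext/
# SQLContext session each string would be passed to sqlContext.sql(...).
statements = [
    "ADD JAR /home/cloudera/Desktop/increment.jar",
    "create temporary function inc as 'myudf.autoincrement'",
    "INSERT into TABLE abc SELECT inc() as id, 'Jim'",
]
for s in statements:
    print(s)  # stand-in for: sqlContext.sql(s)
```

Even if the registration succeeds, I expect the counter problem described above to remain, since each Spark task would still get its own UDF instance.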
What happens when the insert statements are executed at the same time?
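To make the simultaneous-insert concern concrete, here is a plain-Python sketch: many "inserts" incrementing one shared counter at once, with a lock standing in for the cluster-wide coordination that (as far as I know) Hive does not provide to a stateful UDF. Without such coordination, two concurrent inserts could read the same counter value and write duplicate ids.

```python
import threading

counter = 0
lock = threading.Lock()
assigned_ids = []

def insert_row():
    """One simulated INSERT taking the next id from a shared counter."""
    global counter
    with lock:  # serialize the read-increment-write sequence
        counter += 1
        assigned_ids.append(counter)

threads = [threading.Thread(target=insert_row) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock, every simulated insert gets a unique id.
print(len(set(assigned_ids)))  # 100
```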
Note: I have checked that the equivalent SQL works in MySQL, so it is the behavior in Hive that needs fixing.