Search code examples
apache-sparkhivepysparkudf

Hive auto increment UDF doesn't give desired results


I am trying to create a UDF in Hive. This UDF has to auto increment a hive table column called id.

Now the following is the Java code to create the UDF.

package myudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;


@UDFType(deterministic = false, stateful = true)
public class autoincrement extends UDF{

      int lastValue;

    public int evaluate() {

     lastValue++;

        return lastValue;

   }

} 

Now I am able create a jar file and add the jar file to hive like below:

add jar /home/cloudera/Desktop/increment.jar;

Then create a temporary function

create temporary function inc as 'myudf.autoincrement';

Create table like below.

Create table abc(id int, name string)

Insert values:

INSERT into TABLE abc SELECT inc() as id, 'Tim';

Do select statement:

select * from abc;

Output:

1  Tim

Insert values:

INSERT into TABLE abc SELECT inc() as id, 'John';

Do select statement:

select * from abc

Output:

1  Tim
1  John

But what I was expecting was when I insert values for the 2nd time.

My expected output was :

1  Tim
2  John

How to get the expected output. What should I change in the Java code to get the desired result?

And Can I use the same function in Spark as well

In spark when I do

sqlContext.sql("show functions") 

It shows the list of all functions available in Hive

But when I do

sqlContext.sql("INSERT into TABLE abc SELECT inc() as id, 'Jim'")

I got the below error

pyspark.sql.utils.AnalysisException: u'undefined function inc; line 1 pos 29'

How to create the same UDF in pyspark and get the desired output

What happens when the insert statements are executed at the same time?


Solution

  • Follow the below steps

    1. change your insert to INSERT into TABLE abc SELECT max(id)+1 as id, 'Tim' from abc;
      or
    2. Modify the UDF to take int column as input and return input+1
    3. modify your insert to INSERT into TABLE abc SELECT inc(max(id)) as id, 'Tim' from abc;

    You have to try the correctness of the SQL in hive as I have checked and it works in MYSQL.