Search code examples
regexhivehiveqlregexp-replace

Strange behaviour of Regexp_replace in a Hive SQL query


I have some input information where I'm trying to remove the part .0 from my input where an ID string ends with .0.

select student_id, regexp_replace(student_id, '.0','') from school_result.credit_records where student_id like '%.0';

Input:

01-0230984.03
12345098.0
34567.0

Expected output:

01-0230984.03 
12345098
34567

But the result I'm getting is as follows: It's removing any character having with a 0 next to it instead of removing only the occurrences that end with .0

0129843
123498
34567

What am I doing wrong? Can someone please help?


Solution

  • Dot in regexp has special meaning (it means any character). If you need dot (.) literally, it should be shielded using double-slash (in Hive). Also add end-of-the-line anchor($):

    with mydata as (
    select stack(3,
    '01-0230984.03',
    '12345098.0',
    '34567.0'
    ) as str
    )
    
    select regexp_replace(str,'\\.0$','') from mydata;
    

    Result:

    01-0230984.03
    12345098
    34567
    

    Regexp '\\.0$' means dot zero (.0) literally, end of the line ($).