Search code examples
regexhiveqlregexp-replace

Extracting digits after certain characters that appear more than once from a string field in Hive


I'm trying to extract all digits that appear after 'dd ->'

I've figured out how to extract first occurence of digits after 'dd ->': regexp_extract(string, 'dd\\s->\\s([0-9]+)') and how to replace all characters except digits regexp_replace(string, '[^0-9]+', '') but failed to find a solution

String: (dd -> 2192, bar -> 1), (dd -> 2670, bar -> 1), (dd -> 2487, bar -> 3),(dd -> 2346, bar -> 3) kk=67457 ghyt=1628 nn=8.67.1

Desired output: 2192 2670 2487 2346

Thanks!


Solution

  • Use

    dd ->( [0-9]+)|.
    

    Replace with $1.

    See regex proof.

    EXPLANATION

    --------------------------------------------------------------------------------
      dd ->                    'dd ->'
    --------------------------------------------------------------------------------
      (                        group and capture to \1:
    --------------------------------------------------------------------------------
                                 ' '
    --------------------------------------------------------------------------------
        [0-9]+                   any character of: '0' to '9' (1 or more
                                 times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
      )                        end of \1
    --------------------------------------------------------------------------------
     |                        OR
    --------------------------------------------------------------------------------
      .                        any character except \n
    

    Trim the first space if needed.