Search code examples
google-bigqueryregexp-replace

How to regexp_replace the first occurrence of decimal if there's more than one in bigquery/googlesql


This is what I have currently, trying to clean a messy string value to only have numbers. Everything else is fine, just the issue now is that the REGEX_REPLACE is moving all decimals. When there is more than one decimal, I want to remove the first occurrence only.

TRIM(REGEXP_REPLACE(CASE
  WHEN column = "-" THEN "0"
  WHEN column = "o" THEN "0"
  WHEN column IS NULL THEN "0"
  WHEN column LIKE '%o%' THEN REPLACE (column, 'o','0')
  WHEN column LIKE '%,%' THEN REPLACE (column, ',', '.')
  WHEN column LIKE '%,%' THEN REGEXP_REPLACE(column, r'\.', '')
  WHEN column LIKE '%/%' THEN "0"
  WHEN column LIKE '%kg%' THEN REPLACE (column, 'kg', '')
  WHEN LENGTH(column) - LENGTH(REPLACE(column, '.', '')) = 1 THEN column
  WHEN REGEXP_EXTRACT(column, 'n') IS NOT NULL THEN "0"
  WHEN REGEXP_EXTRACT(column, 'tb') IS NOT NULL THEN "0"
  WHEN REGEXP_EXTRACT(column, 'REF') IS NOT NULL THEN "0"
  ELSE column
  END,r's/.//', '')) AS Column

I tried a few other REGEX methods but due to lack of understanding it wasn't giving the desired outcome.

An example of this query running is below

Input is the column I'm querying, and output is the result of the query being run against the column. The desired output is what I want as a result of the query. This is being achieved for all rows apart from rows 2 and 5, this is where there are multiple occurrences of decimals, and I'm trying to remove only the first occurrence only

ID  | Input     |Output     |Desired Output
1   | 45988,8   |45988.8    |45988.8
2   | 45.856,90 |45.856.90  |45856.90
3   | 457.88    |457.88     |457.88
4   | 456,788   |456.788    |456.788
5   | 123.45.67 |123.45.67  |12345.67
6   | 45 kg     |45         |45
7   | tbc       |0          |0
8   | na        |0          |0
9   | REF       |0          |0




Solution

  • Consider below approach (BigQuery Standard SQL)

    SELECT id, input, 
      ( SELECT REVERSE(STRING_AGG(IF(OFFSET = 1, '.' || x, x), ''))
        FROM UNNEST(SPLIT(REPLACE(REGEXP_REPLACE(REVERSE(input), r'[^0-9.,]', r''), ',','.'), '.')) x WITH OFFSET 
      ) output
    FROM your_table    
    

    if applied to sample data in your question - output is

    enter image description here