Search code examples
sqlregexgoogle-bigqueryre2

How can I extract all text between single full stops (i.e. ignore `...`) with a re2 regex?


How can I extract all text between single full stops (i.e. ignore ...) with a re2 regex?

I am using the REGEXP_EXTRACT_ALL function in BigQuery which uses https://github.com/google/re2/wiki/Syntax.

From the following example:

This is... a.. sentence. It is just an example.

I would like the query to extract

This is... a.. sentence. and It is just an example.

I am specifically interested in whether this is possible using SQL functions in BigQuery rather than introducing another tool


Solution

  • Consider below workaround

    select text, regexp_replace(sentence, r'(#)(\.+)(#)', r'\2') sentence
    from `project.dataset.table`, 
    unnest(split(trim(regexp_replace(regexp_replace(text, r'(\.+)', r'#\1#'), r'(\#\.\#)', r'####'), '####'), '####')) sentence 
    

    if applied to sample data in your question - output is

    enter image description here