Search code examples
google-bigquerylegacy-sql

How do I split a string column into multi rows of single words & word pairs in BigQuery SQL?


I am trying (unsuccessfully) to split a string column in Google BigQuery into rows containing all single words and all word pairs (next to each other & in order). I also need to maintain the ID field for the words from the IndataTable. Both recordsets have 2 columns.

IndataTable as IDT
ID WordString
1 apple banana pear
2 carrot
3 blue red green yellow

OutdataTable as ODT
ID WordString
1 apple
1 banana
1 pear
1 apple banana
1 banana pear
2 carrot
3 blue
3 red
3 green
3 yellow
3 blue red
3 red green
3 green yellow (only pairs that are next to each other)

Is this possible in BigQuery SQL?

Edit/Added:
This is what I have so far which works for splitting it up into single words. I am really struggling to figure out how to expand this to word pairs. I don't know if this can be modified for it or I need a new approach altogether.

SELECT ID, split(WordString,' ') as Words
FROM (
  select * 
     from 
     (select ID, WordString from IndataTable)
)

Solution

  • Below is for BigQuery Standard SQL

    #standardSQL
    WITH IndataTable AS (
      SELECT 1 id, 'apple banana pear' WordString UNION ALL
      SELECT 2, 'carrot' UNION ALL
      SELECT 3, 'blue red green yellow' 
    ), words AS (
      SELECT id, word, pos
      FROM IndataTable, UNNEST(SPLIT(WordString,' ')) AS Word WITH OFFSET pos
    ), pairs AS (
      SELECT id, CONCAT(word, ' ', LEAD(word) OVER(PARTITION BY id ORDER BY pos)) pair
      FROM words
    )
    SELECT id, word AS WordString FROM words UNION ALL
    SELECT id, pair AS WordString FROM pairs
    WHERE NOT pair IS NULL
    ORDER BY id  
    

    with result as expected :

    Row id  WordString   
    1   1   apple    
    2   1   banana   
    3   1   pear     
    4   1   apple banana     
    5   1   banana pear  
    6   2   carrot   
    7   3   blue     
    8   3   red  
    9   3   green    
    10  3   yellow   
    11  3   blue red     
    12  3   red green    
    13  3   green yellow