I am trying (unsuccessfully) to split a string column in Google BigQuery into rows containing all single words and all word pairs (next to each other & in order). I also need to maintain the ID field for the words from the IndataTable. Both recordsets have 2 columns.
IndataTable as IDT
ID WordString
1 apple banana pear
2 carrot
3 blue red green yellow
OutdataTable as ODT
ID WordString
1 apple
1 banana
1 pear
1 apple banana
1 banana pear
2 carrot
3 blue
3 red
3 green
3 yellow
3 blue red
3 red green
3 green yellow (only pairs that are next to each other)
Is this possible in BigQuery SQL?
Edit/Added:
This is what I have so far which works for splitting it up into single words. I am really struggling to figure out how to expand this to word pairs. I don't know if this can be modified for it or I need a new approach altogether.
SELECT ID, split(WordString,' ') as Words
FROM (
select *
from
(select ID, WordString from IndataTable)
)
Below is for BigQuery Standard SQL
#standardSQL
WITH IndataTable AS (
SELECT 1 id, 'apple banana pear' WordString UNION ALL
SELECT 2, 'carrot' UNION ALL
SELECT 3, 'blue red green yellow'
), words AS (
SELECT id, word, pos
FROM IndataTable, UNNEST(SPLIT(WordString,' ')) AS Word WITH OFFSET pos
), pairs AS (
SELECT id, CONCAT(word, ' ', LEAD(word) OVER(PARTITION BY id ORDER BY pos)) pair
FROM words
)
SELECT id, word AS WordString FROM words UNION ALL
SELECT id, pair AS WordString FROM pairs
WHERE NOT pair IS NULL
ORDER BY id
with result as expected :
Row id WordString
1 1 apple
2 1 banana
3 1 pear
4 1 apple banana
5 1 banana pear
6 2 carrot
7 3 blue
8 3 red
9 3 green
10 3 yellow
11 3 blue red
12 3 red green
13 3 green yellow