Search code examples
arraysgoogle-bigquerystructure

Big Query Multiple Independent Arrays of Structures


I am trying to create two arrays of structures that are independent of each other.

I am trying to get these arrays created from another table that is an import of a flat file with different record types depending on the data (Rec0015 - Earnings and Rec0025 - Deductions). Each record type has a slightly different layout, so I import the record data as a string that is "~" delimited. I have simplified my examples below to only show the basic data needed to illustrate my issues.

If I run the queries independently of each other each array is created properly; however, when I combine both I cannot get it to work.

For the first query:

SELECT 
     flat.EmployeeNumber,
     ARRAY_AGG(STRUCT(
          flat.RecordNumberEarn,
          flat.EmployeeEarningsAmount
     ))
     as EmployeeEarningsDetail
FROM 
(SELECT DISTINCT
     stage.EmployeeNumber,
     Rec0015.RecordNumberEarn,
     Rec0015.EmployeeEarningsAmount,
FROM  bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging stage
INNER JOIN 
(SELECT EmployeeNumber,
 SPLIT(VariableData, '~')[SAFE_OFFSET(0)] AS RecordNumberEarn,
 SPLIT(VariableData, '~')[SAFE_OFFSET(4)] AS EmployeeEarningsAmount
 FROM bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging
 WHERE RecordTypeNumber = '0015'
) Rec0015
ON stage.EmployeeNumber = Rec0015.EmployeeNumber  
) as flat 
GROUP BY
     flat.EmployeeNumber

I get the following result (which is correct):

EmployeeNumber  EmployeeEarningsDetail
xxxx521     "{
  ""EmployeeEarningsDetail"": [{
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }]
}" ....

When I run the "full" query:

SELECT 
     flat2.EmployeeNumber,
     ARRAY_AGG(STRUCT(
          Rec0025.RecordNumberDed,
          Rec0025.EmployeePayDeductionAmount
     ))
     as EmployeeDeductionsDetail
FROM 

(SELECT 
     flat.EmployeeNumber,
     ARRAY_AGG(STRUCT(
          flat.RecordNumberEarn,
          flat.EmployeeEarningsAmount
     ))
     as EmployeeEarningsDetail
FROM 
(SELECT DISTINCT
     stage.EmployeeNumber,
     Rec0015.RecordNumberEarn,
     Rec0015.EmployeeEarningsAmount,
FROM  bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging stage
INNER JOIN 
(SELECT EmployeeNumber,
 SPLIT(VariableData, '~')[SAFE_OFFSET(0)] AS RecordNumberEarn,
 SPLIT(VariableData, '~')[SAFE_OFFSET(4)] AS EmployeeEarningsAmount
 FROM bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging
 WHERE RecordTypeNumber = '0015'
) Rec0015
ON stage.EmployeeNumber = Rec0015.EmployeeNumber  
) as flat 
GROUP BY
     flat.EmployeeNumber
) as flat2
INNER JOIN 
(SELECT DISTINCT EmployeeNumber,
 SPLIT(VariableData, '~')[SAFE_OFFSET(0)] AS RecordNumberDed,
 SPLIT(VariableData, '~')[SAFE_OFFSET(5)] AS EmployeePayDeductionAmount
 FROM bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging
 WHERE RecordTypeNumber = '0025'
) Rec0025 
ON flat2.EmployeeNumber = Rec0025.EmployeeNumber
GROUP BY
     flat2.EmployeeNumber

I get the following result (which is correct, but doesn't include the first array structure):

EmployeeNumber  EmployeeDeductionsDetail
xxxx521     "{
  ""EmployeeDeductionsDetail"": [{
    ""RecordNumberDed"": ""0001"",
    ""EmployeePayDeductionAmount"": ""       50.65""
  }, {
    ""RecordNumberDed"": ""0002"",
    ""EmployeePayDeductionAmount"": ""       44.15""
  }, {
    ""RecordNumberDed"": ""0003"",
    ""EmployeePayDeductionAmount"": ""       44.15""
  }, {
    ""RecordNumberDed"": ""0004"",
    ""EmployeePayDeductionAmount"": ""       10.33""
  }, {
    ""RecordNumberDed"": ""0005"",
    ""EmployeePayDeductionAmount"": ""       10.33""
  }, {
    ""RecordNumberDed"": ""0006"",
    ""EmployeePayDeductionAmount"": ""       61.54""
  }, {
    ""RecordNumberDed"": ""0007"",
    ""EmployeePayDeductionAmount"": ""       13.22""
  }, {
    ""RecordNumberDed"": ""0008"",
    ""EmployeePayDeductionAmount"": ""        7.84""
  }, {
    ""RecordNumberDed"": ""0009"",
    ""EmployeePayDeductionAmount"": ""        0.69""
  }, {
    ""RecordNumberDed"": ""0010"",
    ""EmployeePayDeductionAmount"": ""        5.00""
  }]
}" ...

However, in the 2nd query, I really want to do a group by on the array structure EmployeeEarningsDetail, but when I add the array to the select and group by I get the error:

"Grouping by expressions of type ARRAY is not allowed."

I tried adding a TO_JSON_STRING(EmployeeEarningsDetail) in both the Select and the Group by, but I got a column of just the string not as an array as below:

SELECT 
     flat2.EmployeeNumber,
     TO_JSON_STRING(flat2.EmployeeEarningsDetail),
     ARRAY_AGG(STRUCT(
          Rec0025.RecordNumberDed,
          Rec0025.EmployeePayDeductionAmount
     ))
     as EmployeeDeductionsDetail
FROM 

(SELECT 
     flat.EmployeeNumber,
     ARRAY_AGG(STRUCT(
          flat.RecordNumberEarn,
          flat.EmployeeEarningsAmount
     ))
     as EmployeeEarningsDetail
FROM 
(SELECT DISTINCT
     stage.EmployeeNumber,
     Rec0015.RecordNumberEarn,
     Rec0015.EmployeeEarningsAmount,
FROM  bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging stage
INNER JOIN 
(SELECT EmployeeNumber,
 SPLIT(VariableData, '~')[SAFE_OFFSET(0)] AS RecordNumberEarn,
 SPLIT(VariableData, '~')[SAFE_OFFSET(4)] AS EmployeeEarningsAmount
 FROM bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging
 WHERE RecordTypeNumber = '0015'
) Rec0015
ON stage.EmployeeNumber = Rec0015.EmployeeNumber  
) as flat 
GROUP BY
     flat.EmployeeNumber
) as flat2
INNER JOIN 
(SELECT DISTINCT EmployeeNumber,
 SPLIT(VariableData, '~')[SAFE_OFFSET(0)] AS RecordNumberDed,
 SPLIT(VariableData, '~')[SAFE_OFFSET(5)] AS EmployeePayDeductionAmount
 FROM bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging
 WHERE RecordTypeNumber = '0025'
) Rec0025 
ON flat2.EmployeeNumber = Rec0025.EmployeeNumber
GROUP BY
     flat2.EmployeeNumber,
     TO_JSON_STRING(flat2.EmployeeEarningsDetail)

The results were (not correct at all) are below the JSON shows an f0_ string of all the earnings and the first Deduction followed by another row with the rest of the deductions in an array See below:

EmployeeNumber  f0_ EmployeeDeductionsDetail
xxxx521     "[{""RecordNumberEarn"":""0001"",""EmployeeEarningsAmount"":""      375.52""},{""RecordNumberEarn"":""0002"",""EmployeeEarningsAmount"":""      387.26""}]" "{
  ""EmployeeDeductionsDetail"": [{
    ""RecordNumberDed"": ""0001"",
    ""EmployeePayDeductionAmount"": ""       50.65""
  }, {
    ""RecordNumberDed"": ""0002"",
    ""EmployeePayDeductionAmount"": ""       44.15""
  }, {
    ""RecordNumberDed"": ""0003"",
    ""EmployeePayDeductionAmount"": ""       44.15""
  }, {
    ""RecordNumberDed"": ""0004"",
    ""EmployeePayDeductionAmount"": ""       10.33""
  }, {
    ""RecordNumberDed"": ""0005"",
    ""EmployeePayDeductionAmount"": ""       10.33""
  }, {
    ""RecordNumberDed"": ""0006"",
    ""EmployeePayDeductionAmount"": ""       61.54""
  }, {
    ""RecordNumberDed"": ""0007"",
    ""EmployeePayDeductionAmount"": ""       13.22""
  }, {
    ""RecordNumberDed"": ""0008"",
    ""EmployeePayDeductionAmount"": ""        7.84""
  }, {
    ""RecordNumberDed"": ""0009"",
    ""EmployeePayDeductionAmount"": ""        0.69""
  }, {
    ""RecordNumberDed"": ""0010"",
    ""EmployeePayDeductionAmount"": ""        5.00""
  }]
}" ...

I have tried to put both ARRAY_AGGs in the same SELECT my arrays are cartesian products:

The SQL is:

SELECT 
     flat.EmployeeNumber,
     ARRAY_AGG(STRUCT(
          flat.RecordNumberEarn,
          flat.EmployeeEarningsAmount
     ))
     as EmployeeEarningsDetail,
     ARRAY_AGG(STRUCT(
          flat.RecordNumberDed,
          flat.EmployeePayDeductionAmount
     ))
     as EmployeeDeductionsDetail
FROM 
(SELECT DISTINCT
     stage.EmployeeNumber,
     Rec0015.RecordNumberEarn,
     Rec0015.EmployeeEarningsAmount,
     Rec0025.RecordNumberDed,
     Rec0025.EmployeePayDeductionAmount
FROM  bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging stage
INNER JOIN 
(SELECT EmployeeNumber,
 SPLIT(VariableData, '~')[SAFE_OFFSET(0)] AS RecordNumberEarn,
 SPLIT(VariableData, '~')[SAFE_OFFSET(4)] AS EmployeeEarningsAmount
 FROM bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging
 WHERE RecordTypeNumber = '0015'
) Rec0015
ON stage.EmployeeNumber = Rec0015.EmployeeNumber  
INNER JOIN 
(SELECT DISTINCT EmployeeNumber,
 SPLIT(VariableData, '~')[SAFE_OFFSET(0)] AS RecordNumberDed,
 SPLIT(VariableData, '~')[SAFE_OFFSET(5)] AS EmployeePayDeductionAmount
 FROM bq_hpm_ppm_dev.EmployeeGrossToNetPayStaging
 WHERE RecordTypeNumber = '0025'
) Rec0025 
ON stage.EmployeeNumber = Rec0025.EmployeeNumber
) as flat
GROUP BY
     flat.EmployeeNumber

and the results are:

EmployeeNumber  EmployeeEarningsDetail  EmployeeDeductionsDetail
xxxx521     "{
  ""EmployeeEarningsDetail"": [{
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0001"",
    ""EmployeeEarningsAmount"": ""      375.52""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }, {
    ""RecordNumberEarn"": ""0002"",
    ""EmployeeEarningsAmount"": ""      387.26""
  }]
}"  "{
  ""EmployeeDeductionsDetail"": [{
    ""RecordNumberDed"": ""0001"",
    ""EmployeePayDeductionAmount"": ""       50.65""
  }, {
    ""RecordNumberDed"": ""0002"",
    ""EmployeePayDeductionAmount"": ""       44.15""
  }, {
    ""RecordNumberDed"": ""0003"",
    ""EmployeePayDeductionAmount"": ""       44.15""
  }, {
    ""RecordNumberDed"": ""0004"",
    ""EmployeePayDeductionAmount"": ""       10.33""
  }, {
    ""RecordNumberDed"": ""0005"",
    ""EmployeePayDeductionAmount"": ""       10.33""
  }, {
    ""RecordNumberDed"": ""0006"",
    ""EmployeePayDeductionAmount"": ""       61.54""
  }, {
    ""RecordNumberDed"": ""0007"",
    ""EmployeePayDeductionAmount"": ""       13.22""
  }, {
    ""RecordNumberDed"": ""0008"",
    ""EmployeePayDeductionAmount"": ""        7.84""
  }, {
    ""RecordNumberDed"": ""0009"",
    ""EmployeePayDeductionAmount"": ""        0.69""
  }, {
    ""RecordNumberDed"": ""0010"",
    ""EmployeePayDeductionAmount"": ""        5.00""
  }, {
    ""RecordNumberDed"": ""0001"",
    ""EmployeePayDeductionAmount"": ""       50.65""
  }, {
    ""RecordNumberDed"": ""0002"",
    ""EmployeePayDeductionAmount"": ""       44.15""
  }, {
    ""RecordNumberDed"": ""0003"",
    ""EmployeePayDeductionAmount"": ""       44.15""
  }, {
    ""RecordNumberDed"": ""0004"",
    ""EmployeePayDeductionAmount"": ""       10.33""
  }, {
    ""RecordNumberDed"": ""0005"",
    ""EmployeePayDeductionAmount"": ""       10.33""
  }, {
    ""RecordNumberDed"": ""0006"",
    ""EmployeePayDeductionAmount"": ""       61.54""
  }, {
    ""RecordNumberDed"": ""0007"",
    ""EmployeePayDeductionAmount"": ""       13.22""
  }, {
    ""RecordNumberDed"": ""0008"",
    ""EmployeePayDeductionAmount"": ""        7.84""
  }, {
    ""RecordNumberDed"": ""0009"",
    ""EmployeePayDeductionAmount"": ""        0.69""
  }, {
    ""RecordNumberDed"": ""0010"",
    ""EmployeePayDeductionAmount"": ""        5.00""
  }]
}" ...

Any suggestions on how I can "fix" this.

Thanks, David


Solution

  • A workaround for GROUP BY when the type does not support aggregation, e.g. STRUCT or GEOGRAPHY is to turn it to a STRING key. TO_JSON_STRING can do it for STRUCT, ST_AsText can do it for GEOGRAPHY, etc. Note this might not be very performant.

    To get the original value, without any modification, use ANY_VALUE aggregation function - it picks arbitrary value from its inputs, here all the values are presumably the same, so we don't care which one.