Search code examples
google-cloud-dataflowapache-beambeam-sql

BeamSQL Group By query problem with Float value


Tried to get the unique value from the BigQuery table using BeamSQL in Google Dataflow. Using Group By clause implemented the condition in BeamSQL (sample query below). One of the column has float data type. While executing the Job got below exceptions,

Caused by: org.apache.beam.sdk.coders.Coder$NonDeterministicException: org.apache.beam.sdk.coders.RowCoder@81d6d10 is not deterministic because: All fields must have deterministic encoding. Caused by: org.apache.beam.sdk.coders.Coder$NonDeterministicException: FloatCoder is not deterministic because: Floating point encodings are not guaranteed to be deterministic.

BeamSQL Query:

PCollection ST= mainColl.apply(SqlTransform.query("SELECT ID,ITEM,UNITPRICE FROM PCOLLECTION GROUP BY ID,ITEM,UNITPRICE"));

It would be good if some one help me to solve this issue.

Please note, if we remove the float column then BeamSQL query works fine.


Solution

  • This is indicating that you should not use floating point values (in this case probably UNITPRICE value) in the aggregation (group by) scheme, because their output is non-deterministic (i.e. it can change based on the precision change). For instance, consider this example:

    WITH
      data AS (
      SELECT 100 AS id, 'abc' as item, 0.3448473362800000001 AS unitprice
      UNION ALL
      SELECT 200 AS id, 'xyz' as item, 0.49300013 AS unitprice
      UNION ALL
      SELECT 500 AS id, 'pqr' as item, 0.67322332200000212 AS unitprice
    )
    select id, item, unitprice from data
    group by id, item, unitprice
    

    Output for this comes to be:

    100 abc 0.34484733628    
    200 xyz 0.49300013   
    500 pqr 0.6732233220000021
    

    in which, the unitprice values look a bit different.

    To avoid this, you can go two routes:

    • You can cast your unitprice as a string and then continue with grouping. Something like cast(unitprice as string) as unitprice in your query.
    • You can simply choose to keep unitprice as non-grouped entity (which is a logical option in most cases), and just do max(unitprice) as unitprice or avg(unitprice) as unitprice in your query, while grouping by id, item.

    Hope this helps.