Search code examples
google-cloud-platformgoogle-cloud-dataflowapache-beambeam-sql

Beam SQL / Apache Beam is Slower when Running Multiple Joins


While doing joins on 2 tables using Beam SQL then it's working properly provide expected performance but as my Joining Tables increases then the performance become worst.

Below is my snippet which might help you to debug my Joining condition in Beam SQL to get better performance.

PCollection<Row> outputStream2 = PCollectionTuple.of(new TupleTag<>("corporation1"), sourceData)
                .and(new TupleTag<>("dim"), dimtable).and(new TupleTag<>("place"), placeData)
                .and(new TupleTag<>("principle"), principle).apply(SqlTransform.query(
                        "Select d.merchant,d.corporation1,d.place,d.principal,c.corporation1_sk,r.place_sk,p.principal_sk FROM dim d LEFT JOIN corporation1 c ON c.corporation1 = d.corporation1 LEFT JOIN place p ON p.place = d.place and c.corporation1 = p.corporation1 "));

Any better way I can do Joins on Beam SQL/ Apache Beam because Sequential Joins in Table,

Where the previous output is responsible for next table Joins. I have tried using Co-GroupBy and SideInput mixed approach also where data in Table is below 5K I took the SideInput and where data is above 50K used Co-GroupBy for Joining but the performance is not upto the mark.


Solution

  • It looks like you're seeing something similar to this issue, which currently doesn't have an ETA for a fix. Beam SQL itself doesn't do a lot of JOIN optimizations at the moment, it chooses the most appropriate approach (side-input, CoGBK) based on what kind of inputs it receives, but that's pretty much it, you cannot control it otherwise.

    It is hard to tell for sure without knowing your specific setup, e.g. what kind of data sources you have, or how did you make sure to use side-input vs CoGBK, or what runner you use, or what performance you expect vs what you're actually observing.

    Relevant links, if you need more details: