Search code examples
joinleft-joindata-modelinglooker-studio

Represent on data studio entries duplicated by joins


I am working on a project building the ETL process and dashboard to control some KPI metrics. I have created a table in BigQuery where, once a month, I save some key values calculated by aggregating data extracted from other table. I am measuring emails sent by employees, so in order to calculate one of those key values I need to read from two different tables and perform a left join to match, from each of the company's working areas existing in the aggregation (left table), how many employees that area has (right join).

This is a simplification of my tables:

Sent emails, grouped by area

|  Area Id  |  Service  |  Bad employees  | ...
|     1     |   Gmail   |      3416       | ...
|     2     |   Gmail   |     10782       | ...
|     2     |   Groups  |      9267       | ...

Total number of employees, grouped by area

|  Area Id  |  Total employees  | ...
|     1     |       34124       | ...
|     2     |       82561       | ...
|     3     |       49472       | ...

The problem comes here: as you can see, the first table (sent emails) has a field which does not appear on the second one; I am talking about Service. For this reason, when I join both tables I will get duplicated values for the Total employees field:

Joined table

|  Area Id  |  Service  |  Bad employees  |  Total employees  |
|     1     |   Gmail   |      3416       |       34124       |
|     2     |   Gmail   |     10782       |       82561       |
|     2     |   Groups  |      9267       |       82561       |

This final table will be used to create a report in Data Studio. I want to keep the Service field in my final table as I want to give the users the option of filtering by it. I can not edit the employees table schema and add a Service field to its entries because that information is unique from the emails table, it represents the service from which the email was sent and has nothing to do with the employees table.

I am struggling to get a valid data modeling option for this problem; if go with this solution and I want to represent on Data Studio, let's say, the Total number of employees per selected areas, I will get the wrong value for those areas containing multiple services:

  • Total employees area 1: 34.124
  • Total employees area 2: 82.561 + 82.561 = 165.122
  • Total employees: 34.124 + 165.122 = 199.246
  • Expected value: 34.124 + 82.561 = 116.685

This will affect any metric using the total employees value.

How can I keep the Service field of my joined table and still represent on Data Studio the correct value for Total employees?


Solution

  • I solved the problem by using nested and repeated fields. I thought Data Studio couldn't filter by values inside a repeated field, but I have checked that it is possible so that fits perfectly my use case.

    Joined table schema:

    [
      {
        "mode": "REQUIRED",
        "name": "id",
        "type": "INTEGER"
      },
      { 
        "mode": "REPEATED",
        "name": "service",
        "type": "RECORD",
        "fields": [
          {
            "mode": "NULLABLE",
            "name": "name",
            "type": "STRING"
          },
          {
            "mode": "NULLABLE",
            "name": "bad_employees",
            "type": "INTEGER",
          }
        ]
      },
      {
        "mode": "NULLABLE",
        "name": "total_employees",
        "type": "INTEGER",
        "description": "Sum of the emails sent during off hours for all the sources"
      },
    ]
    

    Joined table representation:

    |    id     |  service.name  |  service.bad_employees  |  total_employees  |
    |     1     |     Gmail      |          3416           |       34124       |
    |     2     |     Gmail      |         10782           |       82561       |
    |           |     Groups     |          9267           |                   |
    

    This way, I can get the correct sum of bad_employees by performing SUM(service.bad_employees), and the correct value for total_employees with SUM(total_employees).

    Also, if I want to filter only by a certain service, I can add a control on the field service.name and it will filter properly.