Search code examples
looker-studio

GDS Data categorisation/homogenisation, possible with RegEx?


I'm trying to homogenise my data in GDS, specifically the names of companies/organisations.

In this example, I'm trying to use GDS to explore participant data using self-filled out questionnaires.

The issue here being is that people tend to represent their company name in different ways e.g. Google, Google LLC, GoogleLLCUSA etc.

So because there is a somewhat limited pool of companies the participants would be from, I had the idea of using the suffix of their email addresses to extract the company name. e.g. [email protected] = Google.

I've made a table in google sheets with a list of keywords to look for in the email field, and what companies they represent, but I'm struggling to make this compatible in GDS.

Up until now i've been writing out the alternatives in a calculated field using Regex Match e.g.

case
      when REGEXP_MATCH(email,'.*(?i)google.*') THEN "Google"

else Organisation - Remove LLP END

but as there are over 600 different companies we deal with, this isn't a viable option due to the 255 case limit in GDS.

I've blended the two data sources together (participant data with the email addresses & the spreadsheet with the suffixes/company names, but constantly get errors when using the variables in RegEx Match statements.

E.g. (I've tried a few different formatting layouts too but can't seem to make it work error free)

CASE WHEN REGEXP_MATCH(EmailInParticipantDatabase, KeywordIn2ndSpreadsheet) THEN CompanyNameIn2ndSpreadsheet else Organisation END

Mostly I just get a generic 'syntax error' with no more detail, and occasionally (Syntax error: Expected ")" but got identifier "virtualt0c0d1")

Is it possible to use variables in this way or am I barking up the wrong tree?


Solution

  • Extracting the domain from an e-mail is possible in one step:

    LOWER(REGEXP_EXTRACT( user , '@([^.]+)[.]+[^.]+$' ))

    For a higher limit of the case when statements, these can be split in different field formulas. Do the first 200 when statements and save the field as test1. Create another field with the next 200 when statements and finish with else test1 end. Build the next field and so on. 600 statements should be no problem. Even more statements could slow down large datasets.