Search code examples
google-bigqueryexecute-immediate

How to convert strings into column names in Google Bigquery?


I have a large data set in Google BigQuery with millions of rows of dirty data (App tracking) that I am trying to clean up. One of my problems is that the same data got sent to different columns for different events triggered in the App. By this I mean that maybe the country was sent to custom dimension 1 for some events, but to custom dimension 147 for other events. I can't post actual data, but a SELECT * FROM table_with_dirty_data would produce something like this:

date | session | eventAction | cd001 | cd002    | cd004    | cd005 
-----|---------|-------------|-------|----------|----------|-------
1    | 1       | 'event_1'   | '1'   | 'Pizza'  | null     | '21'
1    | 1       | 'event_2'   | '10'  | '25'     | 'Pizza'  | '14.56'
1    | 1       | 'event_3'   | '3.1' | null     | '15'     | 'France'
1    | 2       | 'event_1'   | '6'   | 'Burger' | null     | '21'
1    | 2       | 'event_2'   | '21'  | '25'     | 'Burger' | '12.6'

The ultimate goal here is to end up with a clean table that can be analyzed. Something like this:

date | session | eventAction | country  | vendor   | product  | price
-----|---------|-------------|----------|----------|----------|-------
1    | 1       | 'event_1'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 1       | 'event_2'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 1       | 'event_3'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 2       | 'event_1'   | 'Spain'  | '25'     | 'Burger' | '12.6'
1    | 2       | 'event_2'   | 'Spain'  | '25'     | 'Burger' | '12.6'

I am aware that some events will need some degree of statistical imputation and data type casting, but for now I am only concerned with getting every variable into its own column. So I've created an auxiliary table (let's call it matrix) that looks like this:

event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_2'  | 'cd020'    | 'cd146'    | 'cd056'
'event_3'  | 'cd001'    | 'cd020'    | 'cd035'

and so on, where the value in each cell of column variable_# is the column name in the table_with_dirty_data where the information can be found. That is, variable_1 can be found in a column called cd020 for events with event_name 'event_1' and 'even_2', but in a column called cd001 for events with event_name 'event_3'. So basically what matrix does is to map to which custom dimension each variable was sent for each event.

I have hundreds of different events in table_with_dirty_data, and matrix holds all 200 GA custom dimensions, so doing something like

SELECT 
  CASE 
    WHEN event_name = 'event_1' THEN cd020
    WHEN event_name = 'event_2' THEN cd020
    WHEN event_name = 'event_3' THEN cd001
  END AS variable_1
  , CASE
      WHEN event_name = 'event_1' THEN cd035
      WHEN event_name = 'event_2' THEN cd146
      WHEN event_name = 'event_3' THEN cd020
    END AS variable_2
  , CASE
      WHEN event_name = 'event_1' THEN cd120
      WHEN event_name = 'event_2' THEN cd056
      WHEN event_name = 'event_3' THEN cd035
  END AS variable_3
FROM table_with_dirty_data

would take me a looooong time and is very error-prone. What I am trying to do is to use SELECT statements that return the column name (cd###) from the table_with_dirty_data where the information is stored and the use a WHILE to loop over all different events. So, for example, with event_name = 'event_1' it would be something like this:

SELECT 
  CASE
    WHEN event_name = 'event_1'
    THEN (SELECT variable_1 FROM matrix WHERE event_name = 'event_1')
  END AS variable_1
  , CASE
      WHEN event_name = 'event_1'
      THEN (SELECT variable_2 FROM matrix WHERE event_name = 'event_1')
    END AS variable_2
  , CASE
      WHEN event_name = 'event_1'
      THEN (SELECT variable_3 FROM matrix WHERE event_name = 'event_1')
    END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'

The goal here is that I can then loop over an array with all the event_names (something that can be easily achieved). Ultimately I need all events in one table, but I'm ok with having a table for curating each event, as long as it can be done programmatically (something I personally don't know if it's even possible in GBQ...would have to check on that).

The issue is that the SELECT statements I am using get evaluated to a string, so the result of the queries inside the CASE clause are string literals. For example, if I were to run it for event = 'event_1',

SELECT variable_1 FROM matrix WHERE event_name = 'event_1'

evaluates to

'cd020'

which then cause the outer query to become

SELECT 
  CASE
    WHEN event_name = 'event_1' THEN 'cd020'
  END AS variable_1
  , CASE
      WHEN event_name = 'event_1' THEN 'cd035'
    END AS variable_2
  , CASE
      WHEN event_name = 'event_1' THEN 'cd120'
    END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'

which produces a table like this

event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'

instead of returning the value stored in columns cd020, cd035 or cd120 for each time the event 'event_1' is found.

Does anyone know a way to unquote the result of those inner queries so that when the outer query executes, they are transformed into column names (so 'cd020' becomes cd020)??

PS: I am also open to a totally different strategy if anyone knows one.


Solution

  • I solved the problem using the EXECUTE IMMEDIATE clause as recommended by @shollyman. I did it for just one of the events but I believe this answers the original question (scaling it to other events is just a matter of writing a WHILE loop). I'll go step by step since I'm not using an actual repex.

    Step one I declared a variable called event with the name of the event that the query needs to look for.

    DECLARE event STRING DEFAULT 'event_1';
    

    Then I declared a variable that holds the column name where the query needs to look for the variable.

    DECLARE variable_name STRING DEFAULT (SELECT variable_1 FROM matrix WHERE event_name = event);
    

    Then I wrote the query as I would normally do, but using the EXECUTE IMMEDIATE clause. I used triple double-quotes so that I could break it into several lines for readability).

    EXECUTE IMMEDIATE CONCAT("""
      SELECT
        CASE WHEN event_name = '""", event, "' THEN ", variable_name, """ END AS variable_1 
      FROM table_with_dirty_data
      WHERE event_name = '""", event, """'
    """);
    

    If others are going to use this, do take note of the lonely single quotes that I used before or after some triple double quotes. I did that because, for example, the declared variable event, even though it's a string, seems to get concatenated as event (no single quotes around it) and that breaks the query execution.