Search code examples
postgresqldbt

Create rows from part of column names


Source data

I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.

In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.

The resulting jsonb column in the raw data table looks like this:

id activity
12345678 {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432 {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}

JSON to columns

Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:

id Date (05/19/2020) Type (05/19/2020) Date (06/03/2020) Type (06/03/2020) Type (10/23/2020 Date (10/23/2020) Type (10/23/2020)
10629465 null null 06/01/2020 E
98765432 05/18/2020 B 10/26/2020 T

Need to move part of column name to row

Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:

id ActivityDate Date Type
12345678 05/19/2020 null null
12345678 06/03/2020 06/01/2020 E
98765432 05/19/2020 05/18/2020 B
98765432 10/23/2020 10/26/2020 T

Solution

  • I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.

    A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:

    with parse as (
      select id, e.k, e.v,
             regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
             regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
        from rawinput
             cross join lateral jsonb_each_text(activity) as e(k, v)
    )
    select id,
           k_date_only as activity_date,
           min(v) filter (where k_no_date = 'Date') as date,
           min(v) filter (where k_no_date = 'Type') as type
      from parse 
     group by id, k_date_only;
    

    db<>fiddle here