Tags: python, sql, pyspark

Overcoming schema/column inconsistency across datasets


I've found out that the same fields/columns have a different business meaning for certain regions in my dataset.

I have a few markets, but let's take GB and US as an example. Most of the columns have the same meaning, but there are pairs which don't, like:

SB1 for US is Strength Evaluation

SB2 for US is Power Evaluation

while

SB1 for GB is Power Evaluation

SB2 for GB is Strength Evaluation

and it's the case across the whole dataset that for one country SB1 is Power and SB2 is Strength, while for another it's reversed, and so on. I'm ideally looking for advice in PySpark, but since I will be working in Databricks, SQL would do as well.

My silver layer looks like this:

ID  Market  CK   SB1  SB2  SbX  ColX
1   US      1US  2    1    9    9
2   US      2US  2    2    9    9
3   US      3US  1    1    9    9
1   GB      1GB  3    5    9    9
2   GB      2GB  4    4    9    9
3   GB      3GB  5    3    9    9

I guess the expected output in that scenario is the following (look at the SB1 and SB2 columns):

ID  Market  CK   SB1  SB2  SbX  ColX
1   US      1US  2    1    9    9
2   US      2US  2    2    9    9
3   US      3US  1    1    9    9
1   GB      1GB  5    3    9    9
2   GB      2GB  4    4    9    9
3   GB      3GB  3    5    9    9

Each dataset has over 50 columns, and there are almost 10 markets.

Any pro tips or ideas on how to handle this? I guess it can't be solved at the ingestion level, so the raw zone is not the place to make it happen; the curated dataset has to be transformed and the values filled accordingly.


Solution

  • You can use the standard SQL CASE..WHEN construct as follows:

    select id, market, ck,
           case market when 'US' then SB1
                       when 'GB' then SB2
                       else SB1  -- other markets keep their value instead of getting NULL
           end as SB1,
           case market when 'US' then SB2
                       when 'GB' then SB1
                       else SB2
           end as SB2,
           sbx, colx
      from your_table
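
    Since you asked for PySpark: the same CASE..WHEN logic maps directly to when/otherwise. A minimal sketch, assuming your silver-layer table is loaded into a DataFrame named df (the name is illustrative):

    from pyspark.sql import functions as F

    # Swap SB1/SB2 for GB rows; other markets keep their original values
    df_fixed = df.select(
        "ID", "Market", "CK",
        F.when(F.col("Market") == "GB", F.col("SB2")).otherwise(F.col("SB1")).alias("SB1"),
        F.when(F.col("Market") == "GB", F.col("SB1")).otherwise(F.col("SB2")).alias("SB2"),
        "SbX", "ColX",
    )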
    

    Alternatively, you can use UNION ALL as follows:

    select id, market, ck, sb1, sb2, sbx, colx from your_table where market = 'US'
    UNION ALL
    select id, market, ck, sb2 as sb1, sb1 as sb2, sbx, colx from your_table where market = 'GB'
    -- with more markets, add a branch per market (or a catch-all: where market not in ('US', 'GB'))
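
    With 50+ columns and almost 10 markets, hand-writing a CASE per column gets unwieldy. One way to generalize is to keep the per-market column mapping in one place and generate the expressions from it. A sketch, assuming a hypothetical column_map dictionary that you maintain yourself (any market/column pair not listed falls through unchanged):

    from pyspark.sql import functions as F

    # Hypothetical mapping: per market, which physical column feeds each logical column
    column_map = {
        "GB": {"SB1": "SB2", "SB2": "SB1"},
        # add an entry per market whose columns deviate
    }

    remapped = {c for m in column_map.values() for c in m}

    def resolve(col_name):
        # Build the CASE..WHEN chain for one logical column from the mapping
        expr = F.col(col_name)
        for market, mapping in column_map.items():
            if col_name in mapping:
                expr = F.when(F.col("Market") == market, F.col(mapping[col_name])).otherwise(expr)
        return expr.alias(col_name)

    df_fixed = df.select(*[resolve(c) if c in remapped else F.col(c) for c in df.columns])

    This keeps the business rules in a single dictionary, so onboarding another market means editing the mapping rather than rewriting the query.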