Though my post is similar to this, I still feel this is in a way different from that one.
I have 2 CSV files.
File A File B
+-------------------------------------------------------------------+
| Name | Country Name | Country |
+-------------------------------------------------------------------+
| Ferrari | Italy Jaguar | British |
| Mercedes | Germany Chevrolet | America |
| Jaguar | British Bugatti | Italy |
| Nissan | Japan Tata | India |
| Chevrolet | USA Nissan | Japan |
+-------------------------------------------------------------------+
The above is used for illustration only. In general I have a lot more rows and columns in both of the files but their structure is same.
I was asked to perform a row level comparison of all the columns efficiently. So, instead of doing it programatically, I proposed using HSQLDB to
CREATE TEXT TABLE
and SET SOURCE
with the files respectively and then do a EXCEPT
operation between the files. I did code it and it works like a charm. Below is the SQL part that I coded to achieve the same.
CREATE TABLE COMPARE_TABLE AS (SELECT SRC.*, 'SRC-TGT' compare_order FROM TABLEA SRC EXCEPT SELECT TGT.*, 'SRC-TGT' compare_order FROM TABLEB TGT) WITH DATA;
INSERT INTO COMPARE_TABLE SELECT TGT.*, 'TGT-SRC' compare_order FROM TABLEB TGT EXCEPT SELECT SRC.*, 'TGT-SRC' compare_order FROM TABLEA SRC;
This gave me a result table which looks like the following (considering the sample data above):
COMPARE TABLE
+------------+----------+---------------+
| Name | Country | Compare_order |
+------------+----------+---------------+
| Ferrari | Italy | SRC-TGT |
| Mercedes | Germany | SRC-TGT |
| Chevrolet | USA | SRC-TGT |
| Chevrolet | America | TGT-SRC |
| Bugatti | Italy | TGT-SRC |
| Tata | India | TGT-SRC |
+------------+----------+---------------+
From here, I need to identify the reason for every row as to what did not match? Atleast, broadly, I would like to classify it into 3 categories:
Finally, I want my table to look like the below:
COMPARE_TABLE
+------------+----------+---------------+------------------------+
| Name | Country | Compare_order | Failure_Reason |
+------------+----------+---------------+------------------------+
| Ferrari | Italy | SRC-TGT | New at Source |
| Mercedes | Germany | SRC-TGT | New at Source |
| Chevrolet | USA | SRC-TGT | Country value mismatch |
| Chevrolet | America | TGT-SRC | Country value mismatch |
| Bugatti | Italy | TGT-SRC | New at Target |
| Tata | India | TGT-SRC | New at Target |
+------------+----------+---------------+------------------------+
How do I go about doing this? Can we even do this in SQL?
Any help is greatly appreciated.
You can do the basic comparison as:
select name, country,
(case when sum(which = 'src') > 0 and sum(which = 'tgt') then 'DROPPED'
when sum(which = 'src') = 0 and sum(which = 'tgt') then 'NEW'
end) as OP
from ((select 'src' as which, name, country
from tableA
) union all
(select 'tgt', name, country
from tableB
)
) ab
group by name, country;
But that doesn't give you column-by-column comparisons. That is a bit more difficult. Let me assume that Name
is unique, so it can be used as a key. The following does the comparison, but produces one row per name:
select name,
(case when src.country is null then dest.country
when tgt.country is null then tgt.country
when src.country = tgt.country then dest.country
else (src.country, '-->', tgt.country)
end) as country,
(case when src.country is null then 'new'
when tgt.country is null then 'dropped'
when src.country = tgt.country then 'same'
else 'changed'
end) as country,
from (select name from TableA union select name from TableB
) names left outer join
TableA src
on names.name = src.name left outer join
TableB tgt
on names.name = tgt.name;
Getting multiple rows for each name when the value in a column changes seems more difficult, although it too would be possible.