I have a table with duplicates and I want to partition by the ID and select only the row with the most information (where the most fields contain values).
+----+------+------+-------+---------+-------+
| ID | Name | City | Zip | Address | Phone |
+----+------+------+-------+---------+-------+
| 1 | Joe | | | | |
| 1 | Joe | DC | 11111 | | |
| 2 | Pete | NY | | | |
| 2 | Pete | NY | 10000 | | 202- |
| 3 | Max | | | | |
| 3 | Max | | | | |
| 4 | Sean | MIA | | | |
| 4 | Sean | MIA | | 1 blvd | |
| 4 | Sean | | 12345 | | 305- |
| | | | | | |
+----+------+------+-------+---------+-------+
This is my goal:
+----+------+------+-------+---------+-------+---------+
| ID | Name | City | Zip | Address | Phone | Row_num |
+----+------+------+-------+---------+-------+---------+
| 1 | Joe | DC | 11111 | | | 1 |
| 2 | Pete | NY | 10000 | | 202- | 1 |
| 3 | Max | | | | | 1 |
| 4 | Sean | MIA | | 1 blvd | | 1 |
| | | | | | | |
+----+------+------+-------+---------+-------+---------+
For Joe it is obvious that I want the second row with where the city and zip information is given.
For Pete I also want to display the second record because it contains more information.
For Max it doesn't matter what row I choose because both records have the same values.
For Sean I can either take the second or third row because the second record has 3 fields which contain value (name, city, address) and the third record has also three filled fields (name, zip, phone). So It doesn't matter which record I want to select from Sean.
How can I partition my table and select the row with the most information about each person?
If the columns are all strings, you can simplify the logic using apply
:
select t.*
from (select t.*,
row_number() over (partition by t.id order by v.cnt desc) as seqnum
from t cross apply
(select count(*)
from (values (name), (city), (zip), (address), (phone)) v(col)
where col is not null
) v(cnt)
) t
where seqnum = 1;
If you want to adapt this for empty strings, you can change the where
to where col is not null and col <> ''
.