sql postgresql deduplication

SQL: how to select the row with most known values?


I have a table of users (username, gender, date_of_birth, zip) where the user's id is permanent, but a user may have registered many times in the past, sometimes filling out all the data and sometimes not. In addition, a user may have changed residence, in which case the zip changes.

So the query

SELECT username, sex, date_birth, zip FROM users_log WHERE username IN('user1', 'user2', 'user3')

returns the following result:

"user1";"M";"1982-10-04 00:00:00";"6320"
"user2";"";"";"1537"
"user3";"";"";"1537"
"user3";"";"";"1000"
"user3";"";"";"1000"
"user3";"";"1979-05-29 00:00:00";"1000"
"user3";"";"";"1537"
"user3";"";"1979-05-29 00:00:00";"1000"
"user1";"";"";"1000"
"user3";"";"";"1537"

In this case user1 has changed residence, so the zip code changed, and the second row that 'belongs' to him contains no demographic data. User3 also has multiple records, and only two of them contain demographic data.

What I would like to do is associate each user with the row that contains the most data about him, and take the zip from that row, i.e. the row with the most known values. Does anyone know how to write the appropriate query?

Thanks!


Solution

  • It's gonna be painful; very painful.

    Your question isn't clear about this issue, but I'm assuming that the 'user id' you're referring to is the user name. There are consequential modifications to make if that's wrong.

    As with any complex query, build it up in stages.

    Stage 1: How many non-null fields are there per record?

    SELECT username, sex, date_of_birth, zip,
           CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
           CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
           CASE WHEN zip           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
      FROM users_log
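
    One caveat: the sample output in the question shows "" for the missing values, and it is not clear whether those are NULLs rendered by the client or genuine empty strings. If they are empty strings, IS NULL will count them as known values. A minimal sketch of a variant that treats both the same, assuming sex and zip are text columns and date_of_birth is a date or timestamp (neither assumption is confirmed by the question):

    ```sql
    -- Assumption: sex and zip are text columns, so '' is a possible value.
    -- Assumption: date_of_birth is a date/timestamp, which cannot hold '',
    -- so it only needs the plain NULL check.
    SELECT username, sex, date_of_birth, zip,
           CASE WHEN NULLIF(sex, '') IS NULL THEN 0 ELSE 1 END +
           CASE WHEN date_of_birth   IS NULL THEN 0 ELSE 1 END +
           CASE WHEN NULLIF(zip, '') IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
      FROM users_log
    ```

    NULLIF(x, '') returns NULL when x is the empty string and x otherwise, so the rest of the counting logic stays unchanged.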
    

    Stage 2: Which is the maximum such number of fields for a given user name?

    SELECT username, MAX(num_non_null_fields) AS num_non_null_fields
      FROM (SELECT username, sex, date_of_birth, zip,
                   CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
                   CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
                   CASE WHEN zip           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
              FROM users_log
           ) AS u
     GROUP BY username
    

    Stage 3: Select (all) the rows for a given user with that maximal number of non-null fields:

    SELECT u.username, u.sex, u.date_of_birth, u.zip
      FROM (SELECT username, MAX(num_non_null_fields) AS num_non_null_fields
              FROM (SELECT username, sex, date_of_birth, zip,
                           CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
                           CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
                           CASE WHEN zip           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
                      FROM users_log
                   ) AS u
             GROUP BY username
           ) AS v
      JOIN (SELECT username, sex, date_of_birth, zip,
                   CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
                   CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
                   CASE WHEN zip           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
              FROM users_log
           ) AS u
        ON u.username = v.username AND u.num_non_null_fields = v.num_non_null_fields;
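
    The Stage 3 query repeats the Stage 1 sub-query twice. If your DBMS supports common table expressions (WITH), the same logic can probably be written more compactly. This is an untested sketch; the names counted and best are made up for illustration:

    ```sql
    -- Stage 1, written once: count the non-null fields per record.
    WITH counted AS (
        SELECT username, sex, date_of_birth, zip,
               CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
               CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
               CASE WHEN zip           IS NULL THEN 0 ELSE 1 END AS num_non_null_fields
          FROM users_log
    ),
    -- Stage 2: the maximum count per user name.
    best AS (
        SELECT username, MAX(num_non_null_fields) AS num_non_null_fields
          FROM counted
         GROUP BY username
    )
    -- Stage 3: all rows that reach the maximum for their user name.
    SELECT c.username, c.sex, c.date_of_birth, c.zip
      FROM counted AS c
      JOIN best    AS b
        ON c.username = b.username
       AND c.num_non_null_fields = b.num_non_null_fields;
    ```

    It should return the same rows as the nested version, ties included.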
    

    Now, if someone has multiple rows with (say) all three fields filled in, then all those rows will be returned. However, you've not specified any criteria by which to choose between those rows.
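
    If you do want exactly one row per user, one option is a window function with an explicit tie-breaker. The sketch below assumes a DBMS with ROW_NUMBER() support and a column recording when the row was created; registered_at is a hypothetical name, since the question shows no such column:

    ```sql
    SELECT username, sex, date_of_birth, zip
      FROM (SELECT username, sex, date_of_birth, zip,
                   ROW_NUMBER() OVER (
                       PARTITION BY username
                       ORDER BY (CASE WHEN sex           IS NULL THEN 0 ELSE 1 END +
                                 CASE WHEN date_of_birth IS NULL THEN 0 ELSE 1 END +
                                 CASE WHEN zip           IS NULL THEN 0 ELSE 1 END) DESC,
                                registered_at DESC  -- hypothetical tie-breaker: newest row wins
                   ) AS rn
              FROM users_log
           ) AS ranked
     WHERE rn = 1;
    ```

    Without a tie-breaker column, the choice among equally complete rows is arbitrary, which may or may not matter for picking the zip.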

    The basic techniques here can be adapted to any changed requirements. The key is to build and test the sub-queries as you go.

    None of this SQL has been near a DBMS; there could be bugs in it.

    You've not specified which DBMS you are using. Note that Oracle does not accept the AS keyword for table aliases, though it has no problem with AS on column aliases. If you're using any other DBMS, you shouldn't have to worry about that minor eccentricity.
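
    If the DBMS turns out to be PostgreSQL (the question's tags suggest so), there is a shorter route. This sketch relies on two PostgreSQL-specific features: DISTINCT ON, and the num_nonnulls() function, which is available from version 9.6. Like the rest of this answer, it has not been run:

    ```sql
    -- PostgreSQL only. Keeps one row per username: the first one in the
    -- ORDER BY, i.e. the row with the most non-null fields.
    SELECT DISTINCT ON (username)
           username, sex, date_of_birth, zip
      FROM users_log
     ORDER BY username,
              num_nonnulls(sex, date_of_birth, zip) DESC;
    ```

    Note that num_nonnulls() counts NULLs only, so empty strings still count as known values, and that ties between equally complete rows are broken arbitrarily unless you add another ORDER BY expression.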