Search code examples
sqlpostgresqldistinct-on

Get distinct information across many fields some of which are NULL


I have a table with just over 65 million rows and 140 columns. The data comes from several sources and is submitted at least every month.

I look for a quick way to grab specific fields from this data only where they are unique. Thing is, I want to process all the information to link which invoice was sent with which identifying numbers and it was sent by whom. Issue is, I don't want to iterate over 65 million records. If I can get distinct values, then I will only have to process say 5 million records as opposed to 65 million. See below for a description of the data and SQL Fiddle for a sample

If say a client submits an invoice_number linked to passport_number_1, national_identity_number_1 and driving_license_1 every month, I only want one row where this appears. i.e. the 4 fields have got to be unique

If they submit the above for 30 months then on the 31st month they send the invoice_number linked to passport_number_1, national_identity_number_2 and driving_license_1, I want to pick this row also since the national_identity field is new hence the whole row is unique

  • By linked to I mean they appear on the same row
  • For all fields its possible to have Null occurring at one point.
  • The 'pivot/composite' columns are the invoice_number and submitted_by. If any of those aren't there, drop that row
  • I also need to include the database_id with the above data. i.e. the primary_id which is auto generated by the postgresql database
  • The only fields that don't need to be returned are the other_column and yet_another_column. Remember the table has 140 columns so don't need them
  • With the results, create a new table that will hold this unique records

See this SQL fiddle for an attempt to recreate the scenario.

From that fiddle, I'd expect a result like:

  • Row 1, 2 & Row 11: Only one of them shall be kept as they are exactly the same. Preferably the row with the smallest id.
  • Row 4 and Row 9: One of them would be dropped as they are exactly the same.
  • Row 5, 7, & 8: Would be dropped since they are missing either the invoice_number or submitted_by.
  • The result would then have Row (1, 2 or 11), 3, (4 or 9), 6 and 10.

Solution

  • To get one representative row (with additional fields) from a group with the four distinct fields:

    SELECT 
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
    ;
    

    Note that it is unpredictable which row exactly is returned unless you specify an ordering (documentation on distinct)

    Edit:

    To order this result by id simply adding order by id to the end doesn't work, but it can be done by eiter using a CTE

    with distinct_rows as (
        SELECT 
        distinct on (
          invoice_number
          , passport_number
          , national_id_number
          , driving_license_number
          -- ...
        )
          * -- specify the columns you want here
        FROM my_table
        where invoice_number is not null
        and submitted_by is not null
    )
    select *
    from distinct_rows
    order by id;
    

    or making the original query a subquery

    select *
    from (
        SELECT 
        distinct on (
          invoice_number
          , passport_number
          , national_id_number
          , driving_license_number
          -- ...
        )
          * -- specify the columns you want here
        FROM my_table
        where invoice_number is not null
        and submitted_by is not null
    ) t
    order by id;