Search code examples
mysqlsqlnullsql-viewlocf

Removing NULLs in sequential data - MYSQL


I have a database for tracking claims payments. There's a table for claims claim, a table for monthly payments claim_month and a table defining each month month. month has each entry in order so that if month_id[1] > month_id[2] then the second figure is earlier than the first figure.

Using the query (the randomisation of paid_to_date is added for privacy purposes):

SELECT
claim.claim_id,
m.month_id,
claim_month_id,
IF (claim_month.paid_to_date IS NOT NULL, ROUND(RAND(1) * 100), NULL) AS paid_to_date
FROM
    claim
    INNER JOIN ( SELECT DISTINCT month_id FROM claim_month ) AS m
    LEFT JOIN claim_month ON claim.claim_id = claim_month.claim_id 
    AND m.month_id = claim_month.month_id

I get the following data.

INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (25, 1004, 8584, 41);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (25, 1005, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (25, 1006, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (25, 1007, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (21, 1004, 8580, 87);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (21, 1005, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (21, 1006, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (21, 1007, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (5, 1004, 8564, 14);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (5, 1005, 8627, 9);

Data visualised

From here, I need to replace NULLs with the latest non-null observation for each claim_id.

  • Since I'm using MariaDB/MYSQL, the LAG function doesn't allow for ignoring NULLs which is unfortunate since it appears to be perfect.

  • I've also looked into using COALESCE and partitioning it, but that doesn't seem to be allowed either.

  • I've also looked into using user defined functions however I'm using multiple data types and can't seem to work out how to define a function that doesn't require setting the output data type.

I've spent the whole morning looking through previous questions however most of them are for PostgresSQL which isn't particularly helpful in this context. What am I missing?


Solution

  • I've worked out a solution, but I'm not convinced it's the best. I suspect that for larger databases, this would be quite demanding. It works in the meantime however.

    I've essentially joined the table onto itself repeatedly wherever a record is earlier and on the same claim using something similar to the following:

    SELECT 
        b.claim_id,
        b.month_id,
        b.claim_month_id,
        claim_month.claim_month_id AS claim_month_id_latest
    
    FROM
    
    (SELECT
        a.claim_id,
        a.month_id,
        a.claim_month_id,
        MAX(claim_month.month_id) AS source_month_id
    
    FROM
        (
        SELECT
            claim.claim_id,
            m.month_id,
            claim_month_id
        FROM
            claim
            INNER JOIN ( SELECT DISTINCT month_id FROM claim_month ) AS m
            LEFT JOIN claim_month ON claim.claim_id = claim_month.claim_id 
            AND m.month_id = claim_month.month_id 
    
        ) AS a
        LEFT JOIN claim_month ON a.claim_id = claim_month.claim_id 
                                                    AND a.month_id >= claim_month.month_id
    
    GROUP BY
        a.claim_id, a.month_id) AS b
        LEFT JOIN claim_month ON b.claim_id = claim_month.claim_id AND b.source_month_id = claim_month.month_id
    
    
    ORDER BY b.claim_id, b.month_id