I have a database for tracking claims payments. There's a table for claims claim
, a table for monthly payments claim_month
and a table defining each month month
. month
has each entry in order so that if month_id[1] > month_id[2]
then the second figure is earlier than the first figure.
Using the query (the randomisation of paid_to_date
is added for privacy purposes):
SELECT
claim.claim_id,
m.month_id,
claim_month_id,
IF (claim_month.paid_to_date IS NOT NULL, ROUND(RAND(1) * 100), NULL) AS paid_to_date
FROM
claim
INNER JOIN ( SELECT DISTINCT month_id FROM claim_month ) AS m
LEFT JOIN claim_month ON claim.claim_id = claim_month.claim_id
AND m.month_id = claim_month.month_id
I get the following data.
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (25, 1004, 8584, 41);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (25, 1005, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (25, 1006, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (25, 1007, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (21, 1004, 8580, 87);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (21, 1005, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (21, 1006, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (21, 1007, NULL, NULL);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (5, 1004, 8564, 14);
INSERT INTO ``(`claim_id`, `month_id`, `claim_month_id`, `paid_to_date`) VALUES (5, 1005, 8627, 9);
From here, I need to replace NULLs with the latest non-null observation for each claim_id
.
Since I'm using MariaDB/MYSQL, the LAG function doesn't allow for ignoring NULLs which is unfortunate since it appears to be perfect.
I've also looked into using COALESCE and partitioning it, but that doesn't seem to be allowed either.
I've also looked into using user defined functions however I'm using multiple data types and can't seem to work out how to define a function that doesn't require setting the output data type.
I've spent the whole morning looking through previous questions however most of them are for PostgresSQL which isn't particularly helpful in this context. What am I missing?
I've worked out a solution, but I'm not convinced it's the best. I suspect that for larger databases, this would be quite demanding. It works in the meantime however.
I've essentially joined the table onto itself repeatedly wherever a record is earlier and on the same claim using something similar to the following:
SELECT
b.claim_id,
b.month_id,
b.claim_month_id,
claim_month.claim_month_id AS claim_month_id_latest
FROM
(SELECT
a.claim_id,
a.month_id,
a.claim_month_id,
MAX(claim_month.month_id) AS source_month_id
FROM
(
SELECT
claim.claim_id,
m.month_id,
claim_month_id
FROM
claim
INNER JOIN ( SELECT DISTINCT month_id FROM claim_month ) AS m
LEFT JOIN claim_month ON claim.claim_id = claim_month.claim_id
AND m.month_id = claim_month.month_id
) AS a
LEFT JOIN claim_month ON a.claim_id = claim_month.claim_id
AND a.month_id >= claim_month.month_id
GROUP BY
a.claim_id, a.month_id) AS b
LEFT JOIN claim_month ON b.claim_id = claim_month.claim_id AND b.source_month_id = claim_month.month_id
ORDER BY b.claim_id, b.month_id