Background:
First, I have a schema with a table named patients
with which I am concerned with patient_id
and alerts
(alerts is a string of characters, where each character represents some arbitrary value/meaning). Second, each 'patient' is part of a group [family] based solely on the first 6 digits on their patient_id. Also, there are a few third parties that rely on this database; I did not design nor can I change this schema/datamodel and I cannot migrate away from MySQL.
Here is a fiddle with the data model
Challenge:
Now, I need to find occurrences where patients have alerts containing !
, @
, #
, %
, ^
, or &
symbols and their family members do not. My first thought was to gather all patients that have alerts containing these symbols, remove the last digit in each patient_id and then group by this value. Now I have a list of (for all intents and purposes) 'group_ids.' Finally, I need to extend the list to contain each of the groups family members and their respective alerts strings.
Here is what I have so far:
Query #1:
SELECT p.patient_id, p.name_first, p.name_last, p.alerts
FROM patients p
INNER JOIN (SELECT SUBSTRING(patient_id, 1, CHAR_LENGTH(patient_id) - 1) AS group_id
FROM patients
WHERE patient_id BETWEEN 1000000 AND 7999999
AND (alerts like '%!%'
OR alerts like '%@%'
OR alerts like '%#%'
OR alerts like '%\%%'
OR alerts like '%^%'
OR alerts like '%&%')
GROUP BY group_id) g
ON p.patient_id LIKE CONCAT(g.group_id, '%')
ORDER BY p.patient_id
LIMIT 30000;
Fiddle ~ NOTICE: fiddle is not an accurate representation of the problem as included table only has 28 records.
Recordset: 80,000 ~ Results: 2188 ~ Duration: 14.321 sec ~ Fetch: 0.00 sec ~ Total: 14.321 sec
Query #2:
SELECT p.patient_id, p.name_first, p.name_last, p.alerts
FROM patients p
JOIN (SELECT DISTINCT LEFT(patient_id, 6) AS group_id
FROM patients
WHERE patient_id BETWEEN 1000000 AND 7999999
AND alerts REGEXP '[!@#%^&]') g
ON p.patient_id LIKE CONCAT(g.group_id, '%')
ORDER BY p.patient_id
LIMIT 30000;
Fiddle ~ NOTICE: fiddle is not an accurate representation of the problem as included table only has 28 records.
Recordset: 80,000 ~ Results: 2188 ~ Duration: 4.259 sec ~ Fetch: 5.663 sec ~ Total: 9.992 sec
EDIT: after adding name_first, name_last, alerts and order by clause, I found this query took exactly the same amount of time as the first.
Problem:
The list that I get back is accurate but, not only will it take extra processing (I plan to do this with PHP), but also it takes 14 seconds!
If anyone has a better... or can at least point in the direction of a better and more efficient solution, please enlighten me. Thanks in advance.
Extra Credit: any tips on the PHP algo to solve the aforementioned problem with the given data - forget semantics, just a formula will do.
I found an efficient enough solution as follows:
SELECT p.patient_id, name_first, name_last, alerts
FROM patients p
JOIN (SELECT DISTINCT LEFT(patient_id, 6) AS group_id
FROM patients
WHERE patient_id BETWEEN 1000000 AND 7999999
AND alerts REGEXP '[!@#%^&]') g
ON LEFT(p.patient_id, 6) = g.group_id /* HERE is the simple magic */
ORDER BY p.patient_id
LIMIT 30000;
Recordset: 80,000 ~ Results: 2188 ~ Duration: 0.312 sec ~ Fetch: 0.062 sec ~ Total: 0.374 sec
Since we know that legitimate patient_ids are 7 digits long, we can determine a patient's 'group_id' by simply using LEFT(patient_id, 6)
instead of the less efficient SUBSTRING(patient_id, 1, CHAR_LENGTH(patient_id) - 1)
(which I now see I could've originally written as SUBSTRING(patient_id, 1, 6)
). Regardless of the method used here, the real savings is with the change to the ON
clause. Instead of comparing patient_id to LIKE CONCAT(group_id, '%')
, why not just make a direct =
comparison to the LEFT 6 digits of patient_id in Table 'p'?
In other words, the nested select is used to find all unique 'groups' where at least one member has one of the desired alert symbols. The main select uses this table to determine all patients belonging to those groups. Essentially, LEFT(patient_id, 6) is == 'group_id'
and we get to keep our index... the only extra overhead is with one LEFT() call per row.
Yet another example of "KISS."
Thank you to everyone for your help!
EDIT: Since I will be using the group_id in my PHP algo, I will add it into the select in an efficient manner:
SELECT g.group_id, RIGHT(p.patient_id, 1) AS sub_id, name_first, name_last, alerts
FROM patients p
JOIN (SELECT DISTINCT LEFT(patient_id, 6) AS group_id
FROM patients
WHERE patient_id BETWEEN 1000000 AND 7999999
AND alerts REGEXP '[!@#%^&]') g
ON LEFT(p.patient_id, 6) = g.group_id
ORDER BY p.patient_id
LIMIT 30000;
HERE is a fiddle! ~ NOTICE: this is not an accurate representation of the solution as included table only has 28 records. See above results on larger dataset.
AND finally, the PHP algo I used to finish processing ~ shoutout to @The Nail:
$cur_group_id = 0;
$members = [];
$symbol = '';
$errs = false;
while($row = $result->fetch_assoc()){
$row['alerts'] = preg_replace('/[^!@#%^&]+/i', '', $row['alerts']);
if($row['group_id'] != $cur_group_id){
if($errs){
foreach($members as $member => $data){
printf('<tr><td>%d%d</td><td>%s</td><td>%s</td><td>%s</td></tr>',
$data['group_id'],
$data['sub_id'],
$data['name_last'],
$data['name_first'],
$data['alerts']);
}
}
/* reset current group */
$cur_group_id = $row['group_id'];
$members = array();
$symbol = $row['alerts'];
$errs = false;
}
$members[] = $row;
if($row['alerts'] != $symbol || strlen($row['alerts']) > 1){
$errs = true;
}
}
Total processing time (including query): .6 sec!!