I have a table that consists of vehicle plates, where all the plates have always 7 characters. Sometimes a plate was miswritten and so my SELECT query must be able to consider that "AAU1234" and "AAV1234" must be the same vehicle. I don't want to create rules between specific characters, like "U" and "V" or "I" and "1", but a rule able to group plates where 6 of the 7 characters are equal and in the same position.
In this case, ids 1, 2 and 5 should appear only once.
It would be great if, when grouped, all the grouped plates were informed, concatened in another column.
More information: This is a big table and currently identical plates are grouped when inside the same date-time range of 15 minutes.
My query looks like this:
select plate, floor(unix_timestamp(date)/(15 * 60)) as timekey
from table
group by plate, timekey
order by date desc
Following the image example above, my goal is to group ids 1, 2 and 5 in the same row, since id 1 and id 2 matches 6 of 7 strings and so id 1 and id 5.
The result could be something like:
Or:
The final date information is not important here, the most important thing here is to group the similar plates.
Update1 Questions and answers on "staging ground":
RafaelZG
This table receives data from a plate recognition camera software and I have a front end that shows all the entries, grouping identical plates in the same time range, because a vehicle can have 1, 2 or 10 photos. The problem is that sometimes the software can't read a characater correctly and it generates separate entries in my front end, so my goal is to reduce this problem as much as possible, grouping these plates where only 1 character was wrongly registered. This rule doesn't need to solve 100% of the wrong entries.
Shadow
Row with id=5 differs by only one character from id=1, but differs by 2 characters from row with id=2. So, these rows would only be grouped together if the starting row is id=1. How do you determine which row is the starting one from grouping perspective? Btw, you should look at levenshtein distance, but this question is a way more complicated than you think and mysql may not be the best tool to accomplish your task!
RafaelZG
@shadow Great point! Things start to get even more complicated, but since ids 1 and 2 are grouped, and id 5 has 6 char matches with id 1, it should be also added to the group. The limiting point should be a date range, so the query would need run the entire table infinitely.
RafaelZG
I'll try to answer again more objectively: if id 1 and id 2 are grouped and id 5 is 6-char similar to id 1, then all the 3 should be grouped. eg: ABC1234, ABG1234, ABO1234, ABOI234, ABOI284 should be all grouped, since each of them is 6-char-similar to at least another one.
See example with test data. The data is somewhat expanded compared to the examples in the question.
create table test (id int, plate varchar(7));
insert into test values
(1,'AAU1234')
,(2,'AAV1234')
,(3,'BKP5678')
,(4,'CMD9081')
,(5,'A4U1234')
,(6,'ABC1234')
,(7,'ABG1234')
,(8,'ABO1234')
,(9,'ABOI234')
,(10,'ABOI284')
,(11,'ABGI234')
,(12,'ABGI284')
,(14,'CMD9031')
;
First, let's simple compare plate
's and find the rows, where 6 characters match. In JOIN condition and (t1.s1=t2.s1 or t1.s2=t2.s2)
- at least 1 must match.
Wait for matchn=6
with recursive
tn as(
select *,substring(plate,1,1) s1,substring(plate,2,1) s2,substring(plate,3,1) s3
,substring(plate,4,1) s4,substring(plate,5,1) s5,substring(plate,6,1) s6,substring(plate,7,1) s7
from test
)
,cmp as(
select t1.id,t1.plate,t2.id id2,t2.plate plate2
select t1.id,t1.plate,t2.id id2,t2.plate plate2
,case when (t1.s1=t2.s1) then 1 else 0 end +case when (t1.s2=t2.s2) then 1 else 0 end
+case when (t1.s3=t2.s3) then 1 else 0 end +case when (t1.s4=t2.s4) then 1 else 0 end
+case when (t1.s5=t2.s5) then 1 else 0 end +case when (t1.s6=t2.s6) then 1 else 0 end
+case when (t1.s7=t2.s7) then 1 else 0 end matchn
from tn t1
left join tn t2 on t1.id<>t2.id and (t1.s1=t2.s1 or t1.s2=t2.s2)
)
Output (part of result)
id | plate | id2 | plate2 | matchn |
---|---|---|---|---|
1 | AAU1234 | 2 | AAV1234 | 6 |
1 | AAU1234 | 5 | A4U1234 | 6 |
4 | CMD9081 | 14 | CMD9031 | 6 |
6 | ABC1234 | 7 | ABG1234 | 6 |
6 | ABC1234 | 8 | ABO1234 | 6 |
7 | ABG1234 | 8 | ABO1234 | 6 |
7 | ABG1234 | 11 | ABGI234 | 6 |
8 | ABO1234 | 9 | ABOI234 | 6 |
9 | ABOI234 | 10 | ABOI284 | 6 |
9 | ABOI234 | 11 | ABGI234 | 6 |
10 | ABOI284 | 12 | ABGI284 | 6 |
11 | ABGI234 | 12 | ABGI284 | 6 |
This table can be considered as a description of a graph.
The resulting graph is not oriented because we have set the condition (t1.id <> t2.id ).
(6)ABC1234 (1)AAU1234
/ \ / \
C->G C->O A->4 U->V
/ \ / \
(7)ABG1234--G->O--(8)ABO1234 / (2)AAV1234
\ (5)A4U1234
1->I
\
(9)ABOI234 (4)CMD9081
/ \ |
O->G 3->8 8->3
/ \ |
/ (10)ABOI284 (14)CMD9031
/ \
/ O->G
/ \
(11)ABGI234-3->8-(12)ABGI284
Then we recursively traverse the directed graph to find all nodes starting from the vertex of the graph.
,r as(
select distinct 0 lvl,id,plate p0,plate,id id2,plate plate2
,cast('' as char(1000)) as path
from cmp
where matchn=6
union all
select lvl+1,r.id,p0,r.plate2,t.id2 id2,t.plate2
,concat(r.path,',',r.plate) as path
from r inner join cmp t on t.plate=r.plate2 -- find_in_set(t.plate,r.list)>0
where t.matchn=6
and find_in_set(t.plate,r.path)=0
-- and lvl<9 -- for debug only
)
,d as(
select distinct id,p0,id2,plate2
from r
)
,chains as(
select min(id2) groupid,id,p0 plate
,group_concat(id2 order by id2) ids
,group_concat(plate2 order by id2) plates
from d
group by id,p0
)
Chains output
groupid | id | plate | ids | plates |
---|---|---|---|---|
1 | 1 | AAU1234 | 1,2,5 | AAU1234,AAV1234,A4U1234 |
1 | 2 | AAV1234 | 1,2,5 | AAU1234,AAV1234,A4U1234 |
1 | 5 | A4U1234 | 1,2,5 | AAU1234,AAV1234,A4U1234 |
4 | 4 | CMD9081 | 4,14 | CMD9081,CMD9031 |
4 | 14 | CMD9031 | 4,14 | CMD9081,CMD9031 |
6 | 6 | ABC1234 | 6,7,8,9,10,11,12 | ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284 |
6 | 7 | ABG1234 | 6,7,8,9,10,11,12 | ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284 |
6 | 8 | ABO1234 | 6,7,8,9,10,11,12 | ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284 |
6 | 9 | ABOI234 | 6,7,8,9,10,11,12 | ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284 |
6 | 10 | ABOI284 | 6,7,8,9,10,11,12 | ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284 |
6 | 11 | ABGI234 | 6,7,8,9,10,11,12 | ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284 |
6 | 12 | ABGI284 | 6,7,8,9,10,11,12 | ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284 |
Now we can merge the rows or just assign a group to each row.
,newGr as(
select t.*,coalesce(c.groupid,t.id) groupid
,c.ids,c.plates
from test t
left join chains c on t.id=c.id
)
-- select * from cmp order by id,id2;
-- select * from r order by id,lvl,id2;
-- select * from d order by id,id2;
-- select * from chains order by groupid;
-- select * from newGr order by id;
select groupid,min(id) id -- ,grPlate
,group_concat(id) ids
,group_concat(plate) plates
from newGr
group by groupid
Output
id | grPlate | ids | plates |
---|---|---|---|
1 | AAU1234 | 1,2,5 | AAU1234,AAV1234,A4U1234 |
6 | ABC1234 | 6,7,8,9,10,11,12 | ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284 |
3 | BKP5678 | 3 | BKP5678 |
4 | CMD9081 | 4,14 | CMD9081,CMD9031 |
New groups for rows
id | plate | grPlate |
---|---|---|
1 | AAU1234 | AAU1234 |
2 | AAV1234 | AAU1234 |
3 | BKP5678 | BKP5678 |
4 | CMD9081 | CMD9081 |
5 | A4U1234 | AAU1234 |
6 | ABC1234 | ABC1234 |
7 | ABG1234 | ABC1234 |
8 | ABO1234 | ABC1234 |
9 | ABOI234 | ABC1234 |
10 | ABOI284 | ABC1234 |
11 | ABGI234 | ABC1234 |
12 | ABGI284 | ABC1234 |
14 | CMD9031 | CMD9081 |