Search code examples
mysqlmariadb

How can I group results where string has the same 6 characters from a total of 7 in MySQL/MariaDB?


I have a table that consists of vehicle plates, where all the plates have always 7 characters. Sometimes a plate was miswritten and so my SELECT query must be able to consider that "AAU1234" and "AAV1234" must be the same vehicle. I don't want to create rules between specific characters, like "U" and "V" or "I" and "1", but a rule able to group plates where 6 of the 7 characters are equal and in the same position.

eg.: example of similar plates

In this case, ids 1, 2 and 5 should appear only once.

It would be great if, when grouped, all the grouped plates were informed, concatened in another column.

More information: This is a big table and currently identical plates are grouped when inside the same date-time range of 15 minutes.

My query looks like this:

select plate, floor(unix_timestamp(date)/(15 * 60)) as timekey 
from table 
group by plate, timekey 
order by date desc

Following the image example above, my goal is to group ids 1, 2 and 5 in the same row, since id 1 and id 2 matches 6 of 7 strings and so id 1 and id 5.

The result could be something like:

desirable result

Or:

desirable result

The final date information is not important here, the most important thing here is to group the similar plates.

Update1 Questions and answers on "staging ground":

RafaelZG
This table receives data from a plate recognition camera software and I have a front end that shows all the entries, grouping identical plates in the same time range, because a vehicle can have 1, 2 or 10 photos. The problem is that sometimes the software can't read a characater correctly and it generates separate entries in my front end, so my goal is to reduce this problem as much as possible, grouping these plates where only 1 character was wrongly registered. This rule doesn't need to solve 100% of the wrong entries.

Shadow
Row with id=5 differs by only one character from id=1, but differs by 2 characters from row with id=2. So, these rows would only be grouped together if the starting row is id=1. How do you determine which row is the starting one from grouping perspective? Btw, you should look at levenshtein distance, but this question is a way more complicated than you think and mysql may not be the best tool to accomplish your task!

RafaelZG
@shadow Great point! Things start to get even more complicated, but since ids 1 and 2 are grouped, and id 5 has 6 char matches with id 1, it should be also added to the group. The limiting point should be a date range, so the query would need run the entire table infinitely.

RafaelZG
I'll try to answer again more objectively: if id 1 and id 2 are grouped and id 5 is 6-char similar to id 1, then all the 3 should be grouped. eg: ABC1234, ABG1234, ABO1234, ABOI234, ABOI284 should be all grouped, since each of them is 6-char-similar to at least another one.


Solution

  • See example with test data. The data is somewhat expanded compared to the examples in the question.

    create table test (id int, plate varchar(7));
    insert into test values
     (1,'AAU1234')
    ,(2,'AAV1234')
    ,(3,'BKP5678')
    ,(4,'CMD9081')
    ,(5,'A4U1234')
    ,(6,'ABC1234')
    ,(7,'ABG1234')
    ,(8,'ABO1234')
    ,(9,'ABOI234')
    ,(10,'ABOI284')
    ,(11,'ABGI234')
    ,(12,'ABGI284')
    ,(14,'CMD9031')
    ;
    

    First, let's simple compare plate's and find the rows, where 6 characters match. In JOIN condition and (t1.s1=t2.s1 or t1.s2=t2.s2) - at least 1 must match.
    Wait for matchn=6

    with recursive
    tn as(
      select *,substring(plate,1,1) s1,substring(plate,2,1) s2,substring(plate,3,1) s3
        ,substring(plate,4,1) s4,substring(plate,5,1) s5,substring(plate,6,1) s6,substring(plate,7,1) s7
      from test
    )
    ,cmp as(
    select t1.id,t1.plate,t2.id id2,t2.plate plate2
    select t1.id,t1.plate,t2.id id2,t2.plate plate2
      ,case when (t1.s1=t2.s1) then 1 else 0 end +case when (t1.s2=t2.s2) then 1 else 0 end
      +case when (t1.s3=t2.s3) then 1 else 0 end +case when (t1.s4=t2.s4) then 1 else 0 end
      +case when (t1.s5=t2.s5) then 1 else 0 end +case when (t1.s6=t2.s6) then 1 else 0 end
      +case when (t1.s7=t2.s7) then 1 else 0 end matchn
    from tn t1
    left join tn t2 on t1.id<>t2.id and (t1.s1=t2.s1 or t1.s2=t2.s2)
    )
    

    Output (part of result)

    id plate id2 plate2 matchn
    1 AAU1234 2 AAV1234 6
    1 AAU1234 5 A4U1234 6
    4 CMD9081 14 CMD9031 6
    6 ABC1234 7 ABG1234 6
    6 ABC1234 8 ABO1234 6
    7 ABG1234 8 ABO1234 6
    7 ABG1234 11 ABGI234 6
    8 ABO1234 9 ABOI234 6
    9 ABOI234 10 ABOI284 6
    9 ABOI234 11 ABGI234 6
    10 ABOI284 12 ABGI284 6
    11 ABGI234 12 ABGI284 6

    This table can be considered as a description of a graph.
    The resulting graph is not oriented because we have set the condition (t1.id <> t2.id ).

           (6)ABC1234                       (1)AAU1234
            /       \                          /  \
          C->G      C->O                     A->4 U->V
          /           \                      /      \
    (7)ABG1234--G->O--(8)ABO1234            /   (2)AAV1234   
                             \         (5)A4U1234
                             1->I
                               \
                          (9)ABOI234                  (4)CMD9081
                              /    \                          |
                            O->G   3->8                      8->3
                            /        \                        |
                           / (10)ABOI284             (14)CMD9031
                          /         \
                         /          O->G
                        /              \ 
                 (11)ABGI234-3->8-(12)ABGI284
    
    

    Then we recursively traverse the directed graph to find all nodes starting from the vertex of the graph.

    ,r as(
      select distinct 0 lvl,id,plate p0,plate,id id2,plate plate2
        ,cast('' as char(1000)) as path
      from cmp
      where  matchn=6 
    
      union all
    
      select lvl+1,r.id,p0,r.plate2,t.id2 id2,t.plate2
        ,concat(r.path,',',r.plate) as path
      from r inner join cmp t on t.plate=r.plate2 -- find_in_set(t.plate,r.list)>0
      where t.matchn=6 
        and  find_in_set(t.plate,r.path)=0 
        -- and lvl<9 -- for debug only
    )
    ,d as(
      select distinct id,p0,id2,plate2
      from r
    )
    ,chains as(
      select min(id2) groupid,id,p0 plate
         ,group_concat(id2 order by id2) ids
         ,group_concat(plate2 order by id2) plates
      from d
      group by id,p0
    )
    

    Chains output

    groupid id plate ids plates
    1 1 AAU1234 1,2,5 AAU1234,AAV1234,A4U1234
    1 2 AAV1234 1,2,5 AAU1234,AAV1234,A4U1234
    1 5 A4U1234 1,2,5 AAU1234,AAV1234,A4U1234
    4 4 CMD9081 4,14 CMD9081,CMD9031
    4 14 CMD9031 4,14 CMD9081,CMD9031
    6 6 ABC1234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
    6 7 ABG1234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
    6 8 ABO1234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
    6 9 ABOI234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
    6 10 ABOI284 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
    6 11 ABGI234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
    6 12 ABGI284 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284

    Now we can merge the rows or just assign a group to each row.

    ,newGr as(
    select t.*,coalesce(c.groupid,t.id) groupid
      ,c.ids,c.plates 
    from test t
    left join chains c on t.id=c.id
    )
    --   select * from cmp order by id,id2;
    --   select * from r order by id,lvl,id2;
    --  select * from d order by id,id2;
    --  select * from chains order by groupid;
    --  select * from newGr order by id;
    select groupid,min(id) id -- ,grPlate
      ,group_concat(id) ids
      ,group_concat(plate) plates
    from newGr
    group by groupid
    

    Output

    id grPlate ids plates
    1 AAU1234 1,2,5 AAU1234,AAV1234,A4U1234
    6 ABC1234 6,7,8,9,10,11,12 ABC1234,ABG1234,ABO1234,ABOI234,ABOI284,ABGI234,ABGI284
    3 BKP5678 3 BKP5678
    4 CMD9081 4,14 CMD9081,CMD9031

    New groups for rows

    id plate grPlate
    1 AAU1234 AAU1234
    2 AAV1234 AAU1234
    3 BKP5678 BKP5678
    4 CMD9081 CMD9081
    5 A4U1234 AAU1234
    6 ABC1234 ABC1234
    7 ABG1234 ABC1234
    8 ABO1234 ABC1234
    9 ABOI234 ABC1234
    10 ABOI284 ABC1234
    11 ABGI234 ABC1234
    12 ABGI284 ABC1234
    14 CMD9031 CMD9081

    Demo
    and old Demo