Search code examples
sqloracleoracle12cstring-matching

How to select the most matching subString to another String


Lets say the full String is

The following example examines the string, looking for the first substring bounded by comas

and the subString is

substing bounded

is there any way that I could check the full string if contains a 90% matching subString using sql

like the word substing bounded and substring bounded in my example

the subString could be a compound of more words so I can't split the full string into words .


Solution

  • First transform your text in a table of words. You'll find a lot as posts to this topic on SO, e.g. here

    You'll have to adjust the list of delimiter characters to extract the words only.

    This is a sample query

     with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual  ),
          t2 as (select  rownum colnum from dual connect by level < 16 /* (max) number of words */),
          t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum)))  col  from t1, t2 
          where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
     select * from t3;
    
    COL      
    ----------
    The        
    following  
    example    
    examines
    ...
    

    In the next step your the Levenshtein Distance to get the closes word.

     with t1 as (select 1 rn, 'The following example examines the string, looking for the first substring bounded by comas' col from dual  ),
          t2 as (select  rownum colnum from dual connect by level < 16 /* (max) number of words */),
          t3 as (select t1.rn, t2.colnum, rtrim(ltrim(regexp_substr(t1.col,'[^ ,]+', 1, t2.colnum)))  col  from t1, t2 
          where regexp_substr(t1.col, '[^ ,]+', 1, t2.colnum) is not null)
     select col, str, UTL_MATCH.EDIT_DISTANCE(col, str)  distance
     from t3
     cross join (select 'commas' str from dual)
     order by 3;
    
    COL        STR      DISTANCE
    ---------- ------ ----------
    comas      commas          1 
    for        commas          5 
    examines   commas          6 
    ...
    

    Check the definition of the Levenshtein Distance and define a threshold on the distance to get your candidate words.

    To match independent of the word boundary simple scan through your input and get all substring in a lenth of your match string adjusted for the diferentce e.g. adding some 10%.

    You may limit the candidates by filtering such substrings only that start on the word boundary. The rest ist the same distance calculation.

     with txt as (select  'The following example examines the string, looking for the first substring bounded by comas' txt from dual),
          str as (select  'substing bounded' str from dual),
          t1 as (select  substr(txt, rownum, (select length(str) * 1.1 from str)) substr, /* add 10% length for the match */
                         (select str from str) str 
                 from txt connect by level < (select length(txt) from txt) - (select length(str) from str)) 
     select SUBSTR, STR, 
            UTL_MATCH.EDIT_DISTANCE(SUBSTR, STR)  distance
     from t1
     order by 3;
    
    SUBSTR               STR                DISTANCE
    -------------------- ---------------- ----------
    substring bounded    substing bounded          1 
    ubstring bounded     substing bounded          3 
     substring bounde    substing bounded          3 
    t substring bound    substing bounded          5 
    ...