oracle-database function plsql ref-cursor

Optimal ORACLE function parsing

All,

Whilst I’m not new to SQL it’s been a long long time and I’m very new to Oracle. Please note also I know that the way I have done this is probably the slowest possible solution, but as I said I’m new to this and it worked on the 1000 or so Items I needed to test. The problem is that it DID work so I want to expand this to test 1,000,000’s and I know this is a donkey not a race horse.

Had a small task to “Standardise” a list of company names, so I created a table with known abbreviations

Col1        Col2    Col3    Col4    ….  Col14

Company Co  Cmpy
Limited     Ltd Lim LMT
Etc

I then wrote a function as to select this table into a cursor, and loop through it for each name sent to it and replace any “NON” standard abbreviations with that in Col 1, using the following:

What I have been thinking about is re-writing the code (obviously) and putting it into a package as I believe that that should result in considerable speed and clarity compared to now.

The problem I have is not really improving the looping part of the function but what I need to do from a Select that calls this function to ensure it loaded the STD_GBR_CO_SUFFIX only once and re-uses the cursor.

I've been reading some of the O'Reilly and I think that what I want to do is possible, I just can't figure out how.

I think I need to put this in a package, and use a REF Cursor, and BULK LOAD looks a contender, but the more I read the more confused I get. If someone out there could just could just point me in the right direction I can carry on from there. What I don't want is for someone to just write the solution I want to learn how to do this better not have it done for me.

Thank you all for your assistance in advance.

So if I do a

Select Standardise_Company_Suffix(company_name) AS STD_Company_Name
From VERY_LARGE_TABLE

CREATE OR REPLACE Function Standardise_Company_Suffix(Co_Name IN VARCHAR2)
RETURN varchar2 IS
  stg_Co_Name varchar2(400 byte);
  fmt_S varchar2(20 byte);
  fmt_E varchar2(20 byte);
  parse_Str varchar2(2400 byte);
  parse_char varchar2(4 byte);

CURSOR C1 IS
-- Populate Cursor with Table of Suffixes
select * from STD_GBR_CO_SUFFIX;

BEGIN

parse_char := "s";

Fmt_S := '(^|\'||parse_char||')';

Fmt_E := '(\'||parse_char||'|$)';

stg_Co_Name := upper(co_name);

parse_str := ' ';

   FOR c1pass IN C1 LOOP

   parse_str :='';

   If c1pass.column14 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column14)||Fmt_E;   End If;

   If c1pass.column13 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column13)||Fmt_E;   End If;

   If c1pass.column12 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column12)||Fmt_E;   End If;

   If c1pass.column11 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column11)||Fmt_E;   End If;

   If c1pass.column10 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column10)||Fmt_E;   End If;

   If c1pass.column9 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column9)||Fmt_E;     End If;

   If c1pass.column8 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column8)||Fmt_E;     End If;

   If c1pass.column7 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column7)||Fmt_E;     End If;

   If c1pass.column6 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column6)||Fmt_E;     End If;

   If c1pass.column5 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column5)||Fmt_E;     End If;

   If c1pass.column4 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column4)||Fmt_E;     End If;

   If c1pass.column3 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column3)||Fmt_E;     End If;

   If c1pass.column2 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column2)||Fmt_E;     End If;

   Parse_str := substr(parse_str,2);

   If regexp_instr(stg_Co_Name,parse_str) <> 0 Then

            stg_Co_Name := regexp_REPLACE(stg_Co_Name,parse_str,
                            ' $'||UPPER(c1pass.column1||'$ '));

   Else

            stg_Co_Name := stg_Co_Name;

    End if;    

 END LOOP;

return stg_Co_Name; 

End;

/

Solution

This isn't really the kind of direction you were asking for, but if possible I'd forget about the function - which will at least incur context switching, but probably quite a lot of other overhead - and try to do something in native SQL. I'd also change your look-up table as that isn't very flexible to have multiple columns. I'd have a single standard-abbreviation pair in each row:

STD_VALUE            ABBREVIATION
-------------------- ------------
Bar                  Ba           
Company              Co           
Company              Cmpy         
Foo                  Fo           
Limited              Ltd          
Limited              Lim          
Limited              LMT

This seems to work, at least for some basic test cases, but you would need to be on 11gR2, so you have both the listagg function and recursive subquery factoring (a.k.a recursive CTE or recursive with). I very much applaud the sentiment that you don't want someone to just write the solution, but I had fun figuring it out and it seems a shame to waste it. I've tried to explain what's going on below; hopefully you can learn from that still. And it will probably need some modification so you'll need to understand it...

with p as (
  select ' $'|| upper( std_value) || '$ ' as replace_string,
    listagg('(^|\s)' || abbreviation || '(\s|$)', '|')
      within group (order by length(abbreviation) desc) as pattern,
    rank() over (order by std_value) as rn
  from std_gbr_co_suffix
  group by std_value
),
r (company_name, replacements, rn, std_company_name) as (
  select company_name, 0, 0, company_name from very_large_table
  union all
  select r.company_name, r.replacements + 1, p.rn, 
    regexp_replace(r.std_company_name, p.pattern, p.replace_string)
  from r
  join p on p.rn > r.rn and regexp_like(r.std_company_name, p.pattern)
),
t as (
  select company_name, std_company_name,
    rank() over (partition by company_name order by replacements desc) as rn
  from r
)
select company_name, std_company_name
from t
where rn = 1;

SQL Fiddle demo with a few test cases; be interested to know how it fairs against your 1000-odd test items, both in terms of correctness and performance.

Breaking it down a bit, the first CTE p:

p as (
  select ' $'|| upper( std_value) || '$ ' as replace_string,
    listagg('(^|\s)' || abbreviation || '(\s|$)', '|')
      within group (order by length(abbreviation) desc) as pattern,
    rank() over (order by std_value) as rn
  from std_gbr_co_suffix
  group by std_value
)

... generates the pattern and replacement strings for each distinct std_value (which is column1 in your original table):

REPLACE_STRING                   RN PATTERN                                          
------------------------ ---------- --------------------------------------------------
 $BAR$                            1 (^|\s)Ba(\s|$)                                     
 $COMPANY$                        2 (^|\s)Cmpy(\s|$)|(^|\s)Co(\s|$)                    
 $FOO$                            3 (^|\s)Fo(\s|$)                                     
 $LIMITED$                        4 (^|\s)LMT(\s|$)|(^|\s)Lim(\s|$)|(^|\s)Ltd(\s|$)

The second CTE r is the recursive one:

r (company_name, replacements, rn, std_company_name) as (
  select company_name, 0, 0, company_name from very_large_table
  union all
  select r.company_name, r.replacements + 1, p.rn, 
    regexp_replace(r.std_company_name, p.pattern, p.replace_string)
  from r
  join p on p.rn > r.rn and regexp_like(r.std_company_name, p.pattern)
)

It starts with the original values from your table as the anchor member, and then recursively applies each pattern from p. (The rn in the join is to stop it applying patterns both ways, in case an original name matches more than one; you'd do more work than necessary and end up with duplicate results without that). For my dummy data that gives:

COMPANY_NAME                   REPLACEMENTS         RN STD_COMPANY_NAME                                 
------------------------------ ------------ ---------- --------------------------------------------------
Oracle Co                                 0          0 Oracle Co                                          
Oracle Ltd                                0          0 Oracle Ltd                                         
Oracle Ltd.                               0          0 Oracle Ltd.                                        
Oracle Co Ltd.                            0          0 Oracle Co Ltd.                                     
Oracle Co Ltd Cmpy LMT                    0          0 Oracle Co Ltd Cmpy LMT                             
Oracle Co                                 1          2 Oracle $COMPANY$                                   
Oracle Ltd                                1          4 Oracle $LIMITED$                                   
Oracle Co Ltd Cmpy LMT                    1          2 Oracle $COMPANY$ Ltd $COMPANY$ LMT                 
Oracle Co Ltd Cmpy LMT                    1          4 Oracle Co $LIMITED$ Cmpy $LIMITED$                 
Oracle Co Ltd.                            1          2 Oracle $COMPANY$ Ltd.                              
Oracle Co Ltd Cmpy LMT                    2          4 Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$

You can see that an original name can have several substitutions, and the original value itself; specifically here:

Oracle Co Ltd Cmpy LMT                    0          0 Oracle Co Ltd Cmpy LMT                             
Oracle Co Ltd Cmpy LMT                    1          2 Oracle $COMPANY$ Ltd $COMPANY$ LMT                 
Oracle Co Ltd Cmpy LMT                    1          4 Oracle Co $LIMITED$ Cmpy $LIMITED$                 
Oracle Co Ltd Cmpy LMT                    2          4 Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$

The replacements value tracks how many regex's matched, so if multiple ones did then we can pick out which has the most - for an original value that has both Company and Limited patterns, there's one row for each of the individual replacements, and one that has both applied (again, only one, because of the rn check).

The final CTE t (yes, I struggled to give these meaningful names) just determines which row has the most replacements for each; each original gets one row ranked as 1:

t as (
  select company_name, std_company_name,
    rank() over (partition by company_name order by replacements desc) as rn
  from r
)

... and finally we exclude everything that wasn't ranked first, which leaves:

COMPANY_NAME                   STD_COMPANY_NAME                                 
------------------------------ --------------------------------------------------
Oracle Co                      Oracle $COMPANY$                                   
Oracle Co Ltd Cmpy LMT         Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$     
Oracle Co Ltd.                 Oracle $COMPANY$ Ltd.                              
Oracle Ltd                     Oracle $LIMITED$                                   
Oracle Ltd.                    Oracle Ltd.

You might be able to do something with the model clause instead, but I'm not very familiar with that...

If you do want to stick with a function, if only to learn how this could be done, then I don't think you want a ref cursor, but a bulk collect wouldn't hurt. You can declare a table type within your package that you populate once (per session), and then refer to that when you apply the regexp_replace calls in your function.

Again probably more complete than you really wanted, but there isn't that much to strip out... unless you stop now and just go and read up on PL/SQL collections and what makes a package stageful, in the PL/SQL Language Reference.

Your package specification can declare a record type to hold a regex pattern and replacement string, and a table of those records. And, crucially, a package-level variable of that table type. That will make the package stateful, and the state will be associated with your session - different people running this would have their own copy of that table variable.

create or replace package p42 as
  type regex_rec_type is record(pattern varchar2(4000),
    replace_string varchar2(50));
  type regex_tab_type is table of regex_rec_type;

  regexes regex_tab_type;

  function standardise_company_suffix(company_name in varchar2)
  return varchar2;
end p42;
/

The package body is in two parts; your function and an initialisation block:

create or replace package body p42 as
  function standardise_company_suffix(company_name in varchar2)
  return varchar2 is
    std_company_name varchar2(4000);
  begin
    std_company_name := company_name;
    for i in regexes.first..regexes.last loop
      std_company_name := regexp_replace(std_company_name,
        regexes(i).pattern, regexes(i).replace_string);
    end loop;
    return std_company_name;
  end standardise_company_suffix;

begin
  select listagg('(^|\s)' || abbreviation || '(\s|$)', '|')
      within group (order by length(abbreviation) desc),
    ' $'|| upper( std_value) || '$ '
  bulk collect into regexes
  from std_gbr_co_suffix
  group by std_value;
end p42;
/

The block first; that runs the same listagg query I used before, on the same value-pair table I had to replace yours, so it's getting the same patterns and replacement strings. This are put into the regexes variable declared in the package specification. That happens once per session, when the package is first referenced.

The function starts with the original company name and loops over the collection of regex records, applying each one in turn. And then it returns the final result. Pretty straightforward, but I'll leave you to look in the PL/SQL reference to see exactly what it's doing. I haven't bothered with the regexp_instr as I think it will be more expensive that just applying the replacement, but it's probably worth experimenting with that as I'm not sure.

With the same dummy data:

select company_name,
  p42.standardise_company_suffix(company_name) as std_company_name
from very_large_table;

COMPANY_NAME                   STD_COMPANY_NAME                                 
------------------------------ --------------------------------------------------
Oracle Co                      Oracle $COMPANY$                                   
Oracle Ltd                     Oracle $LIMITED$                                   
Oracle Ltd.                    Oracle Ltd.                                        
Oracle Co Ltd.                 Oracle $COMPANY$ Ltd.                              
Oracle Co Ltd Cmpy LMT         Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$

Although the code for this is arguably much simpler, I would expect this to be considerably slower over a large data set, but at least this should minimise the overhead as you were aiming to do. You'll still have context switches but you're not switching back to SQL within the PL/SQL now, which will help. Again I'd be interested in a performance comparison against what you have now, and the recursive CTE version.