All,
Whilst I’m not new to SQL it’s been a long long time and I’m very new to Oracle. Please note also I know that the way I have done this is probably the slowest possible solution, but as I said I’m new to this and it worked on the 1000 or so Items I needed to test. The problem is that it DID work so I want to expand this to test 1,000,000’s and I know this is a donkey not a race horse.
Had a small task to “Standardise” a list of company names, so I created a table with known abbreviations
Col1 Col2 Col3 Col4 …. Col14
Company Co Cmpy
Limited Ltd Lim LMT
Etc
I then wrote a function as to select this table into a cursor, and loop through it for each name sent to it and replace any “NON” standard abbreviations with that in Col 1, using the following:
What I have been thinking about is re-writing the code (obviously) and putting it into a package as I believe that that should result in considerable speed and clarity compared to now.
The problem I have is not really improving the looping part of the function but what I need to do from a Select that calls this function to ensure it loaded the STD_GBR_CO_SUFFIX only once and re-uses the cursor.
I've been reading some of the O'Reilly and I think that what I want to do is possible, I just can't figure out how.
I think I need to put this in a package, and use a REF Cursor, and BULK LOAD looks a contender, but the more I read the more confused I get. If someone out there could just could just point me in the right direction I can carry on from there. What I don't want is for someone to just write the solution I want to learn how to do this better not have it done for me.
Thank you all for your assistance in advance.
So if I do a
Select Standardise_Company_Suffix(company_name) AS STD_Company_Name
From VERY_LARGE_TABLE
CREATE OR REPLACE Function Standardise_Company_Suffix(Co_Name IN VARCHAR2)
RETURN varchar2 IS
stg_Co_Name varchar2(400 byte);
fmt_S varchar2(20 byte);
fmt_E varchar2(20 byte);
parse_Str varchar2(2400 byte);
parse_char varchar2(4 byte);
CURSOR C1 IS
-- Populate Cursor with Table of Suffixes
select * from STD_GBR_CO_SUFFIX;
BEGIN
parse_char := "s";
Fmt_S := '(^|\'||parse_char||')';
Fmt_E := '(\'||parse_char||'|$)';
stg_Co_Name := upper(co_name);
parse_str := ' ';
FOR c1pass IN C1 LOOP
parse_str :='';
If c1pass.column14 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column14)||Fmt_E; End If;
If c1pass.column13 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column13)||Fmt_E; End If;
If c1pass.column12 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column12)||Fmt_E; End If;
If c1pass.column11 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column11)||Fmt_E; End If;
If c1pass.column10 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column10)||Fmt_E; End If;
If c1pass.column9 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column9)||Fmt_E; End If;
If c1pass.column8 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column8)||Fmt_E; End If;
If c1pass.column7 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column7)||Fmt_E; End If;
If c1pass.column6 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column6)||Fmt_E; End If;
If c1pass.column5 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column5)||Fmt_E; End If;
If c1pass.column4 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column4)||Fmt_E; End If;
If c1pass.column3 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column3)||Fmt_E; End If;
If c1pass.column2 is not null then parse_str := parse_str||'|'||Fmt_S||Upper(c1pass.column2)||Fmt_E; End If;
Parse_str := substr(parse_str,2);
If regexp_instr(stg_Co_Name,parse_str) <> 0 Then
stg_Co_Name := regexp_REPLACE(stg_Co_Name,parse_str,
' $'||UPPER(c1pass.column1||'$ '));
Else
stg_Co_Name := stg_Co_Name;
End if;
END LOOP;
return stg_Co_Name;
End;
/
This isn't really the kind of direction you were asking for, but if possible I'd forget about the function - which will at least incur context switching, but probably quite a lot of other overhead - and try to do something in native SQL. I'd also change your look-up table as that isn't very flexible to have multiple columns. I'd have a single standard-abbreviation pair in each row:
STD_VALUE ABBREVIATION
-------------------- ------------
Bar Ba
Company Co
Company Cmpy
Foo Fo
Limited Ltd
Limited Lim
Limited LMT
This seems to work, at least for some basic test cases, but you would need to be on 11gR2, so you have both the listagg
function and recursive subquery factoring (a.k.a recursive CTE or recursive with
). I very much applaud the sentiment that you don't want someone to just write the solution, but I had fun figuring it out and it seems a shame to waste it. I've tried to explain what's going on below; hopefully you can learn from that still. And it will probably need some modification so you'll need to understand it...
with p as (
select ' $'|| upper( std_value) || '$ ' as replace_string,
listagg('(^|\s)' || abbreviation || '(\s|$)', '|')
within group (order by length(abbreviation) desc) as pattern,
rank() over (order by std_value) as rn
from std_gbr_co_suffix
group by std_value
),
r (company_name, replacements, rn, std_company_name) as (
select company_name, 0, 0, company_name from very_large_table
union all
select r.company_name, r.replacements + 1, p.rn,
regexp_replace(r.std_company_name, p.pattern, p.replace_string)
from r
join p on p.rn > r.rn and regexp_like(r.std_company_name, p.pattern)
),
t as (
select company_name, std_company_name,
rank() over (partition by company_name order by replacements desc) as rn
from r
)
select company_name, std_company_name
from t
where rn = 1;
SQL Fiddle demo with a few test cases; be interested to know how it fairs against your 1000-odd test items, both in terms of correctness and performance.
Breaking it down a bit, the first CTE p
:
p as (
select ' $'|| upper( std_value) || '$ ' as replace_string,
listagg('(^|\s)' || abbreviation || '(\s|$)', '|')
within group (order by length(abbreviation) desc) as pattern,
rank() over (order by std_value) as rn
from std_gbr_co_suffix
group by std_value
)
... generates the pattern and replacement strings for each distinct std_value
(which is column1
in your original table):
REPLACE_STRING RN PATTERN
------------------------ ---------- --------------------------------------------------
$BAR$ 1 (^|\s)Ba(\s|$)
$COMPANY$ 2 (^|\s)Cmpy(\s|$)|(^|\s)Co(\s|$)
$FOO$ 3 (^|\s)Fo(\s|$)
$LIMITED$ 4 (^|\s)LMT(\s|$)|(^|\s)Lim(\s|$)|(^|\s)Ltd(\s|$)
The second CTE r
is the recursive one:
r (company_name, replacements, rn, std_company_name) as (
select company_name, 0, 0, company_name from very_large_table
union all
select r.company_name, r.replacements + 1, p.rn,
regexp_replace(r.std_company_name, p.pattern, p.replace_string)
from r
join p on p.rn > r.rn and regexp_like(r.std_company_name, p.pattern)
)
It starts with the original values from your table as the anchor member, and then recursively applies each pattern from p
. (The rn
in the join is to stop it applying patterns both ways, in case an original name matches more than one; you'd do more work than necessary and end up with duplicate results without that). For my dummy data that gives:
COMPANY_NAME REPLACEMENTS RN STD_COMPANY_NAME
------------------------------ ------------ ---------- --------------------------------------------------
Oracle Co 0 0 Oracle Co
Oracle Ltd 0 0 Oracle Ltd
Oracle Ltd. 0 0 Oracle Ltd.
Oracle Co Ltd. 0 0 Oracle Co Ltd.
Oracle Co Ltd Cmpy LMT 0 0 Oracle Co Ltd Cmpy LMT
Oracle Co 1 2 Oracle $COMPANY$
Oracle Ltd 1 4 Oracle $LIMITED$
Oracle Co Ltd Cmpy LMT 1 2 Oracle $COMPANY$ Ltd $COMPANY$ LMT
Oracle Co Ltd Cmpy LMT 1 4 Oracle Co $LIMITED$ Cmpy $LIMITED$
Oracle Co Ltd. 1 2 Oracle $COMPANY$ Ltd.
Oracle Co Ltd Cmpy LMT 2 4 Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$
You can see that an original name can have several substitutions, and the original value itself; specifically here:
Oracle Co Ltd Cmpy LMT 0 0 Oracle Co Ltd Cmpy LMT
Oracle Co Ltd Cmpy LMT 1 2 Oracle $COMPANY$ Ltd $COMPANY$ LMT
Oracle Co Ltd Cmpy LMT 1 4 Oracle Co $LIMITED$ Cmpy $LIMITED$
Oracle Co Ltd Cmpy LMT 2 4 Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$
The replacements
value tracks how many regex's matched, so if multiple ones did then we can pick out which has the most - for an original value that has both Company
and Limited
patterns, there's one row for each of the individual replacements, and one that has both applied (again, only one, because of the rn
check).
The final CTE t
(yes, I struggled to give these meaningful names) just determines which row has the most replacements for each; each original gets one row ranked as 1
:
t as (
select company_name, std_company_name,
rank() over (partition by company_name order by replacements desc) as rn
from r
)
... and finally we exclude everything that wasn't ranked first, which leaves:
COMPANY_NAME STD_COMPANY_NAME
------------------------------ --------------------------------------------------
Oracle Co Oracle $COMPANY$
Oracle Co Ltd Cmpy LMT Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$
Oracle Co Ltd. Oracle $COMPANY$ Ltd.
Oracle Ltd Oracle $LIMITED$
Oracle Ltd. Oracle Ltd.
You might be able to do something with the model
clause instead, but I'm not very familiar with that...
If you do want to stick with a function, if only to learn how this could be done, then I don't think you want a ref cursor, but a bulk collect wouldn't hurt. You can declare a table type within your package that you populate once (per session), and then refer to that when you apply the regexp_replace
calls in your function.
Again probably more complete than you really wanted, but there isn't that much to strip out... unless you stop now and just go and read up on PL/SQL collections and what makes a package stageful, in the PL/SQL Language Reference.
Your package specification can declare a record type to hold a regex pattern and replacement string, and a table of those records. And, crucially, a package-level variable of that table type. That will make the package stateful, and the state will be associated with your session - different people running this would have their own copy of that table variable.
create or replace package p42 as
type regex_rec_type is record(pattern varchar2(4000),
replace_string varchar2(50));
type regex_tab_type is table of regex_rec_type;
regexes regex_tab_type;
function standardise_company_suffix(company_name in varchar2)
return varchar2;
end p42;
/
The package body is in two parts; your function and an initialisation block:
create or replace package body p42 as
function standardise_company_suffix(company_name in varchar2)
return varchar2 is
std_company_name varchar2(4000);
begin
std_company_name := company_name;
for i in regexes.first..regexes.last loop
std_company_name := regexp_replace(std_company_name,
regexes(i).pattern, regexes(i).replace_string);
end loop;
return std_company_name;
end standardise_company_suffix;
begin
select listagg('(^|\s)' || abbreviation || '(\s|$)', '|')
within group (order by length(abbreviation) desc),
' $'|| upper( std_value) || '$ '
bulk collect into regexes
from std_gbr_co_suffix
group by std_value;
end p42;
/
The block first; that runs the same listagg
query I used before, on the same value-pair table I had to replace yours, so it's getting the same patterns and replacement strings. This are put into the regexes
variable declared in the package specification. That happens once per session, when the package is first referenced.
The function starts with the original company name and loops over the collection of regex records, applying each one in turn. And then it returns the final result. Pretty straightforward, but I'll leave you to look in the PL/SQL reference to see exactly what it's doing. I haven't bothered with the regexp_instr
as I think it will be more expensive that just applying the replacement, but it's probably worth experimenting with that as I'm not sure.
With the same dummy data:
select company_name,
p42.standardise_company_suffix(company_name) as std_company_name
from very_large_table;
COMPANY_NAME STD_COMPANY_NAME
------------------------------ --------------------------------------------------
Oracle Co Oracle $COMPANY$
Oracle Ltd Oracle $LIMITED$
Oracle Ltd. Oracle Ltd.
Oracle Co Ltd. Oracle $COMPANY$ Ltd.
Oracle Co Ltd Cmpy LMT Oracle $COMPANY$ $LIMITED$ $COMPANY$ $LIMITED$
Although the code for this is arguably much simpler, I would expect this to be considerably slower over a large data set, but at least this should minimise the overhead as you were aiming to do. You'll still have context switches but you're not switching back to SQL within the PL/SQL now, which will help. Again I'd be interested in a performance comparison against what you have now, and the recursive CTE version.