I need a function:
f(fqdn,suffix) -> basedomain
with these example inputs and outputs:
f('foobar.quux.somedomain.com','com') -> 'somedomain.com'
f('somedomain.com','com') -> 'somedomain.com'
f('foobar.quux.somedomain.com.br','com.br') -> 'somedomain.com.br'
f('somedomain.com.br','com.br') -> 'somedomain.com.br'
In plain English, if the suffix
has n
segments, take the last n+1
segments. Find the base domain for the FQDN, allowing for the fact that some FQDNs have more than one suffix element.
The suffixes I need to match are here. I've already got them in my SQL database.
I could write this in C#; it might not be the most elegant but it would work. Unfortunately I would like to have this function in either T-SQL, where it is closest to the data, or in Powershell, which is where the rest of the utility that consumes this data is going to be. I suppose it would be ok to do it in C#, compile to an assembly and then access it from T-SQL, or even from Powershell ... if that would be the fastest executing. If there's some reasonably clever alternative in pure T-SQL or simple Powershell, I'd like that.
EDIT: One thing I forgot to mention explicitly (but which is clear when reviewing the suffix list, at my link above) is that we must pick the longest matching suffix. Both "br" and "com.br" appear in the suffix list (with similar things happening for uk, pt, etc). So the SQL has to use a window function to make sure the longest matching suffix is found.
Here is how far I got when I was doing the SQL. I had gotten lost in all the substring
/reverse
functions.
SELECT Domain, suffix
FROM (
SELECT SD.Domain, SL.suffix,
RN=ROW_NUMBER() OVER (
PARTITION BY sd.Domain ORDER BY LEN(SL.suffix) DESC)
FROM SiteDomains SD
INNER JOIN suffixlist SL ON SD.Domain LIKE '%.'+SL.suffix
) AS X
WHERE RN=1
This works ok for finding the right suffix. I'm a little concerned about its performance though.
The following demonstrates matching FQDNs with TLDs and extracting the desired n + 1 domain name segments:
-- Sample data.
declare @SampleTLDs as Table ( TLD VarChar(64) );
insert into @SampleTLDs ( TLD ) values
( 'com' ), ( 'somedomain.com' ), ( 'com.br' );
declare @SampleFQDNs as Table ( FQDN VarChar(64) );
insert into @SampleFQDNs ( FQDN ) values
( 'foobar.quux.somedomain.com' ), ( 'somedomain.com' ),
( 'foobar.quux.somedomain.com.br' ), ( 'somedomain.com.br' );
select * from @SampleTLDs;
select * from @SampleFQDNs;
-- Fiddle about.
select FQDN, TLD,
case
when DotPosition = 0 then FQDN
else Reverse( Left( ReversedPrefix, DotPosition - 1) ) + '.' + TLD
end as Result
from (
select FQDNs.FQDN, TLDs.TLD,
Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) as ReversedPrefix,
CharIndex( '.', Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) ) as DotPosition
from @SampleFQDNs as FQDNs inner join
@SampleTLDs as TLDs on FQDNs.FQDN like '%.' + TLDs.TLD or FQDNs.FQDN = TLDs.TLD ) as Edna;
-- To select only the longest matching TLD for each FQDN:
with
ExtendedFQDNs as (
select FQDNs.FQDN, TLDs.TLD, Row_Number() over ( partition by FQDN order by Len( TLDs.TLD ) desc ) as TLDLenRank,
Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) as ReversedPrefix,
CharIndex( '.', Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) ) as DotPosition
from @SampleFQDNs as FQDNs inner join
@SampleTLDs as TLDs on FQDNs.FQDN like '%.' + TLDs.TLD or FQDNs.FQDN = TLDs.TLD )
select FQDN, TLD,
case
when DotPosition = 0 then FQDN
else Reverse( Left( ReversedPrefix, DotPosition - 1) ) + '.' + TLD
end as Result
from ExtendedFQDNs
where TLDLenRank = 1;