I am trying to select same substring of multiple rows without specifying the substring first.
Basically find the common part in a string. The common parts in a string range from 2 to 5 words.
Here is what I am trying to achieve:
The goal is to find the longest/last common denominator of multiple rows, creating a group / a set. The substring at the end of a string that is unique to a single row in a set should be trimmed.
Rows with that have same substring from left are considered part of a set.
"Item A 1" and "Item A 2" are part of "Item A" set.
"Item B 1" and "Item B 2" are part of "Item B" set.
The sets can include multiple parts from string.
"Item A A 1" and "Item A A 2" are part of "Item A A" set.
"Item A B 1" and "Item A B 2" are part of "Item A B" set.
Also value of the row might be just the name of a set.
"Item A" and "Item A 1" are part of "Item A" set.
"Item A B" and "Item A B 1" are part of "Item A B" set.
id | item | set |
---|---|---|
1 | DEP Item 1 | DEP Item |
2 | DEP Item 10 | DEP Item |
3 | DEP Item 11 | DEP Item |
4 | DEP Item 24 | DEP Item |
5 | DEP G1 Item B 1 | DEP G1 Item B |
6 | DEP G1 Item B 10 | DEP G1 Item B |
7 | DEP G1 Item B 11 | DEP G1 Item B |
8 | DEP G1 Item B 24 | DEP G1 Item B |
9 | DEP A Item B 1 1 A | DEP A Item B 1 |
10 | DEP A Item B 1 10 | DEP A Item B 1 |
11 | DEP A Item B 1 11 B | DEP A Item B 1 |
12 | DEP A Item B 1 24 A B C | DEP A Item B 1 |
Here is sample data and a query I've been trying, which sort of gives correct results, but not in the required way.
CREATE TABLE #temp (
id INT,
item NVARCHAR(50)
);
INSERT INTO #temp (id, item) VALUES
(1,'DEP Item 1'),
(2,'DEP Item 10'),
(3,'DEP Item 11'),
(4,'DEP Item 24'),
(5,'DEP G1 Item B 1'),
(6,'DEP G1 Item B 10'),
(7,'DEP G1 Item B 11'),
(8,'DEP G1 Item B 24'),
(9,'DEP A Item B 1 1 A'),
(10,'DEP A Item B 1 10'),
(11,'DEP A Item B 1 11 B'),
(12,'DEP A Item B 1 24 A B C')
select *,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 1 THEN item
ELSE LEFT(item, CHARINDEX(' ', item,
CHARINDEX(' ', item)+1))
end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 2 THEN item
ELSE LEFT(item, CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item)+1)+1))
end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 4 THEN item
ELSE LEFT(item, CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item)+1)+1)+1))
end,
CASE WHEN LEN(item)-LEN(REPLACE(item, ' ', '')) < 5 THEN item
ELSE LEFT(item, CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item,
CHARINDEX(' ', item)+1)+1)+1)+1))
end
from #temp
First, we make up all possible combinations of word sets for each id(item) where order of words is remains - this is path's
.
6, DEP G1 Item B 1
->(DEP), (DEP,G1), (DEP,G1,Item), (DEP,G1,Item,B), (DEP,G1, Item,B,1)
9, DEP G1 Item B 11
->(DEP), (DEP,G1), (DEP,G1,Item), (DEP,G1,Item,B), (DEP,G1, Item,B,11)
Then select rows with common path's - thru comparing path's for every item with all item's path's - select rows with inner self join on t1.path=t2.path and t1.id<>t2.id.
And last - take greatest path's (order by count of words in path desc) for items.
See example
with r as(
select id,1 n,item
,case when charindex(' ',item)>0 then
trim(substring(item,charindex(' ',item)+1,100))
else ''
end rest
,cast(case when charindex(' ',item)>0 then
trim(substring(item,1,charindex(' ',item)-1))
else trim(item)
end as varchar(100)) path
from #temp
union all
select id,n+1 n,item
,case when charindex(' ',rest)>0 then
trim(substring(rest,charindex(' ',rest)+1,100))
else ''
end rest
,cast(concat(path,','
,case when charindex(' ',rest)>0 then
trim(substring(rest,1,charindex(' ',rest)-1))
else trim(rest)
end) as varchar(100)) path
from r where len(rest)>0
)
select s.path,n,t.id,t.item
from #temp t
left join (
select t1.id,t1.path,t1.n
,row_number()over(partition by t1.id order by t1.n desc) rn
from r t1
inner join r t2 on t1.path=t2.path and t1.id<>t2.id
) s on s.rn=1 and s.id=t.id
order by n,path,t.id
output is
path (set ) |
n | id | item |
---|---|---|---|
DEP | 1 | 1 | DEP |
DEP,Item | 2 | 2 | DEP Item 1 |
DEP,Item | 2 | 3 | DEP Item 10 |
DEP,Item | 2 | 5 | DEP Item 24 |
DEP,G1,Item,B | 4 | 6 | DEP G1 Item B 1 |
DEP,G1,Item,B | 4 | 8 | DEP G1 Item B 11 |
DEP,G1,Item,B | 4 | 9 | DEP G1 Item B 24 |
DEP,A,Item,B,1 | 5 | 11 | DEP A Item B 1 10 |
DEP,A,Item,B,1 | 5 | 13 | DEP A Item B 1 24 A B C |
DEP,A,Item,B,1,1,A | 7 | 10 | DEP A Item B 1 1 A |
DEP,A,Item,B,1,1,A | 7 | 14 | DEP A Item B 1 1 A B C |
null | null | 15 | Hello world |
There item id=15 not have common words with other rows.
on test data
CREATE TABLE #temp (id INT, item NVARCHAR(50));
INSERT INTO #temp (id, item) VALUES
(1,'DEP'),
(2,'DEP Item 1'),
(3,'DEP Item 10'),
--(4,'DEP Item 11'),
(5,'DEP Item 24'),
(6,'DEP G1 Item B 1'),
--(7,'DEP G1 Item B 10'),
(8,'DEP G1 Item B 11'),
(9,'DEP G1 Item B 24'),
(10,'DEP A Item B 1 1 A'),
(11,'DEP A Item B 1 10'),
--(12,'DEP A Item B 1 11 B'),
(13,'DEP A Item B 1 24 A B C'),
(14,'DEP A Item B 1 1 A B C'),
(15,'Hello world')
;