We try to use Full text search on Azure database and got performance problems on using CONTAINS search.
Our data has star schema, Fact table has clustered column store index enabled and around 40 million rows. Below is how we use CONTAINS on dimension and do aggregation on Fact table on different queries:
Query 1 using EXISTS:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
WHERE EXISTS (
SELECT * FROM [SPENDBY].[DimCompanyCode] d
WHERE f.[FK_DimCompanyCodeId] = d.Id
AND CONTAINS(d.*, 'Comcast'))
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
This query seems run forever and never return the result.
There is non-clustered index on the foreign key FK_DimCompanyCodeId]
and there is only one row returned when searching Comcast
:
SELECT id FROM [SPENDBY].[DimCompanyCode] d
WHERE CONTAINS(d.*, 'Comcast');
-- will return id = 5
And there are around 27 million rows of Fact table which has FK_DimCompanyCodeId = 5
.
Query 2 using INNER JOIN:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
INNER JOIN [SPENDBY].[DimCompanyCode] d ON (f.[FK_DimCompanyCodeId] = d.Id)
WHERE CONTAINS(d.*, 'Comcast')
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
This query seems run forever and never return the result as well.
Query 3 using #temp table:
SELECT id INTO #temp FROM [SPENDBY].[DimCompanyCode] d
WHERE CONTAINS(d.*, 'Comcast');
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
WHERE EXISTS (
SELECT * FROM #temp
WHERE f.[FK_DimCompanyCodeId] = #temp.Id)
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
Very fast, returns the result after 5 seconds.
Why full text search is so slow for in case 1 and case 2.
Eventually, I figured out CONTAINS
works well on specific column (Description
for example):
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
WHERE f.[FK_DimCompanyCodeId] IN (
SELECT d.Id FROM [SPENDBY].[DimCompanyCode] d
WHERE CONTAINS(d.[Description], 'Comcast')
)
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
In order to search for the whole table, CONTAINSTABLE
will have the best performance and avoid using #temp table:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
LEFT OUTER JOIN CONTAINSTABLE([SPENDBY].[DimCompanyCode], *, '"Comcast"') ct
ON f.[FK_DimCompanyCodeId] = ct.[Key]
WHERE ct.[Key] IS NOT NULL
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC