I am writing my own web crawler. Currently I store URLs directly as Uri.AbsoluteUri, so when I check whether a URL has already been added, I query the database directly: SELECT PageId FROM mytable WHERE Url = 'absoluteurl'. I suspect this is putting extra stress on the database, because my Core i7 @ 4.5 GHz CPU is almost at 100% the whole time.
It occurred to me that if I also stored MD5 hashes of the URLs in the database and looked those up instead, it might speed up the lookups.
So I'm waiting for your ideas. What would be the best approach for checking whether a URL already exists in the database?
C# 4.0, MS SQL Server 2008
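For reference, this is roughly the shape of the table and lookup I have now (the table, column, and index names here are only illustrative):

```sql
-- Illustrative schema for the current approach: URL stored and indexed as plain text.
CREATE TABLE tblPages (
    PageId INT IDENTITY(1,1) PRIMARY KEY,
    Url    VARCHAR(4000) NOT NULL
);
CREATE INDEX IX_tblPages_Url ON tblPages(Url);

-- Existence check, run once per discovered link:
SELECT PageId FROM tblPages WHERE Url = @url;
```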
EXAMPLE:
Since you already have an index on the Url column, my guess is that the SELECT (get PageId), followed by an INSERT if the URL doesn't exist, is what's causing the CPU to peak. If your crawler has multiple threads running, you might also be taxing SQL Server's concurrency/locking mechanisms on tblPages.
With regard to your specific question, I would use CHECKSUM (a CRC) instead of HASHBYTES (MD5). CHECKSUM is faster, and it returns an INT instead of a VARBINARY, so it is easier/faster to index.
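You can see the difference in return types with a quick comparison (the URL literal here is just an example):

```sql
-- CHECKSUM returns an INT; HASHBYTES('MD5', ...) returns a VARBINARY(16).
SELECT CHECKSUM('http://example.com/page')        AS CrcValue,
       HASHBYTES('MD5', 'http://example.com/page') AS Md5Value;
```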
However, precisely because CHECKSUM returns an INT it is prone to collisions so you should also search the URL as an AND clause.
SELECT PageId FROM tblPages WHERE HashedUrl=CHECKSUM(@url) AND PageUrl=@url
Now ONLY put a column index on HashedUrl (not PageUrl). The index will have to be NON-UNIQUE because of the possibility of collisions. This will give you the fastest INSERTs and SELECTs until the table row count gets well over 4 billion, at which point the number of INT CHECKSUM collisions will cause a lot of partial table scans on the un-indexed PageUrl column.
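Put together, the schema and lookup/insert pattern I'm suggesting would look roughly like this (table and index names are just examples):

```sql
-- Sketch of the suggested schema: the INT hash column is indexed, the URL column is not.
CREATE TABLE tblPages (
    PageId    INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
    HashedUrl INT NOT NULL,
    PageUrl   VARCHAR(4000) NOT NULL
);
-- NON-UNIQUE index on the hash only, because CHECKSUM can collide.
CREATE INDEX IX_tblPages_HashedUrl ON tblPages(HashedUrl);

-- Lookup: seek on the INT hash, then confirm the actual URL.
SELECT PageId FROM tblPages
WHERE HashedUrl = CHECKSUM(@url) AND PageUrl = @url;

-- Insert a new URL only if the lookup found nothing:
IF NOT EXISTS (SELECT 1 FROM tblPages
               WHERE HashedUrl = CHECKSUM(@url) AND PageUrl = @url)
    INSERT INTO tblPages(HashedUrl, PageUrl) VALUES (CHECKSUM(@url), @url);
```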
UPDATE
Here's the simple benchmark code I used
GO
/* NORMAL METHOD */
BEGIN
SET STATISTICS TIME ON
--
IF EXISTS(SELECT * FROM tempdb.dbo.sysobjects WHERE ID = OBJECT_ID(N'tempdb..#Store1'))
BEGIN
DROP TABLE #Store1
END
-- Normal
CREATE TABLE #Store1 (Id INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED, Data VARCHAR(4000))
CREATE UNIQUE CLUSTERED INDEX CIX_STORE1_DATA ON #Store1(Data)
-- Help Create Data
DECLARE @Data TABLE(Data VARCHAR(4000))
INSERT INTO @Data(Data) VALUES ('red.'), ('YELLOW/'), ('green'), ('.BLUE'), ('/violet'), ('PURPLE-'), ('-orange')
-- The data set we'll use for testing
INSERT INTO @Data
SELECT a.Data + b.Data + c.Data + d.Data + e.Data + f.Data + g.Data
FROM @Data a, @Data b, @Data c, @Data d, @Data e, @Data f, @Data g
-- INSERTION TESTS
PRINT('INSERT INTO NORMAL')
INSERT INTO #Store1(Data)
SELECT Data FROM @Data
-- SELECTION TESTS
PRINT('SELECT FROM NORMAL')
SELECT TOP 5000 d.Data, (SELECT s.Id FROM #Store1 s WHERE s.Data = d.Data) FROM @Data d
--
SET STATISTICS TIME OFF
END
GO
/* USING YOUR OWN CHECKSUM/HASH */
BEGIN
SET STATISTICS TIME ON
--
IF EXISTS(SELECT * FROM tempdb.dbo.sysobjects WHERE ID = OBJECT_ID(N'tempdb..#Store2'))
BEGIN
DROP TABLE #Store2
END
-- With Hash
CREATE TABLE #Store2 (Id INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED, Hsh INT, Data VARCHAR(4000))
CREATE CLUSTERED INDEX CIX_STORE2_CRC ON #Store2(Hsh)
-- Help Create Data
DECLARE @Data TABLE(Data VARCHAR(4000))
INSERT INTO @Data(Data) VALUES ('red.'), ('YELLOW/'), ('green'), ('.BLUE'), ('/violet'), ('PURPLE-'), ('-orange')
-- The data set we'll use for testing
INSERT INTO @Data
SELECT a.Data + b.Data + c.Data + d.Data + e.Data + f.Data + g.Data
FROM @Data a, @Data b, @Data c, @Data d, @Data e, @Data f, @Data g
-- INSERTION TESTS
PRINT('INSERT INTO CHECKSUM/HASH')
INSERT INTO #Store2(Hsh, Data)
SELECT CHECKSUM(Data), Data FROM @Data
-- SELECTION TESTS
PRINT('SELECT FROM CHECKSUM/HASH')
SELECT TOP 5000 d.Data, (SELECT s.Id FROM #Store2 s WHERE Hsh = CHECKSUM(d.Data) AND Data = d.Data) FROM @Data d
--
SET STATISTICS TIME OFF
END
Results (in brief): my method achieves ~30% faster INSERTs ("elapsed time = 7339 ms" vs "elapsed time = 10318 ms"), but ~30% slower SELECTs ("elapsed time = 37 ms" vs "elapsed time = 28 ms").
Another interesting note: you can't "correctly" index a URL VARCHAR field, because its length (~4 KB in practice) can exceed 900 bytes, SQL Server 2008's maximum allowable index key size. While SQL Server only issues a warning for this, the warning does note that some INSERTs/UPDATEs could potentially fail.
Warning! The maximum key length is 900 bytes. The index 'CIX_STORE1_DATA' has maximum length of 4000 bytes. For some combination of large values, the insert/update operation will fail.
I'm not a SQL guru per se, and perhaps my testing method isn't the most accurate or useful, but the topic is very interesting with regard to seemingly sensible user-end optimization versus the "black box" of the engine.