Tags: mysql, locking, web-crawler, mysql-error-1093

How do I lock read/write to MySQL tables so that I can select and then insert without other programs reading/writing to the database?


I am running many instances of a webcrawler in parallel.

Each crawler selects a domain from a table, inserts that URL and a start time into a log table, and then starts crawling the domain.

Other parallel crawlers check the log table to see what domains are already being crawled before selecting their own domain to crawl.

I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is to lock the database against all other reads and writes while one crawler selects a domain and inserts a row in the log table (two queries).

How the heck does one do this? I'm afraid this is terribly complex and relies on many other things. Please help get me started.


This code seems like a good solution, but it fails with the error shown after it:

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT companies.id FROM companies
        LEFT OUTER JOIN crawlLog
        ON companies.id = crawlLog.companyId
        WHERE crawlLog.companyId IS NULL
        LIMIT 1
    ),
    now()
)

but I keep getting the following MySQL error:

You can't specify target table 'crawlLog' for update in FROM clause

Is there a way to accomplish the same thing without this problem? I've tried a couple of different ways, including this:

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT id
        FROM companies
        WHERE id NOT IN (SELECT companyId FROM crawlLog) LIMIT 1
    ),
    now()
)

Solution

  • I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.

    INSERT INTO crawlLog (companyId, timeStartCrawling)
    SELECT id, now()
    FROM companies
    WHERE id NOT IN
    (
        SELECT companyId
        FROM crawlLog AS crawlLogAlias
    )
    LIMIT 1