mysql, sql, aws-lambda, mysql-5.6

MySQL 5.6 - how to prevent reads on rows selected by a previous SQL statement PLUS any other row which shares a duplicate value in a given column


I've got a very specific and quite complex need to prevent reads coming from massively concurrent requests (same second, sometimes same millisecond) that come from distinct servers (to be precise, they're AWS Lambdas) on a table called Hobby_ideas_articles.

Set-up:

  • MySQL 5.6

  • on AWS Aurora Serverless MySQL (autocommit is off by default)

I have of course read a lot of posts about row locks and think they might be part of the solution, but I don't think I'm in the basic SELECT ... FOR UPDATE case.

My table is Hobby_ideas_articles and has records such as:

hobby_idea_article_id= 1,
hobby_id = 6
url= 'http://exo.example.com',
author = '[email protected]'

hobby_idea_article_id= 2,
hobby_id = 3
url= 'http://exo.example2.com',
author = '[email protected]'

hobby_idea_article_id= 3,
hobby_id = 7
url= 'http://exo.example3.com',
author = '[email protected]'

and I have another table called Past_Customer_sent_messages where records look like :

past_customer_sent_message_id = 5
hobby_id = 7,
customer_id = 4,
recipient = "[email protected]",
sent_at= "2019-09-10 00:00:00"

past_customer_sent_message_id = 6
hobby_id = 999,
customer_id = 4,
recipient = "[email protected]",
sent_at= "2019-09-18 00:00:00"

past_customer_sent_message_id = 7
hobby_id = 999,
customer_id = 4,
recipient = "[email protected]",
sent_at= "2019-07-18 00:00:00"

I currently have a working SQL statement which, based on 2 inputs (hobby_id and customer_id, with different values in each lambda), fetches all Hobby_ideas_articles with the given hobby_id and filters out any result when a message has been sent recently to the author (by any customer within x hours, or by the specific customer_id within y days). For more details on the specifics of these conditions/restrictions, see: MySQL - Select data with a JOIN but with WHERE clauses applying to complex and distinct perimeters.

SELECT             
          hia.hobby_idea_article_id,
          hia.hobby_id,
          hia.url,
          hia.author,
          ces.sent_at
FROM
          Hobby_ideas_articles hia
LEFT JOIN
          Past_Customer_sent_messages ces
ON
          hia.author = ces.recipient 

WHERE
          hia.hobby_id = HOBBY_ID_INPUT_I_HAVE AND         
          hia.author IS NOT NULL
          AND hia.author NOT IN (
            SELECT recipient
            FROM Past_Customer_sent_messages
            WHERE 
              (
                customer_id = CUSTOMER_ID_INPUT_I_HAVE
                AND sent_at > DATE_SUB(NOW(), INTERVAL 30 DAY)
              ) OR
              (
                sent_at > DATE_SUB(NOW(), INTERVAL 3 HOUR)
              )
          )
GROUP BY hia.author
ORDER BY hia.hobby_idea_article_id ASC
LIMIT 20

That means, for example:

  • at 10:05:03 am, a lambda will execute the statement for hobby_id= 4 and customer_id= 7
  • at 10:05:04 am, barely a moment later, another lambda will execute the statement for hobby_id= 12 and customer_id= 8 ...and so on...

The only "business logic" guarantee is that I will never have 2 concurrent lambdas with the same input couple (hobby_id, customer_id).

So this question is about how to make sure the same recipient NEVER receives two emails in quick succession (one just a few seconds after the other) when dealing with requests coming from massively concurrent lambdas.

An illustration of the problem is:

...which means I'll send [email protected] and [email protected] an email a few seconds later (performed by another lambda which takes care of the emailing with the data passed to it)

  • at 10:05:03 am, a parallel lambda running in the very same second/ms executes the SQL statement for hobby_id= 4 and customer_id= 7 (indeed, I can have 8 customers wanting ideas about the hobby "fishing" with id=4!). This lambda retrieves roughly the same data as the first lambda (as you can see in the SQL statement, the customer_id input is only used to filter out authors who have already received a message from this specific customer). Let's say, for the sake of the example, that it filters out john because john was already messaged 12 days ago by the customer with customer_id= 7, so the retrieved data here is:

    hobby_idea_article_id= 3,
    hobby_id = 4
    url= 'http://exo.example3.com',
    author = '[email protected]'
    

which means I'll send [email protected] an email a few seconds later (performed by another lambda which was passed this data)

Here is the issue: [email protected] is going to receive 2 quick emails, and I absolutely don't want to allow such a thing. The protection I have in place inside the current SQL statement (see conditions 1 and 2 explained here) only protects against these repeated quick emails when it can use the persisted information about emails already sent in Past_Customer_sent_messages. But as this happens so concurrently, the second lambda won't see that a message to [email protected] has already been sent (or, more precisely, "is going to be sent" by another lambda a few seconds later). I need to make sure the second lambda will NOT output the hobby_idea with author=eric to prevent such double emailing.

I have two ideas for solutions, but I think the second one is better as there's an issue with the first one.

1. Solution 1 - Use a row lock with SELECT ... FOR UPDATE?

This way, when the first lambda hits the SQL, it will prevent READs on all the rows in the query's output, making them, if I understand correctly, "invisible" to any subsequent SELECT. That means if the second lambda arrives concurrently, the rows returned by the first lambda's SQL statement will not even be considered/found!

After reading up, I thought about doing it in a transaction: marking ALL hobby_idea_articles returned by the first SQL statement with a status column "currently_locked_for_emailing" set to true, and then unlocking by committing the transaction.

Then, when I have actually sent the email from another lambda, AND only after having actually persisted the data about this sent email in the Past_Customer_sent_messages table, I'll change the status of 'currently_locked_for_emailing' back to false.

The row lock would be useful for me in that context to make sure that, during the few milliseconds while I'm updating the status, no other lambda can read the data.

Would the SQL statement below work? Notice the transaction and the new WHERE clause on 'currently_locked_for_emailing'.

-- (A) start a new transaction
START TRANSACTION;

-- (B) Select the candidate rows and lock them
SELECT             
          hia.hobby_idea_article_id,
          hia.hobby_id,
          hia.url,
          hia.author,
          ces.sent_at
FROM
          Hobby_ideas_articles hia
LEFT JOIN
          Past_Customer_sent_messages ces
ON
          hia.author = ces.recipient      
WHERE
          hia.hobby_id = HOBBY_ID_INPUT_I_HAVE AND         
          hia.author IS NOT NULL              
          AND hia.author NOT IN (
            SELECT recipient
            FROM Past_Customer_sent_messages
            WHERE 
              (
                customer_id = CUSTOMER_ID_INPUT_I_HAVE
                AND sent_at > DATE_SUB(NOW(), INTERVAL 30 DAY)
              ) OR
              (
                sent_at > DATE_SUB(NOW(), INTERVAL 3 HOUR)
              )
          ) AND
          # NEW CLAUSE ON currently_locked_for_emailing 
          # THAT GOES ALONG WITH THE ROW LOCK STRATEGY
          hia.currently_locked_for_emailing = false
GROUP BY hia.author
ORDER BY hia.hobby_idea_article_id ASC
LIMIT 20

# ADD THE NEW FOR UPDATE FOR THE ROW LOCK
FOR UPDATE;

-- (C). Update the column `currently_locked_for_emailing` to `true`

UPDATE Hobby_ideas_articles
SET currently_locked_for_emailing = true
WHERE
  # how to say "do it for all the same rows which are the result of the
  # previous SQL statement above" (see (B))?

-- (D) commit changes    
COMMIT;

1.1 Can you help me fix the SQL code above?

1.2 It feels wrong to update currently_locked_for_emailing to true after having put the lock but how to do it before?

1.3 Also, I don't know how to express 'please change currently_locked_for_emailing to true for all the rows which are the result of the SELECT in (B) above'.

1.4 How do I "unlock" the transaction? Just after updating the currently_locked_for_emailing status, I'm OK to unlock it for reads and writes, but how do I do this? I don't want to wait for the end of the connection with the server. Can you please confirm that the lock will be REMOVED as soon as the transaction reaches 'COMMIT' at (D)?

1.5 Is it correct to say that the code above only locks the rows which are the resulting output of the SELECT, and not ALL the rows of the whole table? If yes, does it mean that, by using LIMIT 20, it will only lock the 20 rows of the result and not all the matching rows (I mean those corresponding to the WHERE clause)? That would be fine, but I'd like to be sure of this.

1.6 I read in a lot of SO posts (here) that for a row lock to work, you must absolutely have an index... One person even says here: "My own tests show that using for update with where filters on non-indexed columns results in whole-table locking, while with where filters on indexed columns results in the desired behaviour of filtered row locking." Is that true? If so, what should I put the index on? It's not as if my WHERE is a simple one on 1 or 2 columns... An index on all the columns of my WHERE clauses would be crazy complex, no?

2. Solution 2 - Complement the SELECT ... FOR UPDATE, because even if I get 1. right, I still have an important issue:

If I understand correctly, a 'row lock' locks ALL the rows which were in the result of the SELECT, and here is the issue: the lock I really need is not only on the rows in the result of the SELECT, but also on ANY row whose author has the same value as a row in the result of the SELECT.

Let me explain why with an example, where I take the same data as 1.

  • at 10:05:03 am, a lambda executes the SQL statement for hobby_id= 4 and customer_id= 3 and retrieves this data:

    hobby_idea_article_id= 2,
    hobby_id = 4
    url= 'http://exo.example2.com',
    author = '[email protected]'
    
    hobby_idea_article_id= 3,
    hobby_id = 4
    url= 'http://exo.example3.com',
    author = '[email protected]'
    

...which means I'll send [email protected] and [email protected] an email a few seconds later (performed by another lambda which was passed this data)

  • with the row lock solution of 1. implemented, we now know that a second lambda would NOT be able to select the 2 records above (with hobby_idea_article_id 2 and 3) (cool!), because:
    • either it would run into the row lock (these rows are invisible to it) if things happen very, very concurrently,
    • OR it would not select them because they now have 'currently_locked_for_emailing' = true (see the WHERE clause currently_locked_for_emailing = false in the new SQL statement),
    • OR the email has been sent and we have already persisted that fact in Past_Customer_sent_messages.

...But I still have a BIG issue.

  • at 10:05:03 am, a second lambda executes the SQL statement for hobby_id= 9 (this is ANOTHER hobby, which is core to my issue) and customer_id= 13 and retrieves this data:

    hobby_idea_article_id= 4,
    hobby_id = 9 //the hobby_id is DIFFERENT from the one above
    url= 'http://exo.example3.com',
    author = '[email protected]'//but the email recipient is still [email protected] !!!! 
    

As you can see, we have a peculiar situation here, as the row lock strategy does not work: I would like this second lambda NOT to grab this data because the author is the same ([email protected]), but the row was NOT locked by the first SQL statement, nor assigned currently_locked_for_emailing= true, because the first SQL statement had a WHERE clause for hobby_id=4... and here it's a different hobby_id! So the row was never locked, the row with hobby_idea_article_id= 4 will be grabbed, and I risk sending an email to the same recipient a few milliseconds later.

So I'm not sure how to do this, but maybe I need something like a combined row lock, or maybe two row locks (not sure how this would work), that would put a 'row lock' (until I update with currently_locked_for_emailing = true) on:

    1. first, the rows which are the 'resulting rows of the SQL statement SELECT'
    2. but also ANY OTHER row of Hobby_ideas_articles that has the same 'author' value as ONE of the resulting rows of the SELECT

On the rows of both 1. and 2., I'd apply the strategy of a transaction and setting currently_locked_for_emailing to true (until the actual email is sent and I have persisted this fact in Past_Customer_sent_messages).

Is that the right approach? How do I do that in SQL?

Disclaimer: I come from a Rails background where I had an ORM (Active Record) making all the chains/joins work more easily and automagically, and I am quite lost here with these complex SQL statements.


Solution

  • I must confess I haven't fully read your question, as it's huge, but I think I have an idea of what you're after. Isn't the solution to separate the sending part from the SQL part? Create a new table called queue and insert all actions into it. You then run a separate cron/task which sends the emails, as long as a particular user hasn't been contacted in the last X minutes. That way you can preserve a sense of uniqueness.
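
A minimal sketch of how such a queue might look, under stated assumptions: the table name Email_queue, its columns, and the UNIQUE key on recipient are all hypothetical (none of them appear in the question), the literal values 4, 7 and 1 are example inputs, and Past_Customer_sent_messages is assumed to generate its own primary key. The idea is that the UNIQUE key, rather than a row lock, is what stops two concurrent lambdas from queueing the same author.

```sql
-- Hypothetical queue table: at most one pending row per recipient.
-- VARCHAR(191) keeps the unique index within InnoDB's 767-byte key limit.
CREATE TABLE Email_queue (
  email_queue_id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
  recipient             VARCHAR(191) NOT NULL,
  customer_id           INT NOT NULL,
  hobby_id              INT NOT NULL,
  hobby_idea_article_id INT NOT NULL,
  queued_at             DATETIME NOT NULL,
  PRIMARY KEY (email_queue_id),
  UNIQUE KEY uq_email_queue_recipient (recipient)
) ENGINE=InnoDB;

-- Each lambda enqueues its candidates (example inputs: hobby_id 4, customer_id 7).
-- INSERT IGNORE skips any author that another lambda has already queued,
-- whatever hobby_id that other lambda was working on.
-- The NOT IN filter from the question would be added to this WHERE clause.
INSERT IGNORE INTO Email_queue
  (recipient, customer_id, hobby_id, hobby_idea_article_id, queued_at)
SELECT hia.author, 7, 4, MIN(hia.hobby_idea_article_id), NOW()
FROM Hobby_ideas_articles hia
WHERE hia.hobby_id = 4
  AND hia.author IS NOT NULL
GROUP BY hia.author
ORDER BY MIN(hia.hobby_idea_article_id)
LIMIT 20;

-- The separate sender task, after emailing the queued row with id 1 (example id):
-- record the send and release the recipient in one transaction.
START TRANSACTION;
INSERT INTO Past_Customer_sent_messages (hobby_id, customer_id, recipient, sent_at)
SELECT hobby_id, customer_id, recipient, NOW()
FROM Email_queue
WHERE email_queue_id = 1;
DELETE FROM Email_queue WHERE email_queue_id = 1;
COMMIT;
```

With this shape, a recipient stays in the queue until the send has been persisted in Past_Customer_sent_messages, so the existing NOT IN filter takes over as soon as the queue row is deleted. Note that INSERT IGNORE also downgrades other insert errors to warnings, so it may be worth checking the affected-row count to learn which authors were actually queued.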