Tags: php, mysql, repeat, large-data-volumes

How to query, then process, a large quantity of rows in a somewhat short time frame


I have a large table of orders. At a set time, I need to send an SMS message (different content for each one, unfortunately) to a large chunk of them (based on whether they've opted in for the message and included a phone number). It could be 200,000+ numbers in a fairly short time frame. (It's not really that high now, but it theoretically could be, and I want to build for that.)

They don't all have to send instantly, but sending within 1-3 hours of the set time would be ideal.

I'm using a service with an API to send them, so that's not the problem - the problems are:

1) How to handle the large quantity

2) How to know they all got processed, or re-process the ones that didn't

I don't think it would be a good idea to do a MySQL query to get all 200,000+ phone numbers, then loop through them - I'd have to assume that would suck up a lot of memory(?).

So - I thought to try a cron job and have it run every minute (or so). In that script, I could pull maybe 5,000 records, mark them as "in-process", loop through them processing each one, then update its row to "sent".

But that has so many potential issues... What if the SMS service slows down and I can't post them all? What if there's an unforeseen error and the script stops partway through? Etc. If any number of things causes that script to not send all 5,000 records, how would I know which ones to go back and re-process?

It's not only THIS process, but a few others we have - the overall question keeps coming up of how to handle a large number of rows that we need to process in some way and know that each one got done.

I hope I'm just over-complicating the crap out of this, and that there's some simpler way to handle it.

If I'm not clear, please comment and I'd be glad to explain any aspect further.


Solution

  • The short version:

    1. Don't worry about memory consumption. Just don't try to fetch the entire result-set at once.

    2. Your idea of using a separate table to list every text message and then updating each row when you know whether it succeeded is generally the correct approach (whether you do it from cron or not doesn't really matter).

    3. If you are concerned that your SMS provider might drop some of your requests, then you can implement your own queuing mechanism using ActiveMQ or something similar. However, that kind of defeats a large part of the purpose of using a provider. They should be using their own queue so that you do not need to worry about it.

    Details:

    The SMS service should be notifying you of success or failure. Most high-volume SMS services queue your messages and send them out in chunks of n messages at a time. Then they notify you, via some sort of callback or webhook, which messages succeeded and which failed. Most of them also provide APIs that let you check whether a given message has been sent. You need to leverage those features.
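    For example, with Twilio (the provider the example script further down hints at), each send returns a message SID, and you can later fetch that message's delivery status through the REST API. This is only a sketch using Twilio's PHP SDK; the credentials and SID are placeholders:

    require "vendor/autoload.php"; //composer require twilio/sdk

    use Twilio\Rest\Client;

    $client = new Client("ACxxxxxxxx", "your_auth_token"); //placeholder credentials

    //fetch the current status of a message you sent earlier, by its SID
    $message = $client->messages("SMxxxxxxxx")->fetch();

    //Twilio statuses include: queued, sending, sent, delivered, undelivered, failed
    if (in_array($message->status, array("undelivered", "failed"))) {
        //flag the corresponding row for re-processing
    }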

    I think you're on the right track with your cron approach.

    One option is to never "pull" the records. Instead, have a column on the existing table that specifies whether a row is waiting for a message to send. This way, instead of doing a SELECT and dealing with hundreds of thousands of rows, you do a simple UPDATE, and then, as each callback comes from the API, you re-update the rows with success/failure.
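    As a concrete sketch of that flag-column approach (the table and column names here are invented for illustration), each cron pass could first reclaim anything a crashed run left stuck, then claim a batch with a single UPDATE:

    -- reclaim rows a previous crashed/timed-out run left behind
    UPDATE Orders
       SET sms_status = 'pending'
     WHERE sms_status = 'sending'
       AND sms_claimed_at < NOW() - INTERVAL 30 MINUTE;

    -- then claim up to 5000 rows for this pass
    UPDATE Orders
       SET sms_status = 'sending', sms_claimed_at = NOW()
     WHERE sms_status = 'pending'
       AND opted_in = 1
       AND phone IS NOT NULL
     LIMIT 5000;

    This also answers the "which ones do I re-process?" worry: anything still marked 'sending' after a generous timeout simply goes back to 'pending'.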

    If you are in a situation where you might send multiple messages at a time for each row of data, then obviously a single flag column won't work. You will have to have a separate table with a row for every message that you want to track.
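    If you do go the separate-table route, a minimal version of such a tracking table might look like this (the names are illustrative, chosen to match the TextMessages scripts below):

    CREATE TABLE TextMessages (
        id           BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        order_id     BIGINT UNSIGNED NOT NULL,
        phone        VARCHAR(20) NOT NULL,
        body         TEXT NOT NULL,
        SomeSmsToken VARCHAR(64) NULL, -- token the provider returns; the callback finds rows by it
        status       ENUM('pending','sent_to_provider','delivered','failed') NOT NULL DEFAULT 'pending',
        updated_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
        KEY (status),
        KEY (SomeSmsToken)
    );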

    As far as your memory concerns go, I don't think they are an issue. Just don't fetch the entire result set; fetch each row individually. This keeps MySQL from handing your script the entire data set at once, so you won't need to hold it all in memory.

    From php.net:

    As mysqli_fetch_all() returns all the rows as an array in a single step, it may consume more memory than some similar functions such as mysqli_fetch_array(), which only returns one row at a time from the result set. Further, if you need to iterate over the result set, you will need a looping construct that will further impact performance. For these reasons mysqli_fetch_all() should only be used in those situations where the fetched result set will be sent to another layer for processing.

    Edit / amendment

    To address the comment/question:

    I can't pull just one entry per cron - that would take forever... I understand I shouldn't fetch the entire result set at once too, that's what led me to ask "then how else can I do it?"

    In PHP (using mysqli with mysqlnd), running a query does not have to mean pulling all of the data into your script at once. By default mysqli buffers the full result set on the client, but if you pass MYSQLI_USE_RESULT the rows are streamed from the server as you fetch them.

    When you use fetch_all, you are asking for the entire result as one big array. When you use fetch_array (or fetch_assoc), you are asking for just the next row and telling MySQL to advance the result cursor, so that you can get the row after that. As long as you don't accumulate every row in memory (in separate variables or an array), there is no memory issue. Just use the row as you need to, then get the next one. Whether it's a cron job or not doesn't matter.
    You do not need to call the script over and over, once per row. The script handles every row in a single invocation; it just reads one row at a time to save memory.

    Here is a script example:

    $mysqli = new mysqli("host", "user", "pass", "db");
    $query = "SELECT * FROM TextMessages";

    //MYSQLI_USE_RESULT streams rows from the server instead of buffering
    //the whole result set in PHP. Note: while an unbuffered result is open,
    //this connection can't run other queries, so do your UPDATEs on a
    //second connection (or stick with the default buffered mode).
    $result = $mysqli->query($query, MYSQLI_USE_RESULT);
    while ($row = $result->fetch_assoc())
    {
        //$row is the only thing you store in memory, one single row at a time

        //go send the text message and do whatever else you need to do
        if ($row["SomeSmsToken"] == null && $row["TextHasAlreadyBeenSentOrDateSentOrWhatever"] == false)
        {
            //$someSmsToken = $myTwilioObject->SendByRow($row);
            //$this->UpdateRowToTellItThatItHasBeenSentToProviderAndIsWaitingForResponse($row, $someSmsToken);
            //..etc...
            //then go to the next row.
        }
    }
    $result->free();
    

    And then in some callback script you would do something like this.

    $mysqli = new mysqli("host", "user", "pass", "db");

    //use a prepared statement: the token arrives via HTTP, so never
    //concatenate it directly into the SQL string
    $stmt = $mysqli->prepare("SELECT * FROM TextMessages WHERE SomeSmsToken = ?");
    $stmt->bind_param("s", $_POST["SomeTokenSentFromProviderInCallback"]);
    $stmt->execute();
    $result = $stmt->get_result();
    while ($row = $result->fetch_assoc())
    {
        $someObject->UpdateRowToSayThatTheTextWasSentOrItFailed($row, $_POST["SomeStatusSentFromProviderInCallback"]);
    }

    $result->free();
    $stmt->close();
    

    You can also use mysqli_free_result when you're done, to free up any memory consumed by PHP's MySQL driver.

    From php.net:

    You should always free your result with mysqli_free_result(), when your result object is not needed anymore.

    EDIT: If you want some sort of smart way to deal with "what if the script times out", I would suggest having a cron job run every minute. When it runs, it should check whether a previous run is still going; if not, it runs. The script does its work until it times out.
    Then, within a minute, cron starts it again and, since nothing is running, it picks up wherever it left off.
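    That "check if it is already running" step can be as simple as holding an exclusive lock on a lock file. This sketch uses flock(), so the lock disappears automatically if the script dies or times out partway through (the file path is arbitrary):

    //guard a cron script against overlapping runs
    $lock = fopen("/tmp/send-sms.lock", "c");
    if (!flock($lock, LOCK_EX | LOCK_NB)) {
        //a previous run is still going; exit quietly and let cron retry in a minute
        exit(0);
    }

    //... do the row-by-row sending work here ...

    //the lock is released automatically when the process exits,
    //even if it crashed partway through
    flock($lock, LOCK_UN);
    fclose($lock);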