I'm building a predictive dialer where speed is critical. To dial the numbers, I pull customer information from tables and build call files for a PBX to act on.
Currently I have a table for each area code, and we dial one area code at a time, but we are switching to a model wherein we dial based on territories, which span multiple zip codes. Some area codes exist in multiple zip codes. Each table has new numbers added monthly and is scrubbed by comparing it against a do-not-call list of several million numbers.
So my question is, how should I most efficiently organize this data?
One large table seems counterproductive, as we're talking millions of records of scrubbed data.
My current line of reasoning is to maintain the area-code tables for importing and scrubbing, then copy the scrubbed records over to territory tables, which would be created by searching the area-code tables for the zip codes in each territory.
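In rough SQL terms, I picture that copy step looking something like this (all table, column, and value names here are just examples, not my real schema):

-- copy the scrubbed (non-DNC) records for one territory's zip codes
INSERT INTO territory_midwest (phone_number, zip, status)
SELECT phone_number, zip, status
FROM ac_515
WHERE zip IN ('50309', '50310', '50311')
  AND status <> 'do_not_call';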
I currently index the tables by an auto-incremented INT primary key, a unique phone number, and a status column that tracks numbers that have already been called or are on the do-not-call list. When building the call file, I mark the record as queued, then mark it according to how the call goes once it has completed, so for each call there is a search and two updates.
The search looks in the area-code table for a certain status. The updates occur based on the record ID.
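Concretely, the per-call pattern is roughly this (again, table names and status values are only illustrative):

-- search: find a number in the area-code table that is ready to dial
SELECT id, phone_number
FROM ac_515
WHERE status = 'new'
LIMIT 1;

-- update 1: mark the record as queued when it goes into the call file
UPDATE ac_515 SET status = 'queued' WHERE id = 12345;

-- update 2: record the outcome once the call completes
UPDATE ac_515 SET status = 'answered' WHERE id = 12345;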
The meat of the question is this: Would it be quicker to organize by zip codes and search by status, or keep them organized by area code and search by status AND zip code? Or would the better bet be to create a new table each time we set up a territory built from the area code tables?
Forgive me if this seems like a dumb question; I've been teaching myself SQL as I've been building this, and the nuances of database design and performance are a bit beyond my skill set.
The total size of the tables is at 2 million rows and growing.
The meat of the question is this: Would it be quicker to organize by zip codes and search by status, or keep them organized by area code and search by status AND zip code? Or would the better bet be to create a new table each time we set up a territory built from the area code tables?
Answer: don't do any of these unless you really know what you're doing. Instead, create one table to hold all rows of this entity, using column values to distinguish between the various zip codes and territories. Possibly create zipcodes and territory tables, and add foreign keys referencing them.
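A minimal sketch of that layout, assuming the zipcodes and territory tables above plus a single phone_numbers table (a name I'm inventing for illustration; column types are guesses as well):

CREATE TABLE territory (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE zipcodes (
    zip          CHAR(5) PRIMARY KEY,
    territory_id INT NOT NULL,
    FOREIGN KEY (territory_id) REFERENCES territory (id)
);

CREATE TABLE phone_numbers (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    phone_number VARCHAR(15) NOT NULL UNIQUE,
    area_code    CHAR(3)     NOT NULL,
    zip          CHAR(5)     NOT NULL,
    status       VARCHAR(20) NOT NULL DEFAULT 'new',
    FOREIGN KEY (zip) REFERENCES zipcodes (zip)
);

With this layout, setting up a new territory only means adding rows to zipcodes; the phone numbers themselves never need to be copied into new tables.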
Creating separate tables based on an attribute value is not a typical solution and will introduce many additional difficulties (for example, if you organize tables by zip code, how do you search by territory across all zip codes?).
The more common solution, and one which databases excel at, is to use indexes. Using multiple indexes, a database can provide fast access to a table for searches on multiple different columns.
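For example, with the single phone_numbers table sketched above, each of the searches you describe can be served by its own index (index names are made up):

-- supports "find uncalled numbers in a given zip code (or a territory's zip codes)"
CREATE INDEX idx_numbers_zip_status ON phone_numbers (zip, status);

-- supports "find uncalled numbers in a given area code"
CREATE INDEX idx_numbers_area_status ON phone_numbers (area_code, status);

-- one table, two search patterns, each backed by a different index
SELECT id, phone_number
FROM phone_numbers
WHERE zip = '50309' AND status = 'new'
LIMIT 100;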
So the basic strategy I would recommend: keep all of the numbers in one table, index the columns you search by (status, zip code, territory), and use EXPLAIN to verify that your queries actually use those indexes.
explain <query>
is very handy for that. It's also important to note that two million rows is not a huge amount for MySQL (although, of course, this depends on load). The bottom line is that optimization is a very tricky subject whose answer depends on your specific situation.
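As a concrete example of that EXPLAIN check, using the table and index names from my earlier sketches (assumptions, not your actual schema):

EXPLAIN
SELECT id, phone_number
FROM phone_numbers
WHERE zip = '50309' AND status = 'new';

-- in the output, the "key" column should show idx_numbers_zip_status;
-- if it is NULL, MySQL is scanning the whole table instead of using an index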