Tags: php, mysql, laravel, optimization, eloquent

Optimize SQL Queries on Big Table


I am working with a very large table. I have been tasked with reading all devices' logs into a database and running SELECTs to compute metrics. The current table definition is as follows:

mysql> describe device_events;
+-------------+---------------------+------+-----+-------------------+-----------------------------+
| Field       | Type                | Null | Key | Default           | Extra                       |
+-------------+---------------------+------+-----+-------------------+-----------------------------+
| id          | bigint(20) unsigned | NO   | PRI | NULL              | auto_increment              |
| device_type | varchar(255)        | NO   | MUL | NULL              |                             |
| device_id   | bigint(20) unsigned | NO   | MUL | NULL              |                             |
| message     | json                | NO   |     | NULL              |                             |
| source      | text                | NO   | MUL | NULL              |                             |
| created_at  | timestamp           | NO   | MUL | CURRENT_TIMESTAMP |                             |
| updated_at  | timestamp           | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| file_date   | date                | YES  | MUL | NULL              |                             |
+-------------+---------------------+------+-----+-------------------+-----------------------------+


Indexes:
+---------------+------------+---------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name                        | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+---------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| device_events |          0 | PRIMARY                         |            1 | id          | A         |    40932772 |     NULL | NULL   |      | BTREE      |         |               |
| device_events |          1 | device_events_device_id_index   |            1 | device_id   | A         |       44021 |     NULL | NULL   |      | BTREE      |         |               |
| device_events |          1 | device_events_device_type_index |            1 | device_type | A         |         621 |     NULL | NULL   |      | BTREE      |         |               |
| device_events |          1 | device_events_source_index      |            1 | source      | A         |        3085 |      255 | NULL   |      | BTREE      |         |               |
| device_events |          1 | device_events_created_at_index  |            1 | created_at  | A         |     2846551 |     NULL | NULL   |      | BTREE      |         |               |
| device_events |          1 | device_events_file_date_index   |            1 | file_date   | A         |       25017 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------------+------------+---------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

I was looking into partitioning and would ideally like to create a partition for each device and each log source, but I don't think I can due to MySQL's fixed partition ranges. Currently a single SELECT takes about a minute, and generating all metrics takes about 15 minutes; I would like to speed this up. Does anybody have any other ideas on how to optimize large SELECTs? I am expecting to have over a trillion records in the database. Most SELECTs will be run on events from the past 30 days and use JSON_EXTRACT on message. Note that I am already using BETWEEN on the timestamp to avoid the MONTH(created_at) calculation and have most likely optimized the queries themselves as much as I can; with this question I am primarily looking for structural optimization.

--
-- Table structure for table `device_events`
--

DROP TABLE IF EXISTS `device_events`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `device_events` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `device_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `device_id` bigint(20) unsigned NOT NULL,
  `message` json NOT NULL,
  `source` text COLLATE utf8mb4_unicode_ci NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `file_date` date DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `device_events_device_id_index` (`device_id`),
  KEY `device_events_device_type_index` (`device_type`),
  KEY `device_events_source_index` (`source`(255)),
  KEY `device_events_created_at_index` (`created_at`),
  KEY `device_events_file_date_index` (`file_date`)
) ENGINE=InnoDB AUTO_INCREMENT=42771939 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;

Eloquent ORM query


    // Scope: constrain the query to the previous calendar month.
    public function scopeLastMonth($query) {
        $lastMonth = Carbon::now()->subMonth();
        $start = $lastMonth->firstOfMonth()->startOfDay()->toDateTimeString();
        $end = $lastMonth->lastOfMonth()->endOfDay()->toDateTimeString();
        return $query->whereBetween("created_at", [$start, $end]);
    }

$topIdentitiesQuery = (clone $lastMonthEvents)
    ->selectRaw("JSON_EXTRACT(message, '$.policy_identity') as identity")
    ->selectRaw("count(*) as aggregate")
    ->groupBy("identity")
    ->orderBy("aggregate", "desc");

$topIdentities = [];
foreach ($topIdentitiesQuery->take(self::NUM_TOP_IDENTITIES)->get() as $topIdentity) {
    $topIdentities[] = $topIdentity->identity;
}
$topIdentities = array_pad($topIdentities, self::NUM_TOP_IDENTITIES, "");
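
For reference, this is roughly the SQL the query above compiles to, assuming $lastMonthEvents applies the scopeLastMonth scope and NUM_TOP_IDENTITIES is 5 (the dates and the limit are illustrative):

    SELECT JSON_EXTRACT(message, '$.policy_identity') AS identity,
           COUNT(*) AS aggregate
    FROM device_events
    WHERE created_at BETWEEN '2021-03-01 00:00:00' AND '2021-03-31 23:59:59'
    GROUP BY identity
    ORDER BY aggregate DESC
    LIMIT 5;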

Solution

  • Are most queries for a given month? Or week or date range? Then I strongly recommend building and maintaining Summary Table(s) that summarize things for each day into a small number of rows. From such a table, it should take seconds, not minutes, to build a "report".

    When you get to a trillion rows, a report is likely to take days, since it will be I/O-bound! With a summary table, it might take only minutes.

    Before you get to a trillion rows, you will need to figure out where to store the 200TB of raw data. Or, you need to perfect the summary tables and toss the raw data. Or, as a compromise, you can keep the raw data for the last, say, month, while keeping the summary tables 'forever'. This compromise may be useful if you need "details" for "recent" activity but can live with just daily summaries for "older" data.
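
    A minimal sketch of such a daily summary table, assuming the metric of interest is event counts per device and per policy identity (the table and column names here are illustrative, not prescriptive):

        CREATE TABLE device_events_daily (
            event_date  DATE NOT NULL,
            device_id   BIGINT UNSIGNED NOT NULL,
            identity    VARCHAR(255) NOT NULL,
            event_count INT UNSIGNED NOT NULL,
            PRIMARY KEY (event_date, device_id, identity)
        ) ENGINE=InnoDB;

        -- Roll up yesterday's raw rows once per day (e.g. from cron).
        INSERT INTO device_events_daily (event_date, device_id, identity, event_count)
        SELECT DATE(created_at),
               device_id,
               JSON_UNQUOTE(JSON_EXTRACT(message, '$.policy_identity')),
               COUNT(*)
        FROM device_events
        WHERE created_at >= CURDATE() - INTERVAL 1 DAY
          AND created_at <  CURDATE()
        GROUP BY 1, 2, 3
        ON DUPLICATE KEY UPDATE event_count = VALUES(event_count);

    The monthly report then scans a few thousand summary rows instead of tens of millions of raw events, and identity is a real, indexable column rather than a JSON_EXTRACT computed at query time.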

    If you go with that compromise, then PARTITION BY RANGE(TO_DAYS(..)); this makes the purging of 'old' data fast. If you don't go with the compromise, there is probably zero use for partitioning.
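
    A sketch of that scheme, assuming monthly purging. Two MySQL caveats are baked in: the partitioning column must appear in every unique key, so the primary key becomes (id, created_at); and TO_DAYS() is only allowed on DATE/DATETIME columns, so a TIMESTAMP column such as created_at has to be partitioned via UNIX_TIMESTAMP() instead (or converted to DATETIME first):

        -- Repartitioning rebuilds the table, so do it before the table gets huge.
        ALTER TABLE device_events
            DROP PRIMARY KEY,
            ADD PRIMARY KEY (id, created_at);

        ALTER TABLE device_events
            PARTITION BY RANGE (UNIX_TIMESTAMP(created_at)) (
                PARTITION p2021_03 VALUES LESS THAN (UNIX_TIMESTAMP('2021-04-01')),
                PARTITION p2021_04 VALUES LESS THAN (UNIX_TIMESTAMP('2021-05-01')),
                -- catch-all; split new months out of it with REORGANIZE PARTITION
                PARTITION pmax     VALUES LESS THAN MAXVALUE
            );

        -- Purging a month of raw data is then a fast metadata operation:
        ALTER TABLE device_events DROP PARTITION p2021_03;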

    Another benefit of moving toward summary tables as the primary source of information is that most of the indexes you now have on the "Fact" table can be dropped, thereby saving space and speeding up inserts.
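
    For example, once the reports read from the summary tables, something along these lines (keeping whichever indexes the summary build and the purge still rely on, such as created_at):

        ALTER TABLE device_events
            DROP INDEX device_events_device_type_index,
            DROP INDEX device_events_source_index,
            DROP INDEX device_events_file_date_index;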

    It looks like you are extracting fields from the JSON to produce a report? Well, that extraction can be done as you build the summary table(s), and those fields become real, indexable columns there (as in the summary-table sketch above).

    Have you estimated how fast you can insert the trillion rows? This may be a problem; let me know if you need some techniques to help with that.

    Oh, get rid of any 3rd party package in the middle; you will need to do things directly with MySQL syntax and not be limited by the features it happens to implement.

    To save some space, normalize device_type. If there are hundreds, then a SMALLINT UNSIGNED is only 2 bytes and lets you have 64K values.
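
    A sketch of that normalization; the device_types table and device_type_id column are illustrative names:

        CREATE TABLE device_types (
            device_type_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
            device_type    VARCHAR(255) NOT NULL,
            PRIMARY KEY (device_type_id),
            UNIQUE KEY (device_type)
        ) ENGINE=InnoDB;

        -- Register a type once, then store only the 2-byte id in device_events
        -- instead of repeating the VARCHAR(255) on every row.
        INSERT IGNORE INTO device_types (device_type) VALUES ('example-device-type');
        SELECT device_type_id FROM device_types WHERE device_type = 'example-device-type';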

    Do you have billions of distinct device_id values? If not, don't use an 8-byte BIGINT; a 4-byte INT UNSIGNED already allows about 4 billion.

    To save some space, consider compressing message and source in the client (the compressed bytes would then go into BLOB columns rather than json/text). By the way, how big are those values typically?

    "Prefix" indexing (INDEX(source(255))) is rarely of any use.

    And you probably don't need either created_at or updated_at. Keep in mind, an unnecessary timestamp column costs more than 5TB for your ultimate table!

    Related topics: Data Warehouse, Summary Tables, High speed ingestion, Bulk normalization, Partitioning