I have a Golang utility which reduces the number of data points per day in a table of historic data.
The records range from 20 to 400 per day, and in total there are at least 100 million records.
The utility trims the data down to n records per day for all dates prior to a given date (n can range from 1 to 300 records per day).
The method I am using is as follows:
STEP 1:
CREATE TABLE main_table_tmp LIKE main_table;
STEP 2:
ALTER TABLE main_table_tmp ADD COLUMN timekey INT;
STEP 3:
INSERT INTO main_table_tmp
SELECT * FROM (
SELECT *,FLOOR(UNIX_TIMESTAMP(column_name)/((1440/2)*60)) AS timekey
FROM main_table
WHERE column_name <= '2018-01-01'
GROUP BY timekey
) m
UNION ALL
(SELECT * ,0 As timekey FROM main_table where column_name > 'date') ;
STEP 4:
ALTER TABLE main_table_tmp DROP COLUMN timekey;
DROP TABLE main_table;
RENAME TABLE main_table_tmp TO main_table;
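For reference on the divisor: (1440/2)*60 is 43200 seconds, i.e. a 12-hour bucket, which corresponds to n = 2 records per day. A minimal Go sketch of how that expression can be built for an arbitrary n (the function and variable names here are only illustrative, not part of my utility, and it assumes "fmt" is imported):

// buildTimekeyExpr returns the bucketing expression used in STEP 3:
// each bucket is (1440/n) minutes wide, converted to seconds, so the
// GROUP BY keeps roughly n rows per day.
func buildTimekeyExpr(column string, n int) string {
	bucketSeconds := (1440 / n) * 60 // n = 2 -> 43200 s, i.e. 12-hour buckets
	return fmt.Sprintf("FLOOR(UNIX_TIMESTAMP(%s)/%d)", column, bucketSeconds)
}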
I am executing the above statements from Go:
func somefuncname() {
	----
	----
	----
	q := "CREATE TABLE " + *tablename + "_tmp LIKE " + *tablename + ";"
	// Exec, not Query: these statements return no rows
	_, err := db.Exec(q)
	if err != nil {
		fmt.Println(err)
	}
	//--ALTER ADD timekey
	//--INSERT INTO SELECT *....
	//--ALTER DROP timekey, DROP table and rename
}
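A simplified outline of how the whole sequence is driven (this is only a sketch, not my exact code; it assumes a *sql.DB named db and imports of "database/sql" and "fmt", and the INSERT statement from STEP 3 is abbreviated):

// runTrim executes each statement in order and stops on the first error.
func runTrim(db *sql.DB, table string) error {
	stmts := []string{
		"CREATE TABLE " + table + "_tmp LIKE " + table,
		"ALTER TABLE " + table + "_tmp ADD COLUMN timekey INT",
		// the INSERT ... SELECT from STEP 3 goes here
		"ALTER TABLE " + table + "_tmp DROP COLUMN timekey",
		"DROP TABLE " + table,
		"RENAME TABLE " + table + "_tmp TO " + table,
	}
	for _, q := range stmts {
		if _, err := db.Exec(q); err != nil {
			return fmt.Errorf("%s: %w", q, err)
		}
	}
	return nil
}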
The current response time of these queries is very slow.
Some of the results:
Total records: 2 million
Execution time: 180 seconds
This is on a machine with 16 GB of RAM; it is even slower when deployed on a lower-grade system.
Steps I have taken to resolve this:
Looked into the indexes on all the tables. Tried removing the indexes and running the utility; that made it faster by only about 5 seconds, which is not much.
Executed the utility in stages: if the total record count crosses 1 million, run the utility on 1 million records at a time (a rough sketch of that staged approach follows below).
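Roughly, the staged run looks like the sketch below. Here I split the work by date window (which approximates the 1-million-row batches); buildInsert is an assumed helper that produces the STEP 3 INSERT restricted to one window, and "database/sql" and "time" are assumed to be imported:

// insertInChunks runs the big INSERT ... SELECT one date window at a time
// instead of as a single statement.
func insertInChunks(db *sql.DB, from, to time.Time, window time.Duration,
	buildInsert func(from, to time.Time) string) error {
	for cur := from; cur.Before(to); cur = cur.Add(window) {
		end := cur.Add(window)
		if end.After(to) {
			end = to
		}
		if _, err := db.Exec(buildInsert(cur, end)); err != nil {
			return err
		}
	}
	return nil
}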
But after all these efforts it looks like the main problem is in the query itself.
It is just not fast enough; I need a way to make the query more efficient.
Any help appreciated, thank you!
Why are we adding timekey
and then dropping it? Adding it to an empty table is fast, but dropping it from the table after it's populated is effectively an extra copy of the whole table. That's unnecessary work if we don't actually need the column.
We can do a GROUP BY
on an expression; the expression doesn't have to appear in the SELECT list. For example:
SELECT t.*
FROM main_table t
WHERE t.column_name <= '2018-01-01'
GROUP
BY FLOOR(UNIX_TIMESTAMP(t.column_name)/((1440/2)*60))
(Note that this query relies on a MySQL-specific extension to GROUP BY; it will throw an error if ONLY_FULL_GROUP_BY
is included in sql_mode, because that setting disables the extension that allows the query to run.)
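If ONLY_FULL_GROUP_BY is in your sql_mode, one option is to clear it for just the session that runs this statement. A sketch of doing that from Go; note that database/sql pools connections, so the session variable has to be set on the same connection that later runs the query (assumes imports of "context" and "database/sql"):

// relaxSQLMode clears ONLY_FULL_GROUP_BY on a single pooled connection and
// returns that connection so the GROUP-BY query can be run on it.
func relaxSQLMode(ctx context.Context, db *sql.DB) (*sql.Conn, error) {
	conn, err := db.Conn(ctx)
	if err != nil {
		return nil, err
	}
	// REPLACE strips the flag from the current mode; all other flags are kept.
	_, err = conn.ExecContext(ctx,
		"SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''))")
	if err != nil {
		conn.Close()
		return nil, err
	}
	return conn, nil
	// run the INSERT ... SELECT with conn.ExecContext, then conn.Close()
}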
Without some table definitions (including storage engine, column datatypes, indexes) and without EXPLAIN output, we're just guessing.
But some suggestions:
Drop the secondary indexes on the empty table being populated, and add them after the table is loaded.
I'd also avoid the UNION. Both SELECT statements filter on column_name
, just on opposite sides of the cutoff date, so we can simply run them as two separate INSERT ... SELECT statements.
CREATE TABLE main_table_tmp LIKE main_table
;
-- for performance, remove secondary indexes, leave just the cluster index
ALTER TABLE main_table_tmp
DROP INDEX noncluster_index_1
, DROP INDEX noncluster_index_2
, ...
;
-- for performance, have a suitable index available on main_table
-- with `column_name` as the leading column
INSERT INTO main_table_tmp
SELECT h.*
FROM main_table h
WHERE h.column_name <= '2018-01-01'
GROUP
BY FLOOR(UNIX_TIMESTAMP(h.column_name)/((1440/2)*60))
;
-- for performance, have a suitable index available on main_table
-- with `column_name` as the leading column
INSERT INTO main_table_tmp
SELECT c.*
FROM main_table c
WHERE c.column_name > '2018-01-01'
;
-- add secondary indexes
ALTER TABLE main_table_tmp
ADD UNIQUE INDEX noncluster_index_1 (fee,fi,fo)
, ADD INDEX noncluster_index_2 (fum)
, ...
;
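The script above stops after the indexes are rebuilt; you still need the swap from your STEP 4. One small improvement there: RENAME TABLE can swap several tables in one atomic statement, so there is no window where main_table doesn't exist, and the old data stays around until you've verified the result. A sketch of that last step from Go (assumes a *sql.DB named db and an import of "database/sql"):

// swapTables puts the rebuilt table in place atomically and keeps the old
// data around as main_table_old until the result has been verified.
func swapTables(db *sql.DB) error {
	// Single atomic statement: readers never see main_table missing.
	if _, err := db.Exec(
		"RENAME TABLE main_table TO main_table_old, main_table_tmp TO main_table"); err != nil {
		return err
	}
	// Once verified: db.Exec("DROP TABLE main_table_old")
	return nil
}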