Currently I have around 900,000 entries in the data_article_key_terms table to associate key terms to their respective articles. The goal is to be able to select an arbitrary date range and display the top 15 key terms based on the articles in that date range.
The problem that I'm running in to is that the query that I'm running takes almost 6 seconds, but I need it to be faster than that. I realize that this is relative based on the system that I'm running on and I could use a machine with more power, but I'm trying to optimize it the best I can before I go that route.
I'm using InnoDB as the MySQL storage engine to preserve data integrity. As I understand it MyISAM is faster with count(*), but using that engine is also not an option.
I've also considered storing the key term counts in a table based on fixed time ranges, but that ends up being a lot of data to store and keep track of.
Does anyone have a good suggestion on how to optimize this experience?
I have the following tables:
This table stores article information:
CREATE TABLE `data_article` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`connection_id` int(11) NOT NULL,
`folder_id` int(11) NOT NULL,
`user_id` int(11) NOT NULL,
`uid` varchar(100) NOT NULL,
`date` date NOT NULL,
`influencer_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `data_article_5930b15a` (`connection_id`),
KEY `data_article_4e5f642` (`folder_id`),
KEY `data_article_fbfc09f1` (`user_id`),
KEY `data_article_43ae76a1` (`influencer_id`),
KEY `data_article_date` (`date`),
CONSTRAINT `connection_id_refs_id_b2ae9152` FOREIGN KEY (`connection_id`) REFERENCES `account_connection` (`id`),
CONSTRAINT `folder_id_refs_id_e343586a` FOREIGN KEY (`folder_id`) REFERENCES `account_folder` (`id`),
CONSTRAINT `influencer_id_refs_id_45cd3615` FOREIGN KEY (`influencer_id`) REFERENCES `data_influencer` (`id`),
CONSTRAINT `user_id_refs_id_aca13cc9` FOREIGN KEY (`user_id`) REFERENCES `auth_user` (`id`)
)
This table stores key terms:
CREATE TABLE `data_keyterm` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`term` varchar(100) NOT NULL,
PRIMARY KEY (`id`),
KEY `data_keyterm_term` (`term`)
)
This table stores the relationship between articles and key terms:
CREATE TABLE `data_article_key_terms` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`article_id` int(11) NOT NULL,
`keyterm_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `article_id` (`article_id`,`keyterm_id`),
KEY `data_article_key_terms_30525a19` (`article_id`),
KEY `data_article_key_terms_1d848ca4` (`keyterm_id`),
CONSTRAINT `article_id_refs_id_d87be8f5` FOREIGN KEY (`article_id`) REFERENCES `data_article` (`id`),
CONSTRAINT `keyterm_id_refs_id_50d233f8` FOREIGN KEY (`keyterm_id`) REFERENCES `data_keyterm` (`id`)
)
This table stores influencers that are associated with the articles:
CREATE TABLE `data_influencer` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL,
`title` varchar(100) NOT NULL,
`email` varchar(100) NOT NULL,
`active` tinyint(1) NOT NULL,
`user_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `data_influencer_fbfc09f1` (`user_id`),
KEY `data_influencer_name` (`name`),
CONSTRAINT `user_id_refs_id_b1bb5d4f` FOREIGN KEY (`user_id`) REFERENCES `auth_user` (`id`)
)
This is the SQL statement I'm using to pull the keywords based on a time range, group them, and order them by frequency:
SELECT dk.id, dk.term as term, COUNT(dk.id) as count
FROM data_keyterm dk
INNER JOIN data_article_key_terms dakt ON dakt.keyterm_id = dk.id
INNER JOIN data_article da ON da.id = dakt.article_id
INNER JOIN data_influencer di ON di.id = da.influencer_id
WHERE da.user_id = 1
AND da.date between '2010-08-07' AND '2012-08-07'
AND di.active = True
GROUP BY dk.id
ORDER BY count DESC
LIMIT 15;
Running the inner join with a table with 900,000 records and 3 inner join will be take some time to execute. I think you should try some external search engines like solar to obtain the results in quick time