Business logic:
My application crawls a lot (hundreds or sometimes thousands) of webpages every few hours and stores all the links (i.e. all anchor tags) on each webpage in a MySQL database table, say `links`. This table is growing very big day by day (already around 20 million records as of now).
Technical:
I have a composite unique index on [`webpage_id`, `link`] in the `links` table. Also, I have a column `crawl_count` in the same table.
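Roughly, the schema looks like this (the exact column types here are illustrative, not my real definitions):

```sql
CREATE TABLE links (
    webpage_id  INT UNSIGNED NOT NULL,           -- FK to webpages
    link        VARCHAR(255) NOT NULL,           -- the anchor href
    crawl_count INT UNSIGNED NOT NULL DEFAULT 1,
    UNIQUE KEY uq_webpage_link (webpage_id, link)
) ENGINE=InnoDB;
```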
Now whenever I crawl a webpage, I already know the `webpage_id` (the foreign key to the `webpages` table) and I get the links on that webpage (i.e. an array of `link` values), which I just insert or update without worrying about what is already in the table:

INSERT INTO ........ ON DUPLICATE KEY UPDATE crawl_count=crawl_count+1
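Spelled out with the column list filled in (illustrative names, matching the schema sketch above):

```sql
INSERT INTO links (webpage_id, link, crawl_count)
VALUES (?, ?, 1)
ON DUPLICATE KEY UPDATE crawl_count = crawl_count + 1;
```

The unique index on (`webpage_id`, `link`) is what triggers the `ON DUPLICATE KEY` branch for rows that already exist.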
Problem:
The table grows bigger every day and I want to optimize it for performance. Options I considered:

- Denormalizing the `webpages` table to something like `{webpage: "http://example.com/new-brands", links: [link1, link2, link3]}` and migrating that table to DynamoDB, so I don't need a `links` table at all. But again, there is a size limit for every record in DynamoDB (400 KB). What if a record exceeds this limit?

I have read the pros & cons of each approach. As far as my understanding goes, DynamoDB doesn't seem to be a good fit for my situation. But I still wanted to post this question so I can make a good decision for this scenario.
`PARTITION BY domain` -- No. There won't be any performance gain. Anyway, you will find that one domain dominates the table, and a zillion domains show up only once. (I'm speaking from experience.)
The only concept of an "array" is a separate table. It would have, in your case, `webpage_id` and `link` as a 2-column `PRIMARY KEY` (which is 'unique'). See the sketch after the next point.
Normalize. This is to avoid having lots of copies of each domain and each link. This saves some space.
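A minimal sketch of both points together (all names and types here are placeholders, not a prescription):

```sql
-- Each distinct URL is stored once, split into domain + path
-- so the domain string is not repeated for every link.
CREATE TABLE urls (
    url_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    domain  VARCHAR(255) NOT NULL,
    path    VARCHAR(255) NOT NULL,
    PRIMARY KEY (url_id),
    UNIQUE KEY uq_domain_path (domain, path)
) ENGINE=InnoDB;

-- The "array": one row per (page, outgoing link), with the
-- 2-column PRIMARY KEY providing the uniqueness constraint.
CREATE TABLE webpage_links (
    webpage_id INT UNSIGNED NOT NULL,
    url_id     INT UNSIGNED NOT NULL,
    PRIMARY KEY (webpage_id, url_id)
) ENGINE=InnoDB;
```

Two 4-byte integers per row are far cheaper to store and index than a full URL string repeated 20 million times.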
I assume you have two categories of links -- the ones for pages you have scanned, and the ones for pages waiting to scan. And probably the two sets are similar in size. I don't understand the purpose of `crawl_count`, but it adds to the cost.
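If that guess is right, one cheap way to keep the two sets apart is a nullable timestamp on the page row rather than anything per-link (again, just a sketch):

```sql
ALTER TABLE webpages ADD COLUMN last_crawled_at DATETIME NULL;  -- NULL = not yet scanned

-- Pages still waiting to be scanned:
SELECT webpage_id FROM webpages WHERE last_crawled_at IS NULL;
```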
I may be able to advise further if I could see the queries -- both inserting and selecting. Also, how big are the tables (GB) and what is the value of `innodb_buffer_pool_size`? Putting these together, we can discuss likely points of sluggishness.
Also the slowlog would help.
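For gathering those numbers, something along these lines works:

```sql
-- Current buffer pool size (bytes)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Approximate data + index size per table, in GB
SELECT table_name,
       ROUND((data_length + index_length) / POW(1024, 3), 2) AS total_gb
FROM information_schema.tables
WHERE table_schema = 'your_db'   -- replace with your schema name
ORDER BY total_gb DESC;

-- Whether the slow query log is enabled, and where it writes
SHOW VARIABLES LIKE 'slow_query_log%';
```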
Are you dealing with non-ASCII URLs? URLs too long to index? Do you split URLs into domain + path? Do you strip off "#..."? And "?..."?
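If not, the trimming is easy to do before (or during) insert; for example, dropping the fragment and the query string (placeholder names):

```sql
-- Everything before the first '#', then everything before the first '?'.
-- SUBSTRING_INDEX returns the whole string when the delimiter is absent.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(url, '#', 1), '?', 1) AS canonical_url
FROM staging_links;   -- placeholder table
```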