Business logic:
My application crawls a lot (hundreds or sometimes thousands) of webpages every few hours and stores all the links (i.e. all anchor tags) on each webpage in a MySQL database table, say `links`. This table is growing very big day by day (already around 20 million records as of now).
Technical:
I have a composite unique index on [`webpage_id`, `link`] in the `links` table. Also, I have a column `crawl_count` in the same table.
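Roughly, the schema looks like this (the exact column types here are illustrative, not my real definitions):

```sql
CREATE TABLE links (
    webpage_id  INT UNSIGNED NOT NULL,           -- FK to webpages
    link        VARCHAR(255) NOT NULL,           -- the anchor href
    crawl_count INT UNSIGNED NOT NULL DEFAULT 1,
    UNIQUE KEY uq_webpage_link (webpage_id, link)
) ENGINE=InnoDB;
```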
Now whenever I crawl a webpage, I already know the `webpage_id` (the foreign key to the `webpages` table) and I get the links on that webpage (i.e. an array of `link` values), which I just insert or update without worrying about what is already in the table:

INSERT INTO ........ ON DUPLICATE KEY UPDATE crawl_count=crawl_count+1
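Spelled out with the column list filled in (illustrative names, matching the schema sketch above):

```sql
INSERT INTO links (webpage_id, link, crawl_count)
VALUES (?, ?, 1)
ON DUPLICATE KEY UPDATE crawl_count = crawl_count + 1;
```

The unique index on (`webpage_id`, `link`) is what triggers the `ON DUPLICATE KEY` branch for rows that already exist.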
Problem:
The table grows bigger every day and I want to optimize it for performance. Options I considered:

- Denormalizing the `webpages` table to something like `{webpage: "http://example.com/new-brands", links: [link1, link2, link3]}` and migrating that table to DynamoDB, so I don't need a `links` table at all. But again, there is a size limit for every record in DynamoDB (400 KB). What if a record exceeds this limit?

I have read the pros & cons of each approach. As far as my understanding goes, DynamoDB doesn't seem to be a good fit for my situation. But I still wanted to post this question so I can make a good decision for this scenario.
`PARTITION BY domain` -- No. There won't be any performance gain. Anyway, you will find that one domain dominates the table, and a zillion domains show up only once. (I'm speaking from experience.)
The only concept of an "array" is a separate table. It would have, in your case, `webpage_id` and `link` as a 2-column `PRIMARY KEY` (which is 'unique'). See the sketch after the next point.
Normalize. This is to avoid having lots of copies of each domain and each link. This saves some space.
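A minimal sketch of both points together (all names and types here are placeholders, not a prescription):

```sql
-- Each distinct URL is stored once, split into domain + path
-- so the domain string is not repeated for every link.
CREATE TABLE urls (
    url_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    domain  VARCHAR(255) NOT NULL,
    path    VARCHAR(255) NOT NULL,
    PRIMARY KEY (url_id),
    UNIQUE KEY uq_domain_path (domain, path)
) ENGINE=InnoDB;

-- The "array": one row per (page, outgoing link), with the
-- 2-column PRIMARY KEY providing the uniqueness constraint.
CREATE TABLE webpage_links (
    webpage_id INT UNSIGNED NOT NULL,
    url_id     INT UNSIGNED NOT NULL,
    PRIMARY KEY (webpage_id, url_id)
) ENGINE=InnoDB;
```

Two 4-byte integers per row are far cheaper to store and index than a full URL string repeated 20 million times.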
I assume you have two categories of links -- the ones for pages you have scanned, and the ones for pages waiting to scan. And probably the two sets are similar in size. I don't understand the purpose of `crawl_count`, but it adds to the cost.
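If that guess is right, one cheap way to keep the two sets apart is a nullable timestamp on the page row rather than anything per-link (again, just a sketch):

```sql
ALTER TABLE webpages ADD COLUMN last_crawled_at DATETIME NULL;  -- NULL = not yet scanned

-- Pages still waiting to be scanned:
SELECT webpage_id FROM webpages WHERE last_crawled_at IS NULL;
```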
I may be able to advise further if I could see the queries -- both inserting and selecting. Also, how big are the tables (GB) and what is the value of `innodb_buffer_pool_size`? Putting these together, we can discuss likely points of sluggishness.
Also the slowlog would help.
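For gathering those numbers, something along these lines works:

```sql
-- Current buffer pool size (bytes)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Approximate data + index size per table, in GB
SELECT table_name,
       ROUND((data_length + index_length) / POW(1024, 3), 2) AS total_gb
FROM information_schema.tables
WHERE table_schema = 'your_db'   -- replace with your schema name
ORDER BY total_gb DESC;

-- Whether the slow query log is enabled, and where it writes
SHOW VARIABLES LIKE 'slow_query_log%';
```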
Are you dealing with non-ASCII URLs? URLs too long to index? Do you split URLs into domain + path? Do you strip off "#..."? And "?..."?
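If not, the trimming is easy to do before (or during) insert; for example, dropping the fragment and the query string (placeholder names):

```sql
-- Everything before the first '#', then everything before the first '?'.
-- SUBSTRING_INDEX returns the whole string when the delimiter is absent.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(url, '#', 1), '?', 1) AS canonical_url
FROM staging_links;   -- placeholder table
```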