Search code examples
mysqldatabasesearch-enginedatabase-normalization

Which database structure to use for website search engine


EDIT - Database Design C

Based on @duskwuff's solution. I believe this is also what is referred to as a 2NF normalised database design.

My concern now is over the fact that there are duplicate column values in this design (the same Keyword like Apricot appears in multiple rows). As more values are added, will the database having too many rows impact speed or performance?

enter image description here enter image description here


Original post

I am building a website search engine using PHP and mySQL. The goal is to search through a fruits catalogue with the ability to find fruits based on various search terms by the user.

For example, if a user searches for 'Fruit that starts with a', Apples, Apricots etc should come out. If a user searches for 'Fruits with a sweet taste', sweet tasting fruits like Grape, Cranberry, Apricot etc should come out.

In Design A, we have all the fruit names listed as columns. Under the columns would be the search terms that would lead into the fruit. So under Apple, you would find column values: 'starts with a' and 'sweet'

In Design B, we have the fruit names listed instead as column values under a column titled Keyword. The search terms for each respective fruits are the respective column values of SearchTerm1, SearchTerm2, and so on.

Which database design would perform better in terms of speed and scalability?


Database Design A

enter image description here enter image description here


Database Design B

enter image description here enter image description here


Solution

  • You have a many to many relationship. One fruit can have many associated search terms and one search term may have many fruits associated with it.

    Ex: {Apricot -> Sweet taste, starts with a, orange} and {Sweet taste -> apricot, mango} etc.

    The way to do many to many relationship in databases is to introduce a table in the middle.

    Let there be a fruit table which has all the fruits. Introduce a synthetic primary key in both tables (if they don't have a naturally occurring primary key) to uniquely identify each row. Then create the middle table and for every pair of <fruit, search term> add a row in it. Create an index on the <pk1, pk2> of this table to keep it as fast as possible.

    I found the following image after googling. It explains the idea.

    enter image description here

    In this table, a brand of beer can have multiple distributors and a distributor can be distributing multiple brands of beer. So we have the middle table beer_distributor which is essentially a table of the primary keys of the other two tables viz. beer and distributor.