Search code examples
phpmysqljsonindexinginnodb

Best text searching logic on PHP and MySql index


Trying to design a indexing and searching system for my CMS.

What I had done till now is :

Three columns in Database Table (POST table) (InnoDB - utf8mb4_unicode_ci).

  1. Original content (JSON Format).

  2. Tokens (Getting all text from JSON, removing stopwords and then saving tokens to that column), also enabling indexing on that column.

  3. Tokens (directly coming from a tagging field, stackoverflow like tagging.), also indexing that field.

My CMS is working on that approach, but that increasing the index and table size due to tokens produced from original content.

Is there ant way to improve?

I am coming from JAVA background (was using Lucane). Its hard for me to make a dictionary of all stopword. If anybody know about pre made API os script for stopword removal.


Solution

  • For a tag system, like stack overflow or many other systems I would use 3 tables.

    • The main table with your json
    • A tags table
    • A junction or bridge table for a many to many relationship.

    With a separate tag table, you can search it using IN or just =. You can use the table as an auto-complete. You "normalize" your data, reducing it. etc... It does add a few Joins and a bit of complexity to the queries but I would say it's still fairly trivial.

    Also I have written a Text input plugin in Jquery to recognize tags within a text field. It's largely un-documented but your welcome to play around with it.

    https://github.com/ArtisticPhoenix/jQuery-Plugins/tree/master/jQuery-Plugins/jqWall

    It was for a small project where we needed a "twitter" like wall with hashtags, so it can be setup to recognize and autocomplete as a user types tags that start with # for example. Allowing a user to auto-complete text with the tags and reduce errors or misspelled tags from occurring. You can define what you want to match and it has a few callback for other functionality. It's pretty simple to use.

    You can see in this fiddle the basic setup ( it's pretty standard Many To Many )

    https://www.db-fiddle.com/f/2WSaLjtnuDrZE2CcVhAe9S/1

    Some other functionality this gives you is for example this query.

    SELECT
        t.*
    FROM
        tags AS t
    LEFT JOIN
        posts_tags AS pt ON t.id = pt.tag_id
    WHERE
        pt.tag_id IS NULL;
    

    Which will select all the tags not used in a post. As well as it's very easy to count how many posts have a given tag.

    SELECT
        count(p.id) AS total
    FROM
        posts AS p
    JOIN
        posts_tags AS pt ON p.id = pt.post_id
    JOIN
        tags AS t ON pt.tag_id = t.id
    WHERE
        tag = 'Programming';
    

    This might look complicated if your not used to how the relationships like this work and using Joins. But consider counting how many posts have it when you have a list of keywords. You will be using keywords LIKE '%Java%' and it will probably hurt your performance ( whenever you use %word preceding wildcards ). Also of note is if you have keywords like Javascript you will count it with this query and Java is not at all like Javascript so it can be an impossible task to accurately count how many posts you have using partial string matches.

    The only thing I would suggest if you do go with a single field is to use a delimiter, and include it to both the front and back of the contents so like this.

     |tag1|tag2|tag|tag10|
    

    The reason for this is then you can include the delimiter in your search, consider this partial query

    WHERE tag LIKE "%tag%"
    

    Now if you search for that you would find all the tags, but because we have our delimiters, you can do this

    WHERE tag LIKE "%|tag|%"
    

    Which would limit the match. But for it to work you need the delimiter at the start and end of the list. consider this tag1|tag10|tag without those we can't match |tag| with %|tag|% so it all falls apart. This would probably work ok for a system that has light use, but separate table are still superior, because of the nature of indexes and how they handle wild card searches.

    console.log("type any of these 'Java', 'JavaScript', 'Programming', 'PHP' and press enter to select from list, or use the mouse.");
    console.log("press enter to simulate submission, logs contents.");
    
    $('#test').jqWall({
      id: 'jqWall',
      autoComplete: {
        cache: ['Java', 'JavaScript', 'Programming', 'PHP'],
        match: /([^\s]+)$/, //$ ends with is required
        search : function(term, matches, callback){
            if(matches.lenght == 0){
            //get matches array from server by AJAX using `term`
            }
            callback( matches ); //requred
     
          },
      },
      submit : function(wrapper){
        //press enter to submit
        var contents = wrapper.find('textarea').val();
        //you could post contents back to server to save.
        console.log(contents);
      }
    });
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
    <script src="https://rawgit.com/ArtisticPhoenix/jQuery-Plugins/master/jQuery-Plugins/jqWall/jqWall.js"></script>
    <link href="https://rawgit.com/ArtisticPhoenix/jQuery-Plugins/master/jQuery-Plugins/jqWall/jqWall.css" rel="stylesheet" />
    <style type="text/css">
      #jqWall textarea {
        width: 400px;
        height: 200px;
      }
    </style>
    
    
    <div id="test" style=""></div>