Trying to design a indexing and searching system for my CMS.
What I had done till now is :
Three columns in Database Table (POST table) (InnoDB - utf8mb4_unicode_ci).
Original content (JSON Format).
Tokens (Getting all text from JSON, removing stopwords and then saving tokens to that column), also enabling indexing on that column.
Tokens (directly coming from a tagging field, stackoverflow like tagging.), also indexing that field.
My CMS is working on that approach, but that increasing the index and table size due to tokens produced from original content.
Is there ant way to improve?
I am coming from JAVA background (was using Lucane). Its hard for me to make a dictionary of all stopword. If anybody know about pre made API os script for stopword removal.
For a tag
system, like stack overflow or many other systems I would use 3 tables.
With a separate tag table, you can search it using IN
or just =
. You can use the table as an auto-complete. You "normalize" your data, reducing it. etc... It does add a few Joins and a bit of complexity to the queries but I would say it's still fairly trivial.
Also I have written a Text input plugin in Jquery to recognize tags within a text field. It's largely un-documented but your welcome to play around with it.
https://github.com/ArtisticPhoenix/jQuery-Plugins/tree/master/jQuery-Plugins/jqWall
It was for a small project where we needed a "twitter" like wall with hashtags, so it can be setup to recognize and autocomplete as a user types tags that start with #
for example. Allowing a user to auto-complete text with the tags and reduce errors or misspelled tags from occurring. You can define what you want to match and it has a few callback for other functionality. It's pretty simple to use.
You can see in this fiddle the basic setup ( it's pretty standard Many To Many )
https://www.db-fiddle.com/f/2WSaLjtnuDrZE2CcVhAe9S/1
Some other functionality this gives you is for example this query.
SELECT
t.*
FROM
tags AS t
LEFT JOIN
posts_tags AS pt ON t.id = pt.tag_id
WHERE
pt.tag_id IS NULL;
Which will select all the tags not used in a post. As well as it's very easy to count how many posts have a given tag.
SELECT
count(p.id) AS total
FROM
posts AS p
JOIN
posts_tags AS pt ON p.id = pt.post_id
JOIN
tags AS t ON pt.tag_id = t.id
WHERE
tag = 'Programming';
This might look complicated if your not used to how the relationships like this work and using Joins. But consider counting how many posts have it when you have a list of keywords. You will be using keywords LIKE '%Java%'
and it will probably hurt your performance ( whenever you use %word
preceding wildcards ). Also of note is if you have keywords like Javascript
you will count it with this query and Java
is not at all like Javascript
so it can be an impossible task to accurately count how many posts you have using partial string matches.
The only thing I would suggest if you do go with a single field is to use a delimiter, and include it to both the front and back of the contents so like this.
|tag1|tag2|tag|tag10|
The reason for this is then you can include the delimiter in your search, consider this partial query
WHERE tag LIKE "%tag%"
Now if you search for that you would find all the tags, but because we have our delimiters, you can do this
WHERE tag LIKE "%|tag|%"
Which would limit the match. But for it to work you need the delimiter at the start and end of the list. consider this tag1|tag10|tag
without those we can't match |tag|
with %|tag|%
so it all falls apart. This would probably work ok
for a system that has light use, but separate table are still superior, because of the nature of indexes and how they handle wild card searches.
console.log("type any of these 'Java', 'JavaScript', 'Programming', 'PHP' and press enter to select from list, or use the mouse.");
console.log("press enter to simulate submission, logs contents.");
$('#test').jqWall({
id: 'jqWall',
autoComplete: {
cache: ['Java', 'JavaScript', 'Programming', 'PHP'],
match: /([^\s]+)$/, //$ ends with is required
search : function(term, matches, callback){
if(matches.lenght == 0){
//get matches array from server by AJAX using `term`
}
callback( matches ); //requred
},
},
submit : function(wrapper){
//press enter to submit
var contents = wrapper.find('textarea').val();
//you could post contents back to server to save.
console.log(contents);
}
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://rawgit.com/ArtisticPhoenix/jQuery-Plugins/master/jQuery-Plugins/jqWall/jqWall.js"></script>
<link href="https://rawgit.com/ArtisticPhoenix/jQuery-Plugins/master/jQuery-Plugins/jqWall/jqWall.css" rel="stylesheet" />
<style type="text/css">
#jqWall textarea {
width: 400px;
height: 200px;
}
</style>
<div id="test" style=""></div>