php mysql database-design autocomplete tagging

Dealing with tags


We are planning on using a tagging system similar to what is implemented on this very site.

We have the actual tagging front end and the autocomplete working.

But I am a bit confused as to the best way to handle it on the back end.

Basically, when we get the tags on the backend we end up with an array that looks like:

array(
  array(
    'value' => 1,
    'label' => 'First Tag'
  ),
  array(
    'value' => 2,
    'label' => 'Second Tag'
  ),
  array(
    'value' => 'Third Tag',
    'label' => 'Third Tag'
  ),
  array(
    'value' => 3,
    'label' => 'Fourth Tag'
  ),
)

The tagging plugin also receives the same array format, json_encode()'d, via ajax when it autocompletes. It displays the label and stores the id so it can send it back.

So, tags with the values 1, 2, and 3 are tags that were selected from the autocomplete.
The tag with the value of Third Tag was not selected from the autocomplete: it was typed in manually, and may or may not already exist in the database.

Now there is the chance that a user could actually create a tag that happens to be a number, hence

array(
  'value' => 3,
  'label' => 3
)

could come through without already existing, so we can't just assume that if value is an int then it already exists.

So, the first part of this question is, how do I manage this so I do not end up with duplicate tags?

My current approach is that when the tagging plugin requests tags via autocomplete, I send back an array like

(term = 'pin')

array(
  array(
    'value' => '||1',
    'label' => 'pink'
  ),
  array(
    'value' => '||4',
    'label' => 'pin cushion'
  )
)

and then on the back end just assume that any tags that have a value starting with || have come from the autocomplete and already exist.

Then we query the database for all tags. For each of the remaining tags, we check whether its value matches one of the existing labels: if it does, we leave the tag as is; if it doesn't, we create it. We then swap the value for the tag's id in the original array.
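For reference, the approach described above could be sketched roughly like this. The function name resolveSubmittedTags, the $existing lookup array (standing in for the "all tags" query), and the $create callback are all hypothetical; the case-insensitive match on labels is also an assumption. Note that a manually typed label that itself starts with || would be misread as an id, which is part of why the prefix feels fragile.

```php
<?php
// Hypothetical helper: resolves submitted tags to ids using the "||" prefix convention.
// $existing maps lowercase label => id (a stand-in for the "all tags" query).
// $create is a callable that persists a new label and returns its new id.
function resolveSubmittedTags(array $submitted, array $existing, callable $create): array
{
    $ids = [];
    foreach ($submitted as $tag) {
        $value = (string) $tag['value'];
        if (strncmp($value, '||', 2) === 0) {
            // Came from the autocomplete: the rest of the value is the id.
            $ids[] = (int) substr($value, 2);
            continue;
        }
        // Typed manually: match on the label, create only if it is unknown.
        $label = trim($value);
        $key = strtolower($label);
        if (!isset($existing[$key])) {
            $existing[$key] = $create($label);
        }
        $ids[] = $existing[$key];
    }
    // Drop repeats so the same tag is never linked twice.
    return array_values(array_unique($ids));
}

// Usage with made-up data; the callback just hands out the next free id.
$nextId = 10;
$ids = resolveSubmittedTags(
    [
        ['value' => '||1', 'label' => 'pink'],
        ['value' => 'Pin Cushion', 'label' => 'Pin Cushion'],
        ['value' => '3', 'label' => '3'],
    ],
    ['pink' => 1, 'pin cushion' => 4],
    function (string $label) use (&$nextId): int {
        return $nextId++;
    }
);
// $ids is [1, 4, 10]
```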

But that feels hacky to me: it means we are using a filler item (||). Is there a more elegant way of doing this?

The next part of the question is actually linking these tags to an item. This is more in the context of editing a question on this site.

Some tags are already linked with a question. How do you handle it so that you don't end up with duplicate tag references on a question?

I see two options so far:

Remove all links to tags from the question, then insert them all again (2 queries),
or
query the database for all tags connected to the question, loop over the submitted array removing those tags from it, then insert the remainder (2 queries).
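To make the second option concrete, here is a minimal sketch of the comparison step in plain PHP. The function name diffTagLinks is hypothetical, and it assumes both sides have already been resolved to numeric tag ids. As described, the option only covers additions; the sketch also works out which links were removed, which is an addition on my part.

```php
<?php
// Hypothetical helper: given the tag ids currently linked to a question and
// the ids submitted by the form, work out which links to insert and delete.
function diffTagLinks(array $currentIds, array $submittedIds): array
{
    $current = array_unique(array_map('intval', $currentIds));
    $submitted = array_unique(array_map('intval', $submittedIds));

    return [
        // Submitted but not yet linked.
        'insert' => array_values(array_diff($submitted, $current)),
        // Linked but no longer submitted.
        'delete' => array_values(array_diff($current, $submitted)),
    ];
}

$delta = diffTagLinks([1, 2, 4], [2, 4, 7, 7]);
// $delta['insert'] is [7], $delta['delete'] is [1]
```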

Is either method better than the other? Or is there a third option?


Solution

  • Any kind of duplicate key problem can be resolved at the DB level by adding a unique constraint on the relevant fields. All of your code's interactions with the tags should use the text label, which should serve as the unique identifier for the tag. A numeric ID serves no purpose for the application itself, so there is no need for it to peek out from behind the repository layer. This also removes the need to differentiate existing tags from new ones: the application doesn't care, and it treats tags as persisted Value Objects rather than worrying about any kind of Entity-style life cycle. In the repository call where the tag is associated with the article, create the tag if it is not already present. The ID mainly benefits performance in the JOIN(s) needed for tag queries (and really nowhere else), which is again something the application shouldn't care about outside of the repository that does the join.

    The safest and simplest bet for updating tags, including deletions, is to blast out the existing tags and write new ones. This ensures that the persisted state matches the UI input completely and consistently. Realistically, this is not an expensive operation, nor is it performed often enough to care about (though a simple programmatic check to see whether an update is needed would help prevent needless writes). It's 2 queries that should be wrapped in a transaction and could be batched together, and the DELETE in particular should be very cheap so long as the proper index is in place, so it's not the kind of multiple query you need to worry about.

    If for some bizarre reason you were overly anxious about minimizing the work of the database, you could store a version of the tags before and after the edit and run only the queries needed for the delta, but this is far more fragile and could also introduce many complex concurrency concerns.
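As a rough illustration of the points above (labels as the only identifier, create-if-missing inside the repository, delete-then-insert in one transaction), here is a minimal PDO sketch. Everything named here is an assumption rather than anything from the question: the function replaceItemTags, the tables tags(id, label) with a unique index on label, and item_tags(item_id, tag_id) with a composite primary key. The demo uses an in-memory SQLite database only so the snippet runs on its own; against MySQL the same statements should work, and the create step could instead use INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE.

```php
<?php
// Hypothetical repository function. Assumed schema:
//   tags(id, label)            with a UNIQUE index on label
//   item_tags(item_id, tag_id) with a composite primary key
// Assumes PDO is in exception mode (PDO::ERRMODE_EXCEPTION).
function replaceItemTags(PDO $pdo, int $itemId, array $labels): void
{
    // Trim, drop empty strings, de-duplicate. Labels are the only identifier used.
    $labels = array_unique(array_filter(array_map('trim', $labels), 'strlen'));

    $find   = $pdo->prepare('SELECT id FROM tags WHERE label = ?');
    $create = $pdo->prepare('INSERT INTO tags (label) VALUES (?)');
    $clear  = $pdo->prepare('DELETE FROM item_tags WHERE item_id = ?');
    $link   = $pdo->prepare('INSERT INTO item_tags (item_id, tag_id) VALUES (?, ?)');

    $pdo->beginTransaction();
    try {
        // Blast out the existing links, then write the submitted set.
        $clear->execute([$itemId]);
        foreach ($labels as $label) {
            $find->execute([$label]);
            $tagId = $find->fetchColumn();
            if ($tagId === false) {
                // Unknown label: create it. If a concurrent request created it
                // first, the unique index makes this fail instead of duplicating.
                $create->execute([$label]);
                $tagId = $pdo->lastInsertId();
            }
            $link->execute([$itemId, $tagId]);
        }
        $pdo->commit();
    } catch (Throwable $e) {
        $pdo->rollBack();
        throw $e;
    }
}

// Demo against in-memory SQLite (an assumption made purely for runnability).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE tags (id INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE)');
$pdo->exec('CREATE TABLE item_tags (item_id INTEGER NOT NULL, tag_id INTEGER NOT NULL, PRIMARY KEY (item_id, tag_id))');

replaceItemTags($pdo, 42, ['pink', 'pin cushion', 'pink']); // first save
replaceItemTags($pdo, 42, ['pink', '3']);                   // edit: drop one tag, add a numeric one
```

The application only ever passes labels in, so a tag named 3 is handled like any other label and no marker such as || is needed. If two requests race to create the same label, the sketch simply lets the second one fail and roll back; retrying the call would be one way to handle that.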