database ocr kofax

Search with a Database Locator on a single column with correct confidence?


I've been using Kofax Transformation Modules for 3 years and I'm still not sure how the Database Locator works.

I have a very simple database, with a bunch of columns. I have a very simple PDF document, OCR is done.

I want to retrieve one record from the database, based on the value of a single column. So if a value from this column is found on the document, I want the database locator to return the corresponding record, with 100% confidence (or whatever the OCR confidence is for this single value).

And last but not least, I want this confidence to work with the "minimum confidence" I set in the database locator's properties (general tab).

But it doesn't seem possible.
See, my PDF document contains a value, read by OCR, that is a 100% match for the database column.
The locator returns the record with a so-called 100% confidence, since I set the search mask on that single column.

But if I set the minimum confidence to anything higher than 34%, the record is not returned.

Why is that? How can it be fixed?
Do I really have to use a script locator to do what I want, which doesn't seem that complicated?


Solution

  • Unintuitive Confidence Value

    When the Database Locator runs, it tries to find records that best match the document OCR. The key to the behavior you see is that it first does the actual fuzzy search, returning records that meet the minimum confidence, and then the locator itself does additional processing: increasing or decreasing the confidence of records depending on whether they meet the Fields, Search Masks, or Regions settings defined in the locator.

    The upside to this behavior is memory and processing efficiency. The core fuzzy search index can quickly determine which records meet the initial confidence threshold, then the Database Locator only needs to load those into memory and do the post-processing. The alternative would be to load all records for post-processing, just in case it pushed a record's confidence above the threshold. That would be more intuitive, but less efficient.
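
    The order of operations described above can be illustrated with a small Python sketch. This is a toy model, not Kofax code: the `Candidate` class, the `two_stage_lookup` function, and the adjustment rule (set confidence to 1.0 when the search mask matches, otherwise halve it) are all hypothetical, and the raw score of 0.34 is just the threshold reported in the question. The only point is that the minimum confidence is applied before the locator adjusts the score.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    record_id: str
    raw_confidence: float  # score from the fuzzy search, 0.0 to 1.0 (illustrative)
    mask_match: bool       # whether the record satisfies the search mask


def two_stage_lookup(candidates, min_confidence):
    """Toy model: threshold on the raw score first, adjust confidence second."""
    # Stage 1: the fuzzy search only returns records whose raw score
    # meets the minimum confidence.
    survivors = [c for c in candidates if c.raw_confidence >= min_confidence]

    # Stage 2: post-processing changes the confidence of the survivors only.
    # The adjustment rule below is an assumption made for illustration.
    results = []
    for c in survivors:
        final = 1.0 if c.mask_match else c.raw_confidence * 0.5
        results.append((c.record_id, final))
    return results


candidates = [Candidate("A", raw_confidence=0.34, mask_match=True)]

at_threshold = two_stage_lookup(candidates, min_confidence=0.34)
above_threshold = two_stage_lookup(candidates, min_confidence=0.35)
print(at_threshold)     # record A is returned with the adjusted confidence
print(above_threshold)  # record A never reaches post-processing
```

    In this model, record "A" ends up at full confidence when the minimum is 0.34, yet disappears entirely at 0.35, because the adjusted value is never compared against the threshold.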

    Possible Configuration Improvement

    If you only intend to search on that one column, and the other columns are just data you want returned, then make sure that column is the only one indexed. When you open the properties of the database, it shows the field names with checkboxes. Any fields that are checked are indexed and are part of what the locator will try to find on the document. You could be getting lower confidence if you have a bunch of fields checked that aren't actually on the document, especially if you also have a non-zero value for the locator setting "Penalty for empty fields".
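
    To see why extra indexed columns can drag the initial score down, here is a deliberately simplified Python sketch. It assumes the raw score is the fraction of indexed columns whose value appears on the document, which is not Kofax's actual scoring formula; the function `raw_confidence`, the column names, and the sample values are hypothetical, and the "Penalty for empty fields" setting is not modeled.

```python
def raw_confidence(record, indexed_columns, document_words):
    """Toy score: share of indexed columns whose value is found on the document."""
    hits = sum(1 for col in indexed_columns if record[col] in document_words)
    return hits / len(indexed_columns)


# Hypothetical record and OCR output: only the invoice number is on the page.
record = {"invoice_no": "INV-1001", "vendor": "Acme", "city": "Lyon"}
document_words = {"INV-1001", "Total", "42.00"}

all_indexed = raw_confidence(record, ["invoice_no", "vendor", "city"], document_words)
one_indexed = raw_confidence(record, ["invoice_no"], document_words)
print(all_indexed, one_indexed)
```

    Under this assumption, indexing three columns when only one is on the document gives a raw score of about one third, while indexing only the searched column gives a full score that survives a high minimum confidence.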

    When using KSMS (Kofax Search and Matching Server), the indexed columns cannot be changed in Project Builder because KSMS builds and serves the index. Instead, open the Import Settings of the database in KSMS Administration, which has a "Columns to Use" section of checkboxes. If you configured the database by uploading a file rather than pointing to a UNC path, you will need to upload it again to change which columns are indexed.

    Context

    For anyone reading this as a traditional database question: a "database" in this context in KTM takes records from a CSV file or a relational database and indexes them for fuzzy matching. This core fuzzy search functionality is used in a few ways, one of which is the Database Locator.

    Documentation that treats Database Locator processing separately from the fuzzy search: the scripting help topic "Database Lookup in Specific Columns" shows how to use the fuzzy search from script (from a script window: Help > Scripting Help, then Script Samples > Database Lookup in Specific Columns). It also touches on the fuzzy search itself as separate from some of the other settings handled by the Database Locator.