I am doing a college project where I need to compare a string with a list of other strings. I want to know whether there is any kind of library which can do this.
Suppose I have a table called DOCTORS_DETAILS.
Other table names are HOSPITAL_DEPARTMENTS, DOCTOR_APPOINTMENTS, PATIENT_DETAILS, PAYMENTS, etc.
Now I want to calculate which of those is most relevant to DOCTOR_DETAILS. The expected output could be:
DOCTOR_APPOINTMENTS - more relevant, because the term DOCTOR appears in both strings
PATIENT_DETAILS - the term DETAILS is present in both strings
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
Therefore I want to find RELEVANCE based on the number of similar terms present in both of the strings in question.
Ex: DOCTOR_DETAILS -> DOCTOR_APPOINTMENT (1/2) > DOCTOR_ADDRESS_INFORMATION (1/3) > DOCTOR_SPECIALIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
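A rough sketch of the scoring I have in mind, splitting the table names on underscores and counting shared terms (the function name is just for illustration):

def term_overlap_score(main_name, candidate_name):
    # Fraction of the candidate's terms that also appear in the main table's name
    main_terms = set(main_name.upper().split("_"))
    candidate_terms = candidate_name.upper().split("_")
    shared = sum(1 for term in candidate_terms if term in main_terms)
    return shared / len(candidate_terms)

candidates = ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"]
# DOCTOR_APPOINTMENTS and PATIENT_DETAILS both score 1/2 here, the other two score 0
ranked = sorted(candidates, key=lambda name: term_overlap_score("DOCTOR_DETAILS", name), reverse=True)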
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they are all going to boil down to two steps:

1. Turn each piece of text into a vector (an embedding).
2. Compute the distance between those vectors.
Three possible ways to do step 1 are:

- plain bag-of-words / TF-IDF vectors (a quick sketch is just below)
- pretrained word embeddings such as word2vec or GloVe
- fasttext, which also uses subword (character n-gram) information
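As an illustration of the first option, here is a minimal sketch using scikit-learn's TfidfVectorizer; splitting the table names on underscores is just an assumption about how you'd tokenize them:

from sklearn.feature_extraction.text import TfidfVectorizer

table_names = ["DOCTORS_DETAILS", "HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS",
               "PATIENT_DETAILS", "PAYMENTS"]

# Treat each underscore-separated word as one token
vectorizer = TfidfVectorizer(analyzer=lambda name: name.lower().split("_"))
vectors = vectorizer.fit_transform(table_names).toarray()

# vectors[0] is DOCTORS_DETAILS; compare it to the other rows with the
# cos_sim function defined for step 2 below

Note that with exact tokens, DOCTORS and DOCTOR do not match at all, which is one reason the subword information in fasttext is attractive here.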
To do step 2, you almost certainly want to use cosine similarity. It is pretty straightforward in Python; here is an implementation from a blog post:
import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
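As a quick sanity check on toy vectors: identical vectors give a similarity of 1.0, orthogonal vectors give 0.0:

cos_sim(np.array([1.0, 2.0]), np.array([1.0, 2.0]))  # 1.0
cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # 0.0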
For your particular use case, my instincts say to use fasttext. The official site shows how to download some pretrained word vectors, but you will want to download a pretrained model instead (see this GH issue; use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip).
You'd then want to do something like:
import fasttext

model = fasttext.load_model("model_filename.bin")

def order_tables_by_name_similarity(main_table, candidate_tables):
    '''Note: we use a fasttext model, not just pretrained vectors, so we get subword information
    you can modify this to also output the distances if you need them
    '''
    main_v = model[main_table]
    similarity_to_main = lambda w: cos_sim(main_v, model[w])
    return sorted(candidate_tables, key=similarity_to_main, reverse=True)

order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!
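A rough sketch of that, assuming you have a plain-text file of domain text to train on (data.txt below is a hypothetical filename, and the hyperparameters are only illustrative):

import fasttext

# Train a small unsupervised skipgram model; the subword n-grams (minn/maxn) are what
# let it handle unseen or compound table names, and a small dim keeps the model tiny
small_model = fasttext.train_unsupervised("data.txt", model="skipgram", dim=50, minn=3, maxn=6)
small_model.save_model("small_model.bin")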