machine-learning, deep-learning, cluster-analysis

Is it possible to use JSON format input for BERT model?


I am trying to build one knowledge base (a single source of truth) from multiple web sources (e.g., wiki <-> fandom).

So I want to try a Siamese network, or calculate cosine similarity between BERT-embedded documents.

Can I ignore the JSON structure and train on the documents anyway?


Solution

  • Although BERT wasn't specifically trained to find similarity between JSON data, you could extract the values from each JSON document, concatenate them into one long string, and leave it to BERT to capture the context.

    Alternatively, you could compute a cosine similarity score for each pair of values that share a key across the two JSON documents, then aggregate those scores into a net similarity score for the pair.

    Also, see Sentence-BERT (SBERT), a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
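
The two approaches above (flattening the values into one string, and scoring shared keys one by one) could be sketched roughly as follows. This is only an illustration under several assumptions: the names `flatten_values`, `json_to_text`, `toy_embed`, and `per_key_similarity` are hypothetical, the sample documents are made up, each JSON document is assumed to be an object at the top level, and `toy_embed` is a bag-of-words stand-in so the snippet runs without downloading a model. In practice you would replace `toy_embed` with a real sentence encoder such as an SBERT model, and adapt `cosine` to the dense vectors it returns.

```python
import json
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def flatten_values(obj) -> List[str]:
    """Collect every leaf value of a parsed JSON value, in document order."""
    if isinstance(obj, dict):
        return [v for child in obj.values() for v in flatten_values(child)]
    if isinstance(obj, list):
        return [v for child in obj for v in flatten_values(child)]
    return [] if obj is None else [str(obj)]


def json_to_text(raw: str) -> str:
    """Approach 1: drop the keys and join all values into one string."""
    return " ".join(flatten_values(json.loads(raw)))


def toy_embed(text: str) -> Vector:
    """Placeholder embedding (word counts). NOT BERT; swap in a real encoder."""
    return Counter(text.lower().split())


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def per_key_similarity(
    raw_a: str, raw_b: str, embed: Callable[[str], Vector] = toy_embed
) -> float:
    """Approach 2: score each shared top-level key, then average the scores."""
    a, b = json.loads(raw_a), json.loads(raw_b)
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    scores = [
        cosine(
            embed(" ".join(flatten_values(a[k]))),
            embed(" ".join(flatten_values(b[k]))),
        )
        for k in shared
    ]
    return sum(scores) / len(scores)


# Made-up sample documents, standing in for a wiki page and a fandom page.
wiki_doc = '{"name": "Link", "series": "The Legend of Zelda", "weapons": ["Master Sword", "bow"]}'
fandom_doc = '{"name": "Link", "series": "Legend of Zelda", "weapons": ["Master Sword"]}'

print(json_to_text(wiki_doc))
print(cosine(toy_embed(json_to_text(wiki_doc)), toy_embed(json_to_text(fandom_doc))))
print(per_key_similarity(wiki_doc, fandom_doc))
```

Averaging the per-key scores is just one possible aggregation; weighting keys by importance, or taking the minimum, may suit a deduplication task better. Also keep in mind that BERT-style models truncate long inputs, so a very long flattened string may lose its tail.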