I'm trying to compare two different texts—one coming from a Curriculum Vitae (CV) and the other from a job announcement.
After cleaning the texts, I'm trying to compare them to determine how closely a job announcement matches a specific CV.
I am trying to do this using similarity matching in spaCy via the following code:
similarity = pdf_text.similarity(final_text_from_annonce)
This works, but I'm getting strange results from two different CVs for the same job announcement: both get the same similarity score (~0.6), even though one should clearly score higher than the other.
I checked the spaCy website and found this very important sentence:
Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.
So, what do I need to use or code to make spaCy compare my two texts based on their meaning instead of the occurrence of words?
I am expecting a parameter for spaCy's similarity function, or another function, that will compare both texts and calculate a similarity score based on the meaning of the texts rather than on whether the same words are used.
By default, spaCy determines semantic similarity by averaging the word embeddings of the words in a sentence. This can be thought of as a naive sentence-embedding approach. It can work, but if you use it, it is recommended that you first filter out non-meaningful words (e.g. stop words) to prevent them from undesirably influencing the final sentence embedding.
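To see concretely why vector averaging is insensitive to word order, here is a toy sketch with made-up two-dimensional word vectors (not real embeddings): two sentences containing the same words in a different order average to the exact same document vector, so their cosine similarity is 1.0 even though their meanings differ.

```python
import numpy as np

# Toy word vectors for illustration only (real embeddings have hundreds of dimensions)
vectors = {
    "dog": np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man": np.array([1.0, 1.0]),
}

def doc_vector(words):
    # Naive sentence embedding: the mean of the word vectors
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = doc_vector(["dog", "bites", "man"])
v2 = doc_vector(["man", "bites", "dog"])
print(cosine(v1, v2))  # 1.0: same words, different order, identical average vector
```

This is exactly the failure mode the spaCy documentation warns about: averaging discards word order, so shared vocabulary dominates the score.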
The alternative (and more reliable) solution is to use a different spaCy pipeline built around embeddings from a dedicated sentence encoder, e.g. the Universal Sentence Encoder (USE) [1] by Cer et al. Martino Mensio created a package called spacy-universal-sentence-encoder that wraps this model. Install it via the following command in your command prompt:
pip install spacy-universal-sentence-encoder
Then you can compute the semantic similarity between sentences as follows:
import spacy_universal_sentence_encoder
# Load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')
# Create two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# Use the similarity method to compare the full documents (i.e. sentences)
print(doc_1.similarity(doc_2)) # Output: 0.9356049733134972
# Or make the comparison using a predefined span of the second document
print(doc_1.similarity(doc_2[0:7])) # Output: 0.9739387861159459
As a side note, when you run the nlp = spacy_universal_sentence_encoder.load_model('en_use_lg') command for the first time, you may have to do so with administrator rights so that TensorFlow can create the models folder in C:\Program Files\Python310\Lib\site-packages\spacy_universal_sentence_encoder and download the appropriate model. Otherwise, you may get a PermissionDeniedError and the code will not run.
[1] Cer, D., Yang, Y., Kong, S.Y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C. and Sung, Y.H., 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.