What I'm trying to achieve:
I have been looking for an approach for a long while now but I'm not able to find an (effective) way to this:
What I tried:
Python:
nltk
in combination with gensim
(as far as I could code and read it was only capable to use word similarity (but not taking order into
account).
R:
used tm
to build a TermDocumentMatrix
which looked really promising but was not able to map anything to this matrix. Further this TermDocumentMatrix
seems to take the order into account but misses the synonyms (I think).
I know the lemmatization didn't go that well hahah :)
Question:
Is there any way to do achieve the steps described above using either R or Python? A simple sample code would be great (or references to a good tutorial)
There are many ways to do what you described above, and it will of course take lots of testing to find an optimized solution. But here is some helpful functionality to help solve this using python/nltk.
build a model from example sentences while taking word order and synonyms into account.
1. Tokenization
In this step you will want to break down individual sentences into a list of words.
Sample code:
import nltk
tokenized_sentence = nltk.word_tokenize('this is my test sentence')
print(tokenized_sentence)
['this', 'is', 'my', 'test', 'sentence']
2. Finding synonyms for each word.
Sample code:
from nltk.corpus import wordnet as wn
synset_list = wn.synsets('motorcar')
print(synset_list)
[Synset('car.n.01')]
Feel free to research synsets if you are unfamiliar, but for now just know the above returns a list, so multiple synsets are possibly returned.
From the synset you can get a list of synonyms.
Sample code:
print( wn.synset('car.n.01').lemma_names() )
['car', 'auto', 'automobile', 'machine', 'motorcar']
Great, now you are able to convert your sentence into a list of words, and you're able to find synonyms for all words in your sentences (while retaining the order of your sentence). Also, you may want to consider removing stopwords and stemming your tokens, so feel free to look up those concepts if you think it would be helpful.
You will of course need to write the code to do this for all sentences, and store the data in some data structure, but that is probably outside the scope of this question.
map a sentence against this model and get a similarity score (thus a score indicating how much this sentence fits the model, in other words fits the sentences which were used to train the model)
This is difficult to answer since the possibilities to do this are endless, but here are a few examples of how you could approach it.
If you're interested in binary classification you could do something as simple as, Have I seen this sentence of variation of this sentence before (variation being same sentence but words replaced by their synonyms)? If so, score is 1, else score is 0. This would work, but may not be what you want.
Another example, store each sentence along with synonyms in a python dictionary and calculate score depending on how far down the dictionary you can align the new sentence.
Example:
training_sentence1 = 'This is my awesome sentence'
training_sentence2 = 'This is not awesome'
And here is a sample data structure on how you would store those 2 sentences:
my_dictionary = {
'this': {
'is':{
'my':{
'awesome': {
'sentence':{}
}
},
'not':{
'awesome':{}
}
}
}
}
Then you could write a function that traverses that data structure for each new sentence, and depending how deep it gets, give it a higher score.
Conclusion:
The above two examples are just some possible ways to approach the similarity problem. There are countless articles/whitepapers about computing semantic similarity between text, so my advice would be just explore many options.
I purposely excluded supervised classification models, since you never mentioned having access to labelled training data, but of course that route is possible if you do have a gold standard data source.