Bleu_score in NLTK library


I am new to the nltk library. I want to find the two most similar strings in a set, and to do so I used bleu_score as follows:

import nltk
from nltk.translate import bleu
from nltk.translate.bleu_score import SmoothingFunction
smoothie = SmoothingFunction().method4


C1 = 'FISSEN Ltds'
C2 = 'FISSEN Ltds Maschinen- und Werkzeugbau'
C3 = 'V.R.P. Baumaschinen Ltds'
print('BLEUscore1:',bleu([C1], C2, smoothing_function=smoothie, auto_reweigh=False))
print('BLEUscore2:',bleu([C2], C3, smoothing_function=smoothie, auto_reweigh=False))
print('BLEUscore3:',bleu([C1], C3, smoothing_function=smoothie, auto_reweigh=False))

The output is like this:

BLEUscore1: 0.2585784506653774
BLEUscore2: 0.26042143846335913
BLEUscore3: 0.1472821272412462

Why do the results rank C2 and C3 as the most similar pair, when C1 and C2 are clearly the closest match? And what is the best way to measure similarity between two strings so that C1 and C2 come out on top?

I appreciate any help you can provide :)


Solution
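
On the "why": nltk's BLEU functions expect tokenized input (lists of tokens). When they are given plain strings, as in the question, the strings are iterated character by character, so the scores are built from character n-grams. BLEU is also precision-oriented and depends on direction: it counts how many n-grams of the hypothesis (second argument) occur in the reference (first argument). With the short C1 as reference and the long C2 as hypothesis, most of C2 has nothing to match, even though C1 is fully contained in C2. The sketch below illustrates only the clipped n-gram precision at the core of BLEU, using the standard library. The helper names `ngrams` and `modified_precision` are illustrative, not nltk's API, and the sketch leaves out smoothing and the brevity penalty, so it does not reproduce the scores above.

```python
from collections import Counter

def ngrams(seq, n):
    # Works on strings (characters) and on lists of tokens alike.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def modified_precision(reference, hypothesis, n):
    # Share of hypothesis n-grams that also occur in the reference,
    # with each n-gram's count clipped to its count in the reference.
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum((hyp_counts & ref_counts).values())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

C1 = 'FISSEN Ltds'
C2 = 'FISSEN Ltds Maschinen- und Werkzeugbau'

# Character level: what happens when raw strings are passed in.
char_p1 = modified_precision(C1, C2, 1)
# Word level: what BLEU is designed for (whitespace tokenization assumed here).
word_p1 = modified_precision(C1.split(), C2.split(), 1)

print(char_p1, word_p1)
```

Either way, only a fraction of the hypothesis C2 is covered by the reference C1, which is why BLEU is a poor fit for "is one name contained in the other" comparisons.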

  • You can try SequenceMatcher from the standard-library difflib module:

    from difflib import SequenceMatcher
    
    C1 = 'FISSEN Ltds'
    C2 = 'FISSEN Ltds Maschinen- und Werkzeugbau'
    C3 = 'V.R.P. Baumaschinen Ltds'
    
    print(SequenceMatcher(None, C1, C2).ratio())
    print(SequenceMatcher(None, C2, C3).ratio())
    print(SequenceMatcher(None, C1, C3).ratio())
    
    # Output ->
    # 0.4489795918367347
    # 0.3548387096774194
    # 0.2857142857142857
    

    Hope this helps.
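
Since the goal in the question is to find the two most similar strings, the pairwise SequenceMatcher comparison can be wrapped in a small helper that scores every pair and returns the best one. This is a minimal sketch: `most_similar_pair` is a hypothetical name, and it assumes the list holds at least two strings.

```python
from difflib import SequenceMatcher
from itertools import combinations

def most_similar_pair(strings):
    """Return (score, a, b) for the pair with the highest SequenceMatcher ratio."""
    return max(
        (SequenceMatcher(None, a, b).ratio(), a, b)
        for a, b in combinations(strings, 2)
    )

names = [
    'FISSEN Ltds',
    'FISSEN Ltds Maschinen- und Werkzeugbau',
    'V.R.P. Baumaschinen Ltds',
]

score, a, b = most_similar_pair(names)
print(a, '|', b, '|', round(score, 3))
```

Note that this compares every pair, so the work grows quadratically with the number of strings; for a handful of company names that is not a concern.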