How do I determine the BLEU weights, and what do they depend on?

I'm trying to calculate an n-gram BLEU score using Python. The weights I used for uni-gram, bi-gram, tri-gram, and 4-gram are (0.25, 0.25, 0, 0).

When I run the script for the first reference, it gives me a BLEU score of 0.51.

The script is:

from nltk.translate.bleu_score import sentence_bleu

# Weights for uni-gram, bi-gram, tri-gram, and 4-gram precision
# (here only uni-grams and bi-grams contribute; tri-grams and 4-grams are ignored)
weights = (0.25, 0.25, 0, 0)

# First reference (a list of tokenized references) and the tokenized prediction
reference = [["the", "alleyway", "barely", "lives", "in", "semi", "isolation"]]
predictions = ["midaq", "alley", "lives", "in", "almost", "complete", "isolation"]

# Calculate the BLEU score with the custom weights
score = sentence_bleu(reference, predictions, weights=weights)
print(score)

But when I run the same script for the second reference, it gives a BLEU score of 6.91.

The script is:

from nltk.translate.bleu_score import sentence_bleu

# Weights for uni-gram, bi-gram, tri-gram, and 4-gram precision
# (here only uni-grams and bi-grams contribute; tri-grams and 4-grams are ignored)
weights = (0.25, 0.25, 0, 0)

# Second reference (a list of tokenized references) and the same tokenized prediction
reference = [["the", "alley", "is", "almost", "living", "in", "a", "state", "of", "isolation"]]
predictions = ["midaq", "alley", "lives", "in", "almost", "complete", "isolation"]

# Calculate the BLEU score with the custom weights
score = sentence_bleu(reference, predictions, weights=weights)
print(score)

Why is there such a big difference when the weights and the code are the same? How do I determine the weights? Are there any specific criteria?

Solution

As mentioned here:

Only big differences in metric scores are meaningful in MT

If System A has a BLEU score that is 1-2 points higher than System B (common in academic papers), then there is only a 50% chance that human evaluators will prefer System A over System B.

If System A has a BLEU score that is 3-5 points higher than System B, there is a 75% chance that human evaluators will prefer A over B.

In order to get a 95% chance that human evaluators will prefer A over B, we need something like a 10 point improvement in BLEU (they don't state this, I am guessing this by eyeballing their graphs).

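So a difference of 6.4 would be acceptable if both scores were on the same scale. Note, however, that sentence_bleu (assuming this is NLTK's) returns a value between 0 and 1, so the 6.91 you saw is most likely the leading digits of a number printed in scientific notation, in other words a score that is effectively zero.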
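It is worth checking whether 6.91 is the whole printed value. The arithmetic below is a rough sketch, not NLTK's code. It assumes that unsmoothed sentence_bleu substitutes the smallest positive float (sys.float_info.min) for an n-gram precision of zero, and it uses precisions counted by hand for the second reference: 4 of 7 unigrams match and no bigram matches. The variable names are made up for illustration.

```python
import math
import sys

# Hand-counted values for the second reference (illustrative names).
p1 = 4 / 7                   # 4 of the 7 predicted unigrams occur in the reference
p2 = sys.float_info.min      # assumed stand-in for a bigram precision of zero
bp = math.exp(1 - 10 / 7)    # brevity penalty: 7 predicted tokens vs. 10 reference tokens

# Weighted geometric mean with weights (0.25, 0.25, 0, 0); zero weights drop out.
score = bp * math.exp(0.25 * math.log(p1) + 0.25 * math.log(p2))
print(score)
```

Under these assumptions the result is on the order of 1e-78, with leading digits close to 6.9. That would explain a printout that starts with 6.91 and ends in an exponent that is easy to overlook.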

Your two inputs are quite different, and each is only a single short sentence. So of course the scores are different, even though the weights are the same.
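To see where the two inputs differ, it can help to look at the clipped n-gram precisions directly. The sketch below is a minimal pure-Python approximation of unsmoothed, single-reference BLEU, not NLTK's implementation. The helper names (ngrams, modified_precision, simple_bleu) are hypothetical, and the helper returns 0.0 when a weighted precision is zero instead of a tiny float.

```python
import math
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # Credit each predicted n-gram at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return Fraction(clipped, total) if total else Fraction(0)


def simple_bleu(reference, hypothesis, weights):
    """Hypothetical helper: unsmoothed BLEU for one reference (0.0 on a zero precision)."""
    log_sum = 0.0
    for n, weight in enumerate(weights, start=1):
        if weight == 0:
            continue  # orders with zero weight do not affect the score
        precision = modified_precision(reference, hypothesis, n)
        if precision == 0:
            return 0.0
        log_sum += weight * math.log(float(precision))
    ref_len, hyp_len = len(reference), len(hypothesis)
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_sum)


hypothesis = ["midaq", "alley", "lives", "in", "almost", "complete", "isolation"]
reference_1 = ["the", "alleyway", "barely", "lives", "in", "semi", "isolation"]
reference_2 = ["the", "alley", "is", "almost", "living", "in", "a", "state", "of", "isolation"]

for name, reference in [("reference 1", reference_1), ("reference 2", reference_2)]:
    p1 = modified_precision(reference, hypothesis, 1)
    p2 = modified_precision(reference, hypothesis, 2)
    print(name, "unigram precision:", p1, "bigram precision:", p2)
    print("  weights (0.25, 0.25, 0, 0):", simple_bleu(reference, hypothesis, (0.25, 0.25, 0, 0)))
    print("  weights (0.5, 0.5):", simple_bleu(reference, hypothesis, (0.5, 0.5)))
```

With the first reference, 3 of 7 unigrams and 1 of 6 bigrams match, which gives about 0.517 and agrees with the 0.51 in the question. With the second reference, 4 of 7 unigrams match but no bigram does, so any nonzero bigram weight pulls the unsmoothed score down to (effectively) zero.

As for choosing the weights: by convention they sum to 1, for example (0.5, 0.5) for BLEU-2 or (0.25, 0.25, 0.25, 0.25) for the usual BLEU-4. The weights (0.25, 0.25, 0, 0) sum to 0.5, so the result is the square root of BLEU-2 rather than BLEU-2 itself. For single short sentences, NLTK also provides a SmoothingFunction class in nltk.translate.bleu_score that can be passed through the smoothing_function argument of sentence_bleu, so that one missing n-gram order does not zero out the score.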