# In the ROUGE metrics, what do the low, mid and high values mean?

## Randomness in ROUGE

The ROUGE metrics were introduced to "automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans" [1].

When calculating any ROUGE metric you get an `AggregateScore` object with three fields: `low`, `mid`, and `high`. How are these aggregate values calculated?

For example, from the huggingface implementation [2]:

```
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
>>> print(results["rouge1"].mid.fmeasure)
1.0
```

Edit: On July 7th, the huggingface implementation was simplified to return a cleaner, easier-to-understand dict: https://github.com/huggingface/evaluate/issues/148

## Solution

Given a list of (summary, gold_summary) pairs, each ROUGE metric is calculated for every item in the list. In huggingface, you can opt out of the aggregation step by passing `use_aggregator=False` and get the per-item scores returned.

For the aggregation, bootstrap resampling is used [1, 2]. Bootstrap resampling is a technique for estimating confidence intervals [3, 4]. The idea is that given `n` samples, you draw `x` resamples with replacement, each of size `n`, and calculate some statistic for each resample. This yields a new distribution, called the empirical bootstrap distribution, which can be used to extract confidence intervals.
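As a minimal sketch of the idea in plain Python (the sample values and the choice of the mean as the statistic here are illustrative, not taken from the ROUGE implementation):

```python
import random

random.seed(0)

# n observed samples (made-up per-item scores for illustration)
samples = [0.42, 0.55, 0.61, 0.48, 0.70, 0.52, 0.66, 0.59]
n = len(samples)

def mean(xs):
    return sum(xs) / len(xs)

# Draw x resamples with replacement, each of size n, and compute
# the statistic (here: the mean) for each resample.
x = 1000
bootstrap_distribution = [mean(random.choices(samples, k=n)) for _ in range(x)]

# The empirical bootstrap distribution can now be used to
# estimate a confidence interval for the statistic.
bootstrap_distribution.sort()
low = bootstrap_distribution[int(0.025 * x)]
high = bootstrap_distribution[int(0.975 * x)]
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

Because every resample mean is bounded by the smallest and largest observed values, the resulting interval always lies within the range of the original samples.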

In the ROUGE implementation by google [4], they used:

- `n` for the number of resamples to run
- `mean` for the resample statistic
- the `2.5th, 50th and 97.5th percentiles` to calculate the values for low, mid and high, respectively (can be controlled with the `confidence_interval` param)
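Putting those pieces together, the aggregation can be sketched as follows. This is an approximation for illustration, not the actual google-research code, and the per-item f-measures are made up:

```python
import random

random.seed(42)

# Per-item ROUGE f-measures, as you would get with use_aggregator=False
# (values here are made up for illustration).
fmeasures = [0.31, 0.45, 0.52, 0.38, 0.60, 0.44, 0.57, 0.49, 0.41, 0.55]

def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already-sorted list."""
    idx = min(int(q / 100 * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]

# Draw bootstrap resamples and take the mean of each one.
n_resamples = 1000
resample_means = sorted(
    sum(random.choices(fmeasures, k=len(fmeasures))) / len(fmeasures)
    for _ in range(n_resamples)
)

# 2.5th, 50th and 97.5th percentiles -> low, mid, high
# (corresponding to a 95% confidence interval).
low = percentile(resample_means, 2.5)
mid = percentile(resample_means, 50)
high = percentile(resample_means, 97.5)
print(f"low={low:.3f} mid={mid:.3f} high={high:.3f}")
```

Note that `mid` is the median of the resample means, not the plain mean of the per-item scores, which is why it can shift slightly from run to run without a fixed seed.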

Note that due to the bootstrapping technique used in ROUGE, the aggregation is non-deterministic and can return different results on each run (see [5]). If you still want to use bootstrapping but need reproducible results, you can set the seed in the load function: `evaluate.load('rouge', seed=42)`.
