An AI team is choosing metrics to automatically evaluate the quality of summarization and translation. Which mappings of automatic text-generation evaluation metrics to their representative target tasks are correct? (Choose TWO.)

Correct: A, B

Explanation

The question asks for the TWO correct mappings of metric to task.

  • 1. "automatic text-generation evaluation metrics to their representative target tasks": the question is about the combination of metric and task.
  • 2. "Which mappings ... are correct": ROUGE = summarization and BLEU = translation.
A: Correct

ROUGE = summarization quality evaluation

Correct. ROUGE is a metric that measures how much the generated summary overlaps with a reference summary and is mainly used for quality evaluation of summarization tasks.
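The overlap idea can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and a single reference; the function name `rouge_n` and the example sentences are made up for this sketch, and production ROUGE implementations add stemming, multiple references, and variants such as ROUGE-L.

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Simplified ROUGE-N between one candidate and one reference (illustrative only)."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()  # assumption: naive whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)      # share of the reference that is covered
    precision = overlap / max(sum(cand.values()), 1)  # share of the candidate that is supported
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference_summary = "the cat sat on the mat"
generated_summary = "the cat sat"
scores = rouge_n(generated_summary, reference_summary)
print(scores)
```

Here the generated summary covers half of the reference unigrams, so recall is 0.5 while precision is 1.0. The recall component is what ties ROUGE to summarization: it rewards a summary for covering the content of the reference.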

B: Correct

BLEU = machine translation quality evaluation

Correct. BLEU is a metric that measures how much the generated translation matches a reference translation and is mainly used for quality evaluation of machine translation tasks.
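As a rough sketch of how that match is computed, the following simplified sentence-level score combines clipped n-gram precision with a brevity penalty. It assumes whitespace tokenization, a single reference, and n-grams only up to 2; the name `simple_bleu` is hypothetical. Standard BLEU is normally computed over a whole corpus with n-grams up to 4.

```python
import math
from collections import Counter

def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified sentence-level BLEU with one reference (illustrative only)."""
    cand = candidate.lower().split()  # assumption: naive whitespace tokenization
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum((cand_ngrams & ref_ngrams).values())
        total = sum(cand_ngrams.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)

reference_translation = "the cat is on the mat"
exact = simple_bleu("the cat is on the mat", reference_translation)
partial = simple_bleu("the cat sat on the mat", reference_translation)
repeated = simple_bleu("the the the the the the", reference_translation)
print(exact, partial, repeated)
```

Unlike ROUGE's recall emphasis, this score is precision-oriented: it asks how much of the generated translation is supported by the reference, and the brevity penalty stops very short outputs from scoring well.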

C: Incorrect

ROUGE = machine translation quality evaluation

The mapping is reversed. ROUGE is the representative metric for evaluating summarization quality; the representative metric for machine translation is BLEU.

Swapping the target tasks of ROUGE and BLEU is a classic trap.

D: Incorrect

BLEU = summarization quality evaluation

The mapping is reversed. BLEU is the representative metric for evaluating machine translation quality; the representative metric for summarization is ROUGE.

E: Incorrect

BERTScore = inference speed evaluation

BERTScore is a metric that evaluates semantic closeness using embeddings.

It does not measure inference speed, so the mapping is incorrect.
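To make "semantic closeness using embeddings" concrete, here is a toy sketch of the greedy cosine-matching idea behind BERTScore. The 3-D vectors in `TOY_EMBEDDINGS` are invented for illustration and the function names are hypothetical; the real metric uses contextual token embeddings from a pretrained model such as BERT.

```python
import math

# Hypothetical hand-made vectors, NOT real model embeddings.
TOY_EMBEDDINGS = {
    "car": (1.0, 0.0, 0.0),
    "automobile": (0.9, 0.1, 0.0),
    "fast": (0.0, 1.0, 0.0),
    "quick": (0.1, 0.9, 0.0),
    "banana": (0.0, 0.0, 1.0),
}

def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_embedding_f1(candidate_tokens, reference_tokens) -> float:
    """Match each token to its most similar token on the other side, then average."""
    cand = [TOY_EMBEDDINGS[t] for t in candidate_tokens]
    ref = [TOY_EMBEDDINGS[t] for t in reference_tokens]
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference_tokens = ["fast", "car"]
paraphrase_score = greedy_embedding_f1(["quick", "automobile"], reference_tokens)
unrelated_score = greedy_embedding_f1(["banana", "banana"], reference_tokens)
print(paraphrase_score, unrelated_score)
```

The paraphrase shares no words with the reference, so an n-gram metric would give it zero, yet it scores close to 1 here because its vectors point in nearly the same directions. Nothing in the computation involves timing, which is why the mapping to inference speed does not hold.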

Key Takeaway

Automatic text-generation evaluation metrics differ by what they compare.
ROUGE: measures the n-gram overlap between a generated summary and a reference summary. Mainly for summarization evaluation.
BLEU: measures the n-gram match between a generated translation and a reference translation. Mainly for machine translation evaluation.
BERTScore: measures semantic closeness using embeddings. It scores high when the meaning matches even if the wording differs.
Swapping the target tasks (summarization / translation) of ROUGE and BLEU is a classic trap.