An AI team finds that surface-match metrics alone are too limiting for evaluating heavily paraphrased generated text. Which metric evaluates the quality of generated text not only by surface-level word matching but also by semantic closeness using embeddings?

Correct answer: B

Explanation

This question asks which metric evaluates generated text by semantic closeness rather than by word overlap alone.

  • 1. "not only by surface-level word matching": word matching alone is not enough
  • 2. "semantic closeness using embeddings": embedding-based evaluation = BERTScore
A: Incorrect

ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is an evaluation metric for summarization tasks that counts how much the words and n-grams (consecutive words) overlap between a human-written reference summary and a generated summary. The more overlap, the higher the score.

Because ROUGE relies on surface-level word matching, a paraphrase with the same meaning tends to score lower. It does not evaluate semantic closeness (that is what BERTScore does), so this option is incorrect.
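The overlap counting described above can be sketched in a few lines. This is a minimal illustration of ROUGE-N recall only, assuming whitespace tokenization and lowercasing. The function name `rouge_n_recall` and the example sentences are made up for this sketch, and real ROUGE implementations add options such as stemming and ROUGE-L.

```python
from collections import Counter


def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Share of reference n-grams that also appear in the candidate."""

    def ngrams(text: str) -> Counter:
        # Simplifying assumption: lowercase and split on whitespace.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match so a repeated candidate n-gram cannot count
    # more often than it occurs in the reference.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the car is fast"
exact = rouge_n_recall(reference, "the car is fast")
paraphrase = rouge_n_recall(reference, "the automobile is quick")
print(exact, paraphrase)  # 1.0 0.5
```

The paraphrase keeps the meaning but shares only "the" and "is" with the reference, so its score drops by half.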

B: Correct

BERTScore

Correct. BERTScore is an evaluation metric that converts each token of the generated and reference text into contextual embeddings (numeric vectors representing meaning) using a language model such as BERT. It matches each token to its most similar token in the other text by cosine similarity between the vectors, then aggregates the matches into precision, recall, and F1 scores. Even when the words differ, the score is high if the meaning is close, so it handles paraphrases such as "car → automobile." It complements ROUGE and BLEU, which look only at surface-level word matching.
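The matching idea can be sketched without a real language model. The example below is only an illustration: `TOY_EMBEDDINGS` is a hypothetical table of hand-made 2-D vectors, whereas BERTScore itself uses contextual embeddings produced by a model such as BERT. The function `greedy_f1` is likewise a name invented for this sketch.

```python
import math

# Hypothetical, hand-made 2-D vectors for illustration only.
# Real BERTScore obtains contextual embeddings from a language model.
TOY_EMBEDDINGS = {
    "the": (1.0, 0.0),
    "car": (0.0, 1.0),
    "automobile": (0.1, 0.99),  # deliberately placed close to "car"
    "banana": (0.7, -0.7),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_f1(reference: str, candidate: str) -> float:
    ref = [TOY_EMBEDDINGS[t] for t in reference.split()]
    cand = [TOY_EMBEDDINGS[t] for t in candidate.split()]
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    if precision + recall <= 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


paraphrase_score = greedy_f1("the car", "the automobile")
unrelated_score = greedy_f1("the car", "the banana")
print(paraphrase_score > unrelated_score)  # True
```

With these toy vectors, "automobile" matches "car" almost perfectly even though the strings differ, while "banana" finds no good match, which is the behavior surface-match metrics cannot provide.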

C: Incorrect

BLEU

BLEU (Bilingual Evaluation Understudy) is a machine translation evaluation metric based on n-gram (consecutive word) matching with a reference translation.

BLEU is surface-match based and does not use embeddings to evaluate meaning, so this option is incorrect.
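The n-gram matching at the core of BLEU can be sketched as a clipped ("modified") n-gram precision. This is a simplified illustration assuming whitespace tokenization and a single reference. Full BLEU also combines the precisions for several n-gram sizes with a geometric mean and applies a brevity penalty, both omitted here. The name `modified_precision` and the sentences are illustrative.

```python
from collections import Counter


def modified_precision(reference: str, candidate: str, n: int = 1) -> float:
    """Share of candidate n-grams found in the reference, with clipped counts."""

    def ngrams(text: str) -> Counter:
        # Simplifying assumption: lowercase and split on whitespace.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clipping: a candidate n-gram is credited at most as many times
    # as it appears in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


reference = "the cat is on the mat"
repeated = modified_precision(reference, "the the the the")
bigram = modified_precision(reference, "the cat sat on the mat", n=2)
print(repeated, bigram)  # 0.5 0.6
```

Clipping keeps a candidate from scoring well by repeating a common word: "the" occurs twice in the reference, so only two of the four repetitions are credited.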

D: Incorrect

Perplexity

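Perplexity is a metric that represents how confidently a language model predicts the next word (the certainty of the prediction). It is computed as the exponential of the average negative log-probability the model assigns to each actual next token. A lower value means the next-word predictions are more certain, which indicates that the model captures the language well.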

It is not a metric that measures the semantic closeness between generated and reference text, so it is incorrect.
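As a small worked example, perplexity can be computed directly from the probabilities a model assigned to the tokens that actually occurred. The probability lists below are made-up values for illustration, not output from any real model.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities assigned to each correct next token.
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])
print(round(confident, 6), round(uncertain, 6))  # 2.0 10.0
```

Note that the calculation needs only the model's own probabilities and no reference text, which is why perplexity cannot measure how close a generated sentence is to a reference.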

Key Takeaway

Remember the representative metrics for text generation evaluation.
  • ROUGE: For summarization evaluation, counts the overlap of words and n-grams with a reference summary (surface-match based).
  • BLEU: For machine translation evaluation, looks at n-gram matching with a reference translation (surface-match based).
  • BERTScore: Converts generated and reference text into embeddings (vectors) and evaluates by semantic closeness (robust to paraphrasing).
ROUGE and BLEU are surface-match metrics, so they tend to penalize paraphrases. When you want to evaluate by semantic closeness, use BERTScore.