EVALUATING LARGE LANGUAGE MODELS (LLMS): COMPARISON METRICS AND THEIR IMPACT ON GENERATED TEXT QUALITY

Authors

  • CERÓN-LÓPEZ MARCO-TULIO, PEÑA-AGUILAR JUAN-MANUEL, MACÍAS-TREJO LUIS-GUADALUPE, PANTOJA-AMARO LUIS-FERNANDO, BAUTISTA-LUIS LAURA

Abstract

Large Language Models (LLMs) have revolutionized artificial intelligence, enabling the generation of coherent and contextually relevant text. However, evaluating their performance requires robust metrics tailored to diverse tasks. This article reviews the metrics most commonly used to compare LLMs, such as Perplexity, BLEU, ROUGE, F1-Score, and Human Assessment, highlighting their advantages and limitations. Through a systematic literature review and comparative analysis, the most appropriate metrics for specific tasks, such as machine translation, text summarization, and dialogue, are identified. The results show that, although automatic metrics are useful, Human Assessment remains indispensable for capturing qualitative aspects such as coherence and fluency. This work contributes to the field by proposing an integrated framework for the evaluation of LLMs that combines automatic and human metrics, and it suggests future lines of research to improve accuracy and ethics in text generation.
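To make the abstract's metric list concrete, the following minimal sketch (not taken from the article) illustrates how three of the automatic metrics mentioned might be computed in practice: Perplexity from per-token log-probabilities, sentence-level BLEU via NLTK, and a token-overlap F1 score. The library choice, example sentences, and log-probability values are assumptions for illustration only.

```python
# Illustrative sketch of three automatic metrics discussed in the abstract.
# Assumes NLTK is installed; all example data below is hypothetical.
import math
from collections import Counter

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)


def token_f1(reference_tokens, candidate_tokens):
    """Token-overlap F1 between a reference and a candidate (SQuAD-style)."""
    common = Counter(reference_tokens) & Counter(candidate_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    candidate = "the cat is on the mat".split()

    # Sentence-level BLEU with smoothing so short sentences do not collapse to zero.
    bleu = sentence_bleu([reference], candidate,
                         smoothing_function=SmoothingFunction().method1)
    print(f"BLEU:       {bleu:.3f}")
    print(f"Token F1:   {token_f1(reference, candidate):.3f}")

    # Hypothetical per-token log-probabilities from a language model.
    log_probs = [-1.2, -0.8, -2.1, -0.5, -1.7]
    print(f"Perplexity: {perplexity(log_probs):.3f}")
```

As the abstract notes, scores like these capture surface overlap or model confidence, which is why the authors argue that Human Assessment is still needed for qualities such as coherence and fluency.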

How to Cite

CERÓN-LÓPEZ MARCO-TULIO, PEÑA-AGUILAR JUAN-MANUEL, MACÍAS-TREJO LUIS-GUADALUPE, PANTOJA-AMARO LUIS-FERNANDO, BAUTISTA-LUIS LAURA. (2025). EVALUATING LARGE LANGUAGE MODELS (LLMS): COMPARISON METRICS AND THEIR IMPACT ON GENERATED TEXT QUALITY. TPM – Testing, Psychometrics, Methodology in Applied Psychology, 32(S8), 608–615. Retrieved from https://tpmap.org/submission/index.php/tpm/article/view/2693