EVALUATING LARGE LANGUAGE MODELS (LLMS): COMPARISON METRICS AND THEIR IMPACT ON GENERATED TEXT QUALITY
Abstract
Large Language Models (LLMs) have revolutionized artificial intelligence, enabling the generation of coherent and contextually relevant text. However, evaluating their performance requires robust metrics tailored to different tasks. This article discusses the metrics most commonly used to compare LLMs, such as Perplexity, BLEU, ROUGE, F1-Score, and Human Assessment, highlighting their advantages and limitations. Through a systematic literature review and comparative analysis, the most appropriate metrics for specific tasks, such as machine translation, text summarization, and dialogue, are identified. The results show that, although automatic metrics are useful, Human Assessment remains indispensable for capturing qualitative aspects such as coherence and fluency. This work contributes to the field by proposing an integrated framework for the evaluation of LLMs that combines automatic and human metrics, and it suggests future lines of research to improve the accuracy and ethics of text generation.
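For readers unfamiliar with the automatic metrics named above, the sketch below illustrates how two of them, Perplexity and sentence-level BLEU, are commonly computed. It assumes Python with NLTK, which the article does not prescribe; it is an illustrative example, not the authors' evaluation code.

```python
# Minimal sketch (not from the article) of two automatic metrics mentioned
# in the abstract: Perplexity and BLEU. NLTK is an assumed dependency.
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Hypothetical per-token log-probabilities produced by an LLM.
print(perplexity([-0.3, -1.2, -0.8, -0.5]))  # lower is better

# BLEU scores a candidate against reference texts via n-gram overlap;
# smoothing avoids zero scores on short sentences.
reference = [["the", "cat", "sits", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]
print(sentence_bleu(reference, candidate,
                    smoothing_function=SmoothingFunction().method1))
```

ROUGE and F1-Score are computed analogously from n-gram or token overlap, while Human Assessment, as the abstract notes, captures qualitative aspects these overlap-based scores miss.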
License

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.