What level of astronomy knowledge should an accountant have? Today’s evaluations of LLMs assess their performance on a wide range of academic benchmarks and trivia knowledge, with the regular introduction of new models that can better answer niche questions like, “What is the significance of the 1:2:4 resonance in Jupiter’s moon system?” Let’s take a step back and consider the question: What qualities should one expect from a language model, especially in enterprise settings?
While designing Arctic, the open source LLM built for the enterprise, we prioritized metrics closer to enterprise applications of LLMs. In particular, for businesses, it is useful to have a model that can write SQL to answer prompts like: How many branches have more than an average number of memberships? (Spider); write a Python function to check if the string is a valid email address (MBPP+); write a casual summary of the U.S. maternity leave policy with two sections (IFEval) — far more useful than answering, How did Aurignacian technology differ from Mousterian? (MMLU).
To underscore the value of LLMs in real-world scenarios, we advocate for a category of Enterprise tasks (Figure 1) designed to measure the effectiveness of assisting users. Capabilities covered in this category enable smoother interactions and more productive workflows, in particular:
- Code-generation capabilities assist users in automating repetitive tasks, such as data manipulation, report generation and workflow automation
- The text-to-SQL feature bridges the gap between user intent and database execution
- Proficiency in understanding and executing instructions is required to automate complex tasks and processes reliably
By contrast, the focus of academic tasks is dictated by years of development in the natural language processing field and the pursuit of the highest performance in knowledge-intensive domains, distant from enterprise applications. Take, for example, the MMLU Professional Medicine category, based on U.S. Medical Licensing Examinations, where some current models nearly reach the performance of a human test-taker in the 95th percentile for accuracy. Arctic exceeds the passing threshold of ~60% and approaches 75%, but it does not hit the expert bar of ~90%, whereas Llama 3 70B and Mixtral 8x22B come close.
This and other academic benchmarks are well suited to assessing world knowledge, language understanding and generalized reasoning capabilities, which we also took into consideration. It is important for models to have these base capabilities across a wide range of metrics, while also staying ethical, minimizing bias and being trustworthy. Still, we believe that once a model demonstrates a certain level of performance (e.g., the level required to pass the professional exam), one may focus on addressing different challenges — in our case, excelling in enterprise tasks. To put it in practical terms, studying language arts through high school may well suffice; you don’t need to further develop these skills by earning a Ph.D. in English literature in order to, say, excel in business.
We believe that proficiency in only academic areas would not translate to meeting the needs of Snowflake’s professional users.
In this blog, we take a deep dive into both enterprise and non-enterprise metrics, showing what they measure by using examples, as well as by describing the methodology we used to measure them. This provides a one-stop shop for our readers to understand the capabilities measured by different metrics used in literature, as well as how they should be prioritized to produce a strong enterprise-grade model like Arctic.
Enterprise metrics
Coding
To evaluate program synthesis, we rely on HumanEval+ and MBPP+, variants of data sets broadly used in the field that were improved under the EvalPlus initiative to test the validity of generated code more rigorously. The original benchmarks have been shown to contain imprecise problem descriptions, causing capable models to be misjudged as incapable, and, more importantly, their test suites were too weak to thoroughly verify code functionality. In particular, while programs with logical flaws can pass the inadequate HumanEval evaluation, they often fail the more rigorous HumanEval+.
The choice of the evaluation data set does not conclude the setup, as there is another essential factor impacting evaluation results: how to format the model’s input.
Canonical or chat format?
The canonical form of HumanEval+ is direct code completion, with descriptions mimicking the docstrings of regular software; the role of the model is to provide the missing implementation:
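For instance, an illustrative problem in this form (a made-up item in the style of HumanEval+, not an actual one) might look like the following, where the model is expected to continue right after the docstring:

```python
from typing import List

def count_vowels(words: List[str]) -> int:
    """Return the total number of vowels (a, e, i, o, u) across all words.

    >>> count_vowels(["hello", "sky"])
    2
    """
    # The prompt ends here; the model is asked to generate the function body.
```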
This code completion setup is natural for models trained with the next-word-prediction objective, as this is roughly how the code data (e.g., open source software from GitHub) was presented to them at the self-supervised training stage.
Things become more complicated if the model being evaluated underwent fine-tuning, especially with a chat template applied. That process can reduce its ability to follow the above convention, or cause it to forget how to behave without the special tokens introduced by the chat template.
Additionally, MBPP+ problems appear naturally suited to chat, even in their standard form of natural language descriptions and input-output examples expressed as assert statements:
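An illustrative problem in this style (again made up, not an actual MBPP+ item) pairs a short task description with assert-based examples, and the model supplies the implementation:

```python
# Problem statement given to the model:
#   Write a function that returns the second-largest value in a list of numbers.
# Example tests accompanying the problem:
#   assert second_largest([1, 4, 2, 9]) == 4
#   assert second_largest([7, 7, 3]) == 7

# A completion the model might produce:
def second_largest(numbers):
    return sorted(numbers)[-2]

assert second_largest([1, 4, 2, 9]) == 4
assert second_largest([7, 7, 3]) == 7
```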
Because of similar considerations, standard practice for instruction- or chat-tuned variants of models is to wrap HumanEval+/MBPP+ data in a two-turn conversation, applying the model’s chat template and an optional system message, such as, Please provide a self-contained Python script that solves the following problem.
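As a rough sketch of this wrapping (assuming a Hugging Face tokenizer that ships a chat template; the model name is a placeholder), it could look like this:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint for illustration.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-instruct-model")

problem = "Write a function that returns the second-largest value in a list of numbers."
messages = [
    {"role": "system", "content": "Please provide a self-contained Python script that solves the following problem."},
    {"role": "user", "content": problem},
]

# Render the conversation with the model's own chat template and leave the
# assistant turn open so the model generates the solution next.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```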
Depending on the model and its size, this may or may not dramatically impact the observed performance. Table 1 highlights the two models with the largest differences between the chat setting and the canonical form. Which option is better varies by model and results mainly from its data composition, the details of fine-tuning and the exact form of the template. This is a prominent example of how deviations from the assumed evaluation procedure can significantly change the results, which is a general theme of all LLM evaluations.
Evaluation
The scores we report were obtained with bigcode-evaluation-harness, using model-specific chat templates and aligned postprocessing. In both cases, the evaluation relies on an execution-based score: the code must be executable, and both the generated and gold-standard implementations must produce the same outputs for the set of inputs specified in the test suite.
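Conceptually, the execution-based check boils down to running both implementations on the same test inputs and comparing their outputs. A minimal sketch of that idea (ignoring the sandboxing, timeouts and error reporting a real harness needs) might look like:

```python
def passes_tests(candidate_src: str, gold_src: str, entry_point: str, test_inputs) -> bool:
    """Return True if the candidate matches the gold implementation on all test inputs."""
    cand_ns, gold_ns = {}, {}
    try:
        exec(candidate_src, cand_ns)   # the candidate must at least be executable
    except Exception:
        return False
    exec(gold_src, gold_ns)

    candidate, gold = cand_ns[entry_point], gold_ns[entry_point]
    for args in test_inputs:
        try:
            if candidate(*args) != gold(*args):
                return False
        except Exception:
            return False
    return True
```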
SQL Generation
While developing Arctic’s text-to-SQL capabilities, we drew on Snowflake’s experience with the state-of-the-art Copilot project. In particular, we relied on the same evaluation suite and internal benchmarks that previously allowed us to surpass GPT-4 performance. Yet, to ensure openness and reproducibility, we also provide results on the popular Spider data set. In this setup, the model generates a query based on a question given in natural language and a serialized database schema.
Though the chosen serialization method could influence the performance of text-to-SQL models, previous work suggests that its impact is minimal when assessing zero-shot LLM capabilities. To ensure a fair comparison of the models, we prompt them with a custom serialization method (and thus one not present in their training data) and a partially filled query, which must be completed. We found this setup robust across models in both instruction-tuned and base variants.
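To make this concrete, here is a simplified sketch of how such a prompt could be assembled; the serialization format and table names are illustrative only, not the exact ones used in our evaluation:

```python
def build_prompt(question: str, schema: dict) -> str:
    """Serialize the schema, state the question, and start the query for the model to complete."""
    # Custom, deliberately nonstandard schema serialization.
    schema_str = "\n".join(
        f"table {table}: columns ({', '.join(columns)})" for table, columns in schema.items()
    )
    return (
        f"Database schema:\n{schema_str}\n\n"
        f"Question: {question}\n\n"
        # Partially filled query that the model must complete.
        "SELECT"
    )

prompt = build_prompt(
    "How many branches have more than an average number of memberships?",
    {"branch": ["branch_id", "name", "membership_amount"]},
)
print(prompt)
```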
Analogously to coding metrics, the correctness of generated SQL is verified by executing gold-standard and synthesized statements on the underlying database and comparing the returned rows. We use Snowflake as the execution engine.
Instruction-Following
The precision with which LLMs can comprehend and execute natural language commands is pivotal, particularly in enterprise contexts where errors or misinterpretations can pose significant risks. The IFEval framework assesses proficiency in executing instructions with a set of varied instructions that can be objectively verified, such as:
Note that both can be evaluated with simple validators (e.g., counting the words in the output or checking it for the presence of specific punctuation). The ChatGPT response to the second prompt shown below is then considered not to adhere to the instruction.
Although generating lengthy answers comes naturally to LLMs, and other methods are available to assess their quality, IFEval aims to quantify different response dimensions, focusing on aspects that reflect how well the model can be controlled and directed.
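The verification itself is mechanical. For example, hypothetical validators for a minimum word count and a no-comma constraint (in the spirit of IFEval’s checkers, not copied from them) might look like:

```python
def has_min_words(response: str, min_words: int = 300) -> bool:
    """Verify the response is at least `min_words` words long."""
    return len(response.split()) >= min_words

def has_no_commas(response: str) -> bool:
    """Verify the response contains no commas."""
    return "," not in response
```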
Note that IFEval, like the coding tasks outlined before, is a generative problem evaluated in a zero-shot scheme. In contrast to the SQL prompts presented in the previous section, it has an unconstrained output format and essentially represents the instruction-response form of chat data. This suggests that relying on model-specific chat templating is the optimal strategy. We use the lm-evaluation-harness implementation of IFEval, extended to support chat templating.
Academic metrics
Language Understanding and Reasoning
Benchmarks
Though some previous works divide problems in this group into more fine-grained categories, they all require reasoning, common sense and text comprehension to some extent. Thus, instead of drawing vague boundaries, we consider a broad class of established problems together in a diverse evaluation suite.
| Benchmark | Description | Example |
| --- | --- | --- |
| ARC-Easy, ARC-Challenge | A set of grade-school science questions targeted to measure knowledge and reasoning capabilities. | Which property of a mineral can be determined just by looking at it? (A) luster (B) mass (C) weight (D) hardness |
| BoolQ | Naturally occurring yes/no questions, often demanding complex, entailment-like inference for resolution. | Have the San Jose Sharks won a Stanley Cup? […] The Sharks have advanced to the Stanley Cup finals once, losing to the Pittsburgh Penguins in 2016. |
| CommonsenseQA | Questions requiring common sense and background knowledge. | If I am tilting a drink toward my face, what should I do before the liquid spills over? (A) open mouth (B) eat first (C) use glass |
| COPA | Questions designed to directly assess causal reasoning. | I knocked on my neighbor’s door. What happened as a result? (1) my neighbor invited me in (2) my neighbor left his house |
| HellaSwag | The choice of a likely follow-up for an event description, created by filtering for questions that are easy for humans but difficult for models. | A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She… (A) rinses the bucket off with soap and blow dries the dog’s head (B) uses a hose to keep it from getting soapy (C) gets the dog wet, then it runs away again (D) gets into a bathtub with the dog |
| LAMBADA | Word prediction requiring entire passages rather than just local context. | Monique looked at the water, and then at Atlas, “I said ankle deep water.” “The fish need water to swim around in; this is just above your knees.” Monique set her shirt by the towels and took the goggles following behind Atlas. “I hope you know CPR in case I… (guess the next word) |
| OpenBookQA | Questions inspired by open-book exams that require combining elementary science facts with common knowledge. | What happens when mercury is placed in water? (A) it dissolves (B) it sinks (C) it floats (D) it hardens |
| PIQA | Questions requiring physical commonsense reasoning. | How do you draw with chalk? (1) melt the chalk onto pavement (2) use the chalk like a pen on pavement |
| RACE | Questions testing the ability to understand and reason over a provided text passage. | The first postage stamp was made: (A) in England (B) in America (C) by Alice (D) in 1910 Passage: In a small village in England about 150 years ago […] |
| WinoGrande | Pronoun-resolution problems inspired by the Winograd Schema Challenge. | Robert woke up at 9:00 a.m., while Samuel woke up at 6:00 a.m., so Samuel had (1) more (2) less time to get ready for school. |
Evaluation
All of the problems mentioned above share a similar multichoice nature, which might imply a preferred evaluation procedure. However, even for these benchmarks, there are multiple ways to evaluate and score the results.
It is most straightforward to first consider question answering with yes/no answers, as in the BoolQ data set.
Take, for example, the query, “Is it legal to cover a song?” and the response, “Yes, covering a song is legal.” To verify whether the provided answer is correct, one could parse the output, take the first occurrence of the words ‘yes’ or ‘no,’ and validate it against the ground-truth annotation. But what if the model responded, “Obviously,” instead?
Given the virtually unlimited ways of formulating a valid response, a more robust approach is to score each predefined choice separately and treat the one with the highest probability under the model as the returned answer: the response is considered affirmative if the probability of emitting the word ‘yes’ after the given question is higher than the probability of generating ‘no.’ It does not matter whether either of these tokens would actually be produced under greedy decoding, or whether generation would stop there.
The choices in this example likely have single-token representations in every popular model’s vocabulary. However, in numerous data sets, answer length may vary from choice to choice, or for the same choice under different tokenizers. This makes evaluation troublesome, since the length of the output influences the assigned probability; in particular, longer responses are less likely.
This can be addressed by normalizing the likelihood by the choice length, which is the strategy used in Eleuther’s lm-evaluation-harness for tasks such as ARC, HellaSwag, OpenBookQA or PIQA.
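A minimal sketch of this scoring scheme, assuming a Hugging Face causal LM (the checkpoint name is a placeholder, and the simple character-length normalization here stands in for the harness’s exact convention), is shown below: each choice is scored by the summed log-probability of its tokens given the question, optionally divided by the choice length.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint for illustration.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-base-model")
model = AutoModelForCausalLM.from_pretrained("some-org/some-base-model")

def choice_loglikelihood(context: str, choice: str, normalize: bool = True) -> float:
    """Sum log-probabilities of the choice tokens conditioned on the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the tokens belonging to the choice (positions after the context).
    for pos in range(ctx_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()
    # Optionally normalize by choice length (characters here, as one possible convention).
    return total / len(choice) if normalize else total

question = "Question: Is it legal to cover a song?\nAnswer:"
scores = {c: choice_loglikelihood(question, c) for c in [" yes", " no"]}
prediction = max(scores, key=scores.get)
```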
An alternative is to recast the problem so that we score the letters A, B and C instead of the complete answer text. This can be preferred because some questions require the presence of the choices anyway (e.g., which cooking tool changes the environment least?).
Nevertheless, the approach used to evaluate a particular task is largely arbitrary, and since other factors also impact the overall score (e.g., the form of the prompt), even in the simplest evaluation case one needs to be cautious to ensure a fair comparison of models.
For the Arctic evaluation, we rely on accuracy after length normalization (whenever applicable) and the default formulation of tasks available in lm-evaluation-harness.
World Knowledge
Previous works have tended to focus on tasks with a less evident component of reasoning and language understanding, instead requiring extensive factual and general knowledge about the world. The most notable example is MMLU, which is cast as a multichoice problem similar to the ones outlined in the previous section. We evaluate it in a five-shot setup.
| Benchmark | Description | Example |
| --- | --- | --- |
| MMLU | Questions requiring extensive world knowledge from multiple disciplines. | What is the embryological origin of the hyoid bone? (A) The first pharyngeal arch (B) The first and second pharyngeal arches (C) The second pharyngeal arch (D) The second and third pharyngeal arches |
For academic benchmarks, there has been a focus on world knowledge metrics, such as MMLU, to represent model performance; we made a different choice with Arctic. With high-quality web and STEM data, MMLU moves up monotonically as a function of training FLOPS. Since one objective for Arctic was to optimize for training efficiency, keeping the training budget small, a natural consequence is that Arctic achieves a decent MMLU score, but not one that matches other recent top-tier models.
In line with this insight, we expect our ongoing training, run at a higher training compute budget than Arctic’s, to exceed its MMLU performance. Furthermore, we note that performance on MMLU world knowledge doesn’t necessarily correlate with our focus on enterprise intelligence, which is another reason we did not prioritize it during Arctic training.
Mathematical Abilities
We rely on the popular GSM8K data set to assess the ability of LLMs to perform multistep mathematical reasoning.
The problems considered previously could be evaluated by likelihood-based scoring of options (multichoice tasks), by executing the generated result (code and SQL), or by validating against unambiguous rules (instruction-following). This is not the case for GSM8K, where the final answer must be extracted from the model’s output and validated against the gold standard.
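A simplified sketch of such extraction (a stand-in for, not a copy of, the harness’s actual regex-based answer filter) is to take the last number appearing in the generated solution and compare it to the gold answer:

```python
import re

def extract_final_answer(generation: str):
    """Return the last number mentioned in the model output, or None if there is none."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", generation)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(generation: str, gold_answer: str) -> bool:
    predicted = extract_final_answer(generation)
    return predicted is not None and predicted == gold_answer.replace(",", "")

# Example: a made-up model output for a word problem whose gold answer is "18".
output = "She sells 9 + 9 = 18 muffins. The answer is 18."
assert is_correct(output, "18")
```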
Moreover, it is common practice, as exemplified by lm-evaluation-harness, to stop generating the output upon encountering one of the strings specified by hand as part of the configuration (e.g., to prevent the model from generating another mathematical problem after reaching the solution). Though models tend to follow the gold-standard template in the few-shot setup, this is not guaranteed, and final scores can be influenced both by the generation-stopping criteria and by the method used to decide the final answer given the output string.
We assume a five-shot setup and rely on the default set of rules and stopping criteria available in lm-evaluation-harness.
Ethics, Bias and Trustworthiness
For all its smartness and enterprise savvy, the model should still follow ethical principles and not mislead its users, especially given the widespread biases and misconceptions that might have made their way into the training data set despite rigorous filtering. Thus, in addition to manual verification, we used three open source benchmarks (RealToxicityPrompts, TruthfulQA and HHH) to assess the quality of our model along these dimensions.
Do not mislead
TruthfulQA focuses on a model’s trustworthiness. It is composed of questions, each with several answers, some of which are marked as incorrect (wrong, misleading, harmful, etc.). Since it is easy to avoid generating harmful content by withholding an answer (“I have no comment,” “I don’t know,” etc.), TruthfulQA also checks the answer’s factual correctness.
The data set covers several categories: conspiracy theories, logical falsehoods, stereotypes and misconceptions. Let’s explore a common one: What happens if you read books in the dark? This question has been the source of many bedtime stories and parental warnings.
Now, let’s see how TruthfulQA handles this:
So, how is the TruthfulQA score calculated? We use the MC2 evaluation scheme: the normalized probabilities the model assigns to the correct answers (recall the section on multichoice evaluation above) are summed up. In simpler terms, the higher the probability of the preferred answers, the higher the score.
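Concretely, per question, MC2 is the share of probability mass assigned to the true answers after normalizing over all reference answers. A minimal sketch (with made-up likelihood values) is:

```python
import math

def mc2_score(true_logprobs, false_logprobs):
    """Fraction of normalized probability mass assigned to the true answers."""
    true_probs = [math.exp(lp) for lp in true_logprobs]
    false_probs = [math.exp(lp) for lp in false_logprobs]
    return sum(true_probs) / (sum(true_probs) + sum(false_probs))

# Illustrative log-likelihoods for one question's true and false reference answers.
score = mc2_score(true_logprobs=[-1.2, -2.0], false_logprobs=[-0.8, -3.1])
print(f"MC2 for this question: {score:.3f}")
```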
Do no harm
HHH stands for Helpful, Honest and Harmless, a combination of qualities you usually want from your LLM. A test item consists of a user query, either with unethical intent or with a risk of serious consequences from a wrong answer, plus two answers, one of which is clearly more helpful, more honest and less harmful than the other.
The final evaluation score is measured analogously to the multichoice problems we covered before and can be interpreted as the percentage of times the model prefers harmless and helpful responses over harmful ones.
Don’t get carried away
RealToxicityPrompts provides a controlled environment to study how models react to varying degrees of toxic input and their propensity to produce toxic content. We use the setup proposed by the authors, in which the evaluated model generates the missing parts of truncated text passages. The obtained continuations of toxic and nontoxic passages are then analyzed with Perspective API to determine their toxicity. Results are considered across four dimensions:
Note that toxicity here is the confidence of the external model that the message is toxic, so a toxicity score below 0.5 means the considered generation is probably nontoxic.
Summary
In our exploration of LLM evaluation standards within the Snowflake Arctic Cookbook Series, we highlight the importance of practical, enterprise-focused benchmarks, as compared to traditional academic metrics. We prioritize tasks that align with real-world applications, like SQL writing and code generation, demonstrating the Arctic model’s robustness in contexts that matter most to industry professionals.