Large language models (LLMs) are an essential component of many AI applications, especially generative systems. By design, LLMs are probabilistic, using relationships and patterns learned during training to generate new data or predictions. LLM inference is the process of running a trained model on new input to produce an output, drawing on that learned knowledge and the context supplied in the prompt. In this article, we’ll explore how LLMs use inference to convert input prompts into contextual responses. We’ll look at several popular LLM inference performance metrics and share techniques for creating high-performing models that produce relevant, coherent outputs quickly and efficiently.
How LLM Inference Works
LLM inference is the mechanism large language models use to generate human-like responses. Once a generative model receives an input prompt from the user, it draws on knowledge gained during training to predict the most likely tokens in the sequence before decoding those tokens into text outputs. This process takes place in two stages: the prefill and decode phases.
Prefill phase
During the prefill stage, the user’s input is split into tokens, which are then mapped to numerical values the model can understand and work with. In the context of generative LLM inference, tokens represent words or parts of words.
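As a rough illustration, here is what that tokenization step might look like using the Hugging Face transformers library; the "gpt2" model name and the prompt are placeholders, and the exact splits depend on the tokenizer:

```python
from transformers import AutoTokenizer

# "gpt2" is used purely as an illustrative example of a tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "LLM inference turns prompts into tokens."
token_ids = tokenizer.encode(prompt)                   # numerical IDs the model works with
tokens = tokenizer.convert_ids_to_tokens(token_ids)    # the word pieces behind those IDs

print(tokens)      # word and subword pieces, e.g. ['LL', 'M', ...] depending on the tokenizer
print(token_ids)   # the corresponding integer IDs
```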
Decode phase
In the decode phase, the model generates its response one token at a time, sequentially predicting the most likely next token based on the prompt, its prior knowledge and everything it has generated so far. The model repeats this process until it completes its response, then converts the output tokens back into human-readable language.
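A minimal greedy decode loop looks roughly like the sketch below, assuming a PyTorch causal language model from the transformers library ("gpt2" again stands in for any LLM). Real systems add sampling strategies, batching and KV caching on top of this basic loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("LLM inference works by", return_tensors="pt")

# Greedy decoding: repeatedly predict the most likely next token and append it.
for _ in range(20):
    logits = model(input_ids).logits                          # scores over the whole vocabulary
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
    input_ids = torch.cat([input_ids, next_id], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:              # stop at the end-of-sequence token
        break

print(tokenizer.decode(input_ids[0]))                         # convert tokens back into text
```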
LLM Inference Performance Metrics
There are several ways to measure and evaluate LLM inference. As generative large language models become more important in enterprise applications, establishing reliable performance metrics is vital to improving their performance and better understanding their capabilities and limitations.
Latency
Latency measures the total length of time it takes the LLM to generate a response to a user’s prompt. The faster the model, the lower the latency. This LLM inference metric is especially important for real-time applications such as customer service chatbots, language translation, and retail or content recommender systems.
Latency can be broken into two parts: time to first token (TTFT) and time per output token (TPOT). TTFT is how long a user waits before the first piece of the response appears. TPOT is the average time needed to generate each subsequent output token for every user querying the system. Taken together, these two metrics account for the total time needed to generate a complete response.
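Put together, end-to-end latency for a single request can be estimated roughly as TTFT plus TPOT multiplied by the number of generated tokens. The numbers below are hypothetical, and the formula ignores queuing and network overhead:

```python
def estimated_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    # end-to-end latency ≈ time to first token + time per output token * generated tokens
    return ttft_s + tpot_s * output_tokens

# Hypothetical numbers: 200 ms to first token, 50 ms per token, 100-token response.
print(estimated_latency(0.2, 0.05, 100))  # ≈ 5.2 seconds
```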
Throughput
Throughput refers to the number of requests processed or the amount of output generated within a certain period of time. It can be measured in two ways. The first is requests per second, a metric useful for evaluating concurrency. The second is tokens per second, which tracks how many tokens an inference server can generate each second across all users and requests. It is the more popular method for measuring throughput because it isn't dependent on the length of the model's input or output.
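Both views of throughput can be computed from the same serving window, as in this small sketch (all numbers are hypothetical):

```python
# Hypothetical serving window: 30 requests completed in 10 seconds,
# producing 3,600 output tokens in total across all users.
elapsed_s = 10.0
completed_requests = 30
generated_tokens = 3600

requests_per_second = completed_requests / elapsed_s  # 3.0 — useful for evaluating concurrency
tokens_per_second = generated_tokens / elapsed_s       # 360.0 — independent of response length
print(requests_per_second, tokens_per_second)
```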
Techniques for Optimizing LLM Inference
Large language models are highly capable but computationally intensive, making efficient inference a key challenge. Various techniques can optimize the inference process for faster performance. As LLMs become more widely deployed, optimizing inference will be critical for enabling their practical use across different applications and devices. Here are several techniques that can be used for optimization.
KV caching
Key-value (KV) caching is a popular transformer-specific optimization technique that makes LLM inference more computationally efficient. It takes advantage of the fact that each new token attends to the key and value tensors of all the tokens that preceded it. By caching those key and value tensors in GPU memory, KV caching eliminates the need to recompute them as the model generates each new token.
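The idea can be sketched with a toy single-head attention loop in PyTorch. The random vectors below stand in for a real model's hidden states and learned projections; the point is that each step only computes the key and value for the newest token and reuses everything already cached:

```python
import torch

def attend(q, K, V):
    # standard scaled dot-product attention over the cached keys and values
    scores = q @ K.transpose(-1, -2) / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V

d_model = 64
K_cache, V_cache = [], []  # grows by one entry per generated token

for step in range(8):
    x = torch.randn(1, d_model)   # hidden state of the newest token (stand-in)
    q, k, v = x, x, x             # a real model applies learned Q/K/V projections here
    K_cache.append(k)             # cache the new key/value instead of recomputing
    V_cache.append(v)             # all previous ones on every step
    out = attend(q, torch.cat(K_cache, dim=0), torch.cat(V_cache, dim=0))
```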
Batching
Request batching is an easy way to improve throughput, making LLM applications faster and more responsive. When user requests are loaded in batches rather than one at a time, the model parameters don’t need to be loaded as often. But waiting to collect a full batch of inputs before processing them can increase latency, the amount of time a user needs to wait. Continuous, or in-flight, batching helps compensate for this by evicting finished sequences from the batch and allowing new requests to take their place. In-flight batching significantly improves GPU utilization, reducing the amount of time it takes the LLM application to provide a complete response to the user.
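Here is a deliberately simplified, model-free sketch of the scheduling idea: each step, finished sequences leave the batch and waiting requests immediately take their slots. The generate_step function is a stand-in for one decode step of a real model.

```python
import random
from collections import deque

def generate_step(seq):
    # stand-in for one decode step; returns True when the sequence is finished
    seq["generated"] += 1
    return seq["generated"] >= seq["target_len"]

waiting = deque({"id": i, "generated": 0, "target_len": random.randint(3, 8)} for i in range(16))
active, max_batch = [], 4

while waiting or active:
    # refill free batch slots from the waiting queue
    while waiting and len(active) < max_batch:
        active.append(waiting.popleft())
    # run one decode step for every active sequence, evicting those that finish
    active = [seq for seq in active if not generate_step(seq)]
```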
Model parallelization
Distributing an LLM across large clusters of GPUs enables organizations to run larger models more efficiently and handle larger batches of inputs. Parallelization works by partitioning the model, spreading its compute and memory requirements across multiple GPUs or instances.
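The core idea can be sketched with a column-wise (tensor-parallel) split of a single weight matrix. In practice each shard lives on its own GPU and a serving framework handles the cross-device communication; the shapes below are arbitrary examples:

```python
import torch

# One linear layer's weights, split column-wise into two shards.
# In a real deployment each shard would sit on a different GPU.
W = torch.randn(4096, 4096)
W0, W1 = W.chunk(2, dim=1)

x = torch.randn(1, 4096)
y_sharded = torch.cat([x @ W0, x @ W1], dim=1)      # each GPU computes its slice, results are gathered

assert torch.allclose(y_sharded, x @ W, atol=1e-4)  # same output, half the weight memory per device
```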
Model optimization
Model optimization techniques focus on making resource-conserving adjustments to the model itself, boosting performance without sacrificing quality. Adjusting the model weights, for example, reduces how much memory is consumed on each of the GPUs the model runs on. Distillation and quantization are two common examples. Distillation uses a larger LLM to train a smaller one; the end result is a smaller model with inference capabilities similar to the larger one. Quantization is a compression technique that shrinks a model’s size and memory usage by reducing the precision of its weights and activations, producing models that consume fewer resources while delivering comparable results.
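A minimal sketch of the quantization idea is symmetric per-tensor int8 quantization of a weight matrix: store 8-bit integers plus a single scale instead of 32-bit floats. Production libraries use finer-grained, calibrated schemes, but the principle is the same:

```python
import torch

def quantize_int8(w: torch.Tensor):
    # symmetric per-tensor quantization: keep int8 weights plus a single fp32 scale
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024, 1024)        # stand-in for one layer's fp32 weights
q, scale = quantize_int8(w)        # roughly 4x smaller to store than fp32
print("max abs error:", (dequantize(q, scale) - w).abs().max().item())
```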
Model serving
In-flight batching, discussed above, is one example of a model serving optimization. The end goal of model serving techniques is to reduce how much memory is consumed, particularly as the model weights are loaded. Speculative inference is another: it accelerates generation by using a smaller, less resource-intensive draft model to propose speculative tokens several steps ahead. If the speculative tokens match those the larger verification model would have generated, they’re accepted; if not, they’re discarded and the process begins again from the last accepted token.
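The accept/reject loop can be sketched in a greatly simplified, greedy form. The draft_next and target_next functions below are toy stand-ins for the two models, and real speculative decoding uses probabilistic verification rather than exact matching:

```python
# Toy stand-ins for the two models: each maps a token sequence to the next token id.
# The draft model is deliberately "wrong" every fifth step so the reject path is exercised.
def target_next(seq):
    return (seq[-1] * 7 + 3) % 50

def draft_next(seq):
    guess = target_next(seq)
    return guess if len(seq) % 5 else (guess + 1) % 50

def speculative_step(seq, k=4):
    # 1) the cheap draft model proposes k tokens
    draft, proposals = list(seq), []
    for _ in range(k):
        proposals.append(draft_next(draft))
        draft.append(proposals[-1])

    # 2) the target model verifies them: accept the matching prefix, then fall back to its
    #    own prediction at the first mismatch. (A real system scores all k proposals in a
    #    single batched forward pass, which is where the speedup comes from.)
    accepted = list(seq)
    for t in proposals:
        if target_next(accepted) == t:
            accepted.append(t)
        else:
            accepted.append(target_next(accepted))
            break
    return accepted

print(speculative_step([1, 2, 3]))
```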
Snowflake Cortex AI: Gain Instant Access to Industry-Leading LLMs
Cortex AI offers customers instant access to industry-leading LLMs trained by researchers at companies like Mistral, Reka, Meta and Google, including Snowflake Arctic, an open, enterprise-grade model developed by Snowflake.
Since these LLMs are fully hosted and managed by Snowflake, using them requires no setup. Your data stays within Snowflake, giving you the performance, scalability and governance you expect.