In the field of large-scale data processing and machine learning, out-of-memory (OOM) errors pose significant challenges due to the massive amounts of data involved. Practitioners address OOM issues in Spark and other distributed computing frameworks using several strategies. Key approaches include:
- partitioning data to ensure that it is distributed and processed across multiple nodes in a cluster
- using lazy evaluation to optimize execution plans
- employing caching mechanisms to reuse intermediate results efficiently
In addition, memory management is fine-tuned. It’s usually controlled by system-exposed knobs like, for example, Spark configuration parameters. You may have used ‘spark.executor.memory,’ ‘spark.executor.memoryOverhead’ and ‘spark.sql.shuffle.partitions,’ which control memory allocation between tasks and storage. These techniques, combined with dynamic resource allocation, help mitigate OOM errors and improve the scalability and performance of big data applications. In this article, we’ll explain how newly introduced dynamic resource allocation strategies in Snowpark alleviate OOM concerns for big data applications.
Instead of requiring customers to fine-tune numerous configuration knobs, our approach at Snowflake is to take a different approach by abstracting the need for extensive tuning away from the user. Snowflake automatically manages resources and optimizes performance under the hood. This article focuses on recent enhancements to reduce the chances of OOM errors in customer workloads.
Tackling OOM issues
In Snowpark jobs, you may encounter OOM issues that manifest as errors, such as ‘Function available memory exhausted’ or a more general error message, such as ‘SQL execution internal error: Processing aborted due to error,’ accompanied by specific error codes.
Previously, we allocated a relatively low memory ceiling to reduce resource contention. This means in the past you may have hit OOM issues much more frequently.
-- To allocate mibs memory in python.
CREATE OR REPLACE FUNCTION allocateMemoryMibs(mibs int)
RETURNS int
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'run'
AS
'
def run(mibs):
data = bytearray(int(mibs) * 1024 * 1024)
return mibs;
';
-- Before this improvement, allocating more than 5GB memory will hit 'Function available memory exhausted'.
select allocateMemoryMibs(6000);
‘Function available memory exhausted’ indicates that the job has used up all available memory on at least one node in the cluster. General guidance on how to troubleshoot this problem can be found in the Python UDF troubleshooting guide. ‘SQL execution internal error’ is a more general error message that serves as an umbrella notification, suggesting a Snowflake support expert should be consulted. The process of debugging OOM issues may negatively impact the customer’s production workload, can be time-consuming and may incur additional costs due to random failures. We’ve made some major improvements in this area and the rest of the article will focus on how we’ve made this easier.
Enhanced managed scheduling for Snowpark
OOM issues typically arise from two primary sources:
- A single job requires more memory than the cluster can provide
- An excessive number of jobs are scheduled on the same node, leading to resource contention
In other services and platforms, the customer must tune memory settings for each job and decide when to schedule. Snowflake, on the other hand, manages a warehouse scheduler so that customers don’t need to worry about when to emit each job. The warehouse scheduler automatically schedules jobs based on their requirements and ensures they run at optimal capacity, alleviating the need for manual intervention.
Memory usage is a crucial factor for schedulers to avoid OOM errors. To do this, we estimate memory needs for upcoming job runs, and then schedule the corresponding jobs to cluster nodes, based on their available capacity due to other concurrently executing jobs. We use a historical memory usage-based prediction algorithm to estimate the memory needs for upcoming runs.
For p-th percentile of a sliding window, given:
- memory_usage_history = M
- Sliding_window_size = k
- percentile = p
By including memory-consumption patterns, the algorithm provides more accurate memory estimates, enabling the scheduler to allocate resources efficiently and prevent node overloading. We also don’t need to keep the low memory ceiling that we had before to preemptively avoid potential OOMs, therefore granting more total memory for the customer workload on each node. This proactive approach reduces the risk of OOM errors and enhances overall warehouse performance and stability, saving customers from having to open support tickets or retrying costly jobs because of random failures.
While our historical memory usage-based prediction algorithm significantly improves resource allocation, reduces resource contention and then reduces OOM errors, several factors, such as input data size, customer native code and others, can affect the memory-estimation accuracy. We consider these workloads to be unpredictable or volatile in terms of resource consumption, making scheduling based on historical estimates challenging. To address this, we have also employed a query retry mechanism for eligible workloads. This generally leads to a higher success rate, even when memory usage drastically changes, further enhancing reliability and performance.
Looking forward
The enhanced scheduler with a historical memory usage-based prediction algorithm is transparent to customers, so we can allocate resources more efficiently and prevent node overloading. With monitoring warehouse load and more incoming Snowpark metrics in Snowflake Trail, debugging OOM issues will be much easier. We are currently focused on enhancing our scheduling algorithm to improve the first-run success rate, increase memory-prediction accuracy, minimize the need for retries and further optimize performance and stability.