Hyperparameter optimization (HPO) is a widely used machine learning technique for selecting the best hyperparameter combination for a given model. Popular open source packages such as scikit-learn provide easy-to-use APIs that let users train a model with different hyperparameter combinations and different numbers of cross-validation folds.
Snowflake ML is a set of integrated capabilities that let you carry out a complete machine learning workflow on top of your governed data. You can load data tables from Snowflake, then use the Snowpark ML Modeling APIs for preprocessing, model training and hyperparameter tuning with familiar scikit-learn-style APIs. You can then register your model for inference with the Snowflake Model Registry. The entire process is managed seamlessly, with Snowflake’s engine running behind the scenes.
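To make this concrete, here is a minimal sketch of what a tuning job looks like with the Snowpark ML Modeling API. The connection parameters, table and column names, and parameter grid are placeholders, and the exact keyword arguments may vary by package version:

```python
# Minimal sketch of hyperparameter tuning with the Snowpark ML Modeling API.
# Connection parameters, table/column names and the grid are placeholders.
from snowflake.snowpark import Session
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor

connection_parameters = {"account": "<account>", "user": "<user>", "password": "<password>",
                         "warehouse": "<warehouse>", "database": "<db>", "schema": "<schema>"}
session = Session.builder.configs(connection_parameters).create()

# Load a governed table as a Snowpark DataFrame; the data stays in Snowflake.
train_df = session.table("TRAINING_DATA")

grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid={"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
    cv=3,
    input_cols=["FEATURE_1", "FEATURE_2", "FEATURE_3"],
    label_cols=["TARGET"],
    output_cols=["PREDICTED_TARGET"],
)

# The search itself is pushed down and executed on the Snowflake engine.
grid_search.fit(train_df)
print(grid_search.to_sklearn().best_params_)
```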
Sounds like a good deal, right? Our product documentation on Distributed Preprocessing shows an impressive speed-up for the preprocessing classes. Naturally, you’d expect similar gains in the model tuning phase. However, challenges arise when dealing with larger data sets (exceeding 10 GB) and an extensive hyperparameter search space. So, why is your hyperparameter tuning taking so long? Why does expanding your hyperparameter search space slow down the process? And why might you encounter an out-of-memory error with a 10 GB data set?
Initially, we integrated directly with scikit-learn’s GridSearchCV and RandomizedSearchCV tuning implementations and offloaded the computation to the Snowflake engine. Under the hood, we used a single node (a stored procedure) to execute the tuning job. We chose a stored procedure because the training process requires access to the full data set, and the same pattern can be reused for every modeling function. However, while stored procedures are easier to implement, they are inherently single-node, regardless of the chosen warehouse size. This limitation means that even if a warehouse has eight nodes, only one of them is used for this API request.
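Conceptually, that first design looked something like the following simplified sketch: a Snowpark stored procedure that pulls the full training table onto a single node and runs scikit-learn’s GridSearchCV there. This is an illustration of the idea only, not the actual internal implementation; the table and column names are hypothetical, and the session is the one created in the earlier snippet.

```python
# Simplified illustration of the single-node, stored-procedure approach.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc


@sproc(session=session, packages=["snowflake-snowpark-python", "scikit-learn", "pandas"],
       name="tune_single_node", replace=True)
def tune_single_node(session: Session) -> str:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # The entire training table is pulled onto the single node running this procedure.
    df = session.table("TRAINING_DATA").to_pandas()
    X, y = df.drop(columns=["LABEL"]), df["LABEL"]

    # Every fit in the search runs on this one node, no matter how large the warehouse is.
    search = GridSearchCV(
        RandomForestClassifier(),
        param_grid={"n_estimators": [100, 200], "max_depth": [4, 8, 16]},
        cv=3,
        n_jobs=-1,
    )
    search.fit(X, y)
    return str(search.best_params_)
```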
Ever striving to improve compute efficiency for our customers, we soon started exploring how to leverage all the nodes within the warehouse. If tuning a large data set on one node takes eight hours, can we reduce the tuning time to one hour with eight nodes? To execute the tuning job across multiple nodes, we process each hyperparameter combination in its own user-defined table function (UDTF) worker. Each worker runs a GridSearchCV that accepts only a single hyperparameter combination, with the parallelization backend disabled. The Snowpark UDTF scheduler then handles the distribution of these workers and makes the best use of your warehouse’s resources.
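A heavily simplified sketch of that per-combination worker is below. The handler class, output schema and data-loading helper are illustrative stand-ins rather than our internal code; the key points are that each invocation receives exactly one combination and that GridSearchCV runs with its parallel backend disabled (n_jobs=1).

```python
# Illustrative sketch: one UDTF worker evaluates one hyperparameter combination.
import json

import pandas as pd
from snowflake.snowpark.types import FloatType, StringType, StructField, StructType


def load_training_data() -> pd.DataFrame:
    # Hypothetical placeholder for however the full training set reaches the worker
    # (e.g., a file staged alongside the function).
    return pd.read_parquet("/tmp/training_data.parquet")


class TuneOneCombination:
    def process(self, params_json: str):
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import GridSearchCV

        df = load_training_data()                 # every worker loads the full data set
        X, y = df.drop(columns=["LABEL"]), df["LABEL"]

        params = json.loads(params_json)          # exactly one combination per worker
        search = GridSearchCV(
            RandomForestClassifier(),
            param_grid={k: [v] for k, v in params.items()},  # a one-point "grid"
            cv=3,
            n_jobs=1,                             # parallel backend disabled
        )
        search.fit(X, y)
        yield (params_json, float(search.best_score_))


session.udtf.register(
    TuneOneCombination,
    name="TUNE_ONE_COMBINATION",
    input_types=[StringType()],
    output_schema=StructType([StructField("PARAMS", StringType()),
                              StructField("SCORE", FloatType())]),
    packages=["pandas", "scikit-learn"],
    replace=True,
)
```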
While this approach might seem reasonable, we quickly found that multi-node jobs became memory constrained. Astute readers will already see why this workload becomes memory-bound: since we execute multiple independent ‘fits’ per machine in parallel, we load the training data multiple times into each node. A node scheduling eight workers against a 10 GB data set, for example, needs roughly 80 GB of memory for training data alone. As a result, we observed that the distributed tuning API could only handle data sets up to about 10 GB. What happens above 10 GB? Jobs fail with memory-exhaustion errors like this one:
snowflake.snowpark.exceptions.SnowparkSQLException: (1304): <<Your Query ID>>: 100387 (53200): Function available memory exhausted. Please visit https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-designing.html#memory for help.
To enhance the memory efficiency of Snowflake’s hyperparameter tuning jobs, we adjusted how Snowflake schedules distributed table function workloads, specifically for the hyperparameter tuning APIs in Snowflake ML. Previously, the UDTF scheduler would pack as many workers onto each node as possible; for our tuning use case, though, every UDTF worker needs to load the user’s entire training data set. We therefore modified the hyperparameter tuning scheduler so that each node spawns only one UDTF worker. Furthermore, with the help of cachetools, we cache the result of loading the data set, ensuring each node loads the data only once and saving data-loading time between sequential worker processes. Within each worker, we use joblib for parallel scheduling. The result is a hybrid distribution model: Snowflake ML’s UDTF scheduler distributes work across the nodes in the warehouse, and within each node joblib provides additional parallelization. With this hybrid approach, the training data is loaded into each machine in the warehouse exactly once, with no redundant data replication or repeated data loads.
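Put together, the per-node worker pattern looks roughly like the sketch below: cachetools memoizes the data load so it happens once per worker process, and joblib fans the independent fits out across the node’s cores. Again, this is a hedged illustration of the pattern rather than Snowflake ML’s internal code, and the data path, column names and registration details are placeholders (registration would mirror the earlier UDTF example).

```python
# Illustrative sketch of the hybrid pattern: one UDTF worker per node,
# a cached data load via cachetools, and joblib parallelism within the node.
import json

import cachetools
import pandas as pd
from joblib import Parallel, delayed


@cachetools.cached(cache={})
def load_training_data(path: str) -> pd.DataFrame:
    # Runs once per worker process; subsequent calls hit the in-memory cache.
    return pd.read_parquet(path)   # placeholder for the staged training data


def fit_one(params: dict, X, y):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    return params, cross_val_score(RandomForestClassifier(**params), X, y, cv=3).mean()


class NodeWorker:
    def process(self, data_path: str, params_list_json: str):
        df = load_training_data(data_path)            # loaded once per node
        X, y = df.drop(columns=["LABEL"]), df["LABEL"]

        combinations = json.loads(params_list_json)   # this node's share of the search space
        # joblib distributes the independent fits across the node's CPU cores.
        results = Parallel(n_jobs=-1)(
            delayed(fit_one)(params, X, y) for params in combinations
        )
        for params, score in results:
            yield (json.dumps(params), float(score))
```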
With the improvements to the scheduling and tuning algorithms, we have shown that instead of maxing out at ~10 GB data sets, Snowflake users can run distributed tuning jobs on a Snowpark-Optimized Warehouse with data sets of 100 GB and beyond.
This results in massive improvements in tuning efficiency as we scale up the number of rows (Figure 3). On small data sets you might not notice the difference, but once you hit 10+ million rows, you will see up to a 250% improvement in compute efficiency.
Figure 3. Improvement of HPO efficiency by warehouse and table size with and without HPO memory optimization in both Snowpark-Optimized (top chart) and standard (bottom chart) warehouses.
Conclusion
Implementing distributed hyperparameter optimization has led to significant improvements in model training efficiency, particularly when scaling from smaller warehouses to larger ones, such as the 2X-Large Snowflake warehouse. This approach has allowed us to leverage Snowflake’s full potential, dramatically reducing processing times and enhancing overall performance. Distributed HPO in Snowflake ML is generally available for use today. The success of this implementation wouldn’t have been possible without the dedicated support and involvement of everyone on the team. Thank you all for your contributions and commitment to this project.