Continued Investments in Price Performance and Faster Top-K Queries
The Snowflake AI Data Cloud is an end-to-end platform that supports all types of data, compute, use cases and personas across an entire organization. Because it delivers a single, unified platform for all users, organizations continue to expand their use cases on Snowflake, which makes it all the more important for us to consistently deliver price-performant queries for our customers.
Top-k queries are used to retrieve the highest- or lowest-ranked results, where k is the number of results to be retrieved. In June 2023, we launched a new optimization that improved the performance of long-running top-k queries by an average of 38% and, in some cases, up to 99%. At Snowflake, we understand the importance and time-sensitivity of these kinds of queries. We continued to improve the performance of such queries, and from March 2024 to May 2024 we saw affected customer queries speed up by 12.5% on average, with some improving by as much as 99.8%.
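In SQL, a top-k query typically takes the form ORDER BY ... LIMIT k. Conceptually it is equivalent to keeping only the k smallest (or largest) values, as this small Python sketch with made-up data illustrates:

```python
import heapq

rows = [42, 7, 99, 15, 3, 88, 21]

# A top-k query such as  SELECT x FROM t ORDER BY x LIMIT 3
# conceptually returns the k smallest (or largest) values:
top3_smallest = heapq.nsmallest(3, rows)  # ascending ORDER BY ... LIMIT 3
top3_largest = heapq.nlargest(3, rows)    # descending ORDER BY ... LIMIT 3

print(top3_smallest)  # [3, 7, 15]
print(top3_largest)   # [99, 88, 42]
```

The key property, exploited by the optimizations below, is that the result depends only on a handful of extreme values, so most of the data never needs to be examined.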
How does it work?
As we describe in this blog post, the top-k feature uses runtime information — namely, the current contents of the top-k elements — to skip micro-partitions that are guaranteed not to contribute to the overall result. This saved a substantial number of input/output operations (I/Os) and improved query performance significantly. Our recent optimizations to top-k reduce the number of I/Os even further, speeding queries up again.
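To make the skipping idea concrete, here is a minimal sketch, with a hypothetical partition layout, of how min/max partition metadata combined with the current top-k contents lets a scan for an ascending ORDER BY ... LIMIT k bypass entire partitions (the data, function names and structure are illustrative, not Snowflake's implementation):

```python
import heapq

def top_k_with_skipping(partitions, k):
    """partitions: list of (min_val, max_val, rows) tuples.
    Returns (sorted k smallest values, number of partitions scanned)."""
    heap = []      # max-heap of the current k best values (stored negated)
    scanned = 0
    for p_min, p_max, rows in partitions:
        # Once we hold k values, a partition whose minimum is not below
        # the current k-th best value cannot improve the result: skip it,
        # saving the I/O of loading its rows.
        if len(heap) == k and p_min >= -heap[0]:
            continue
        scanned += 1
        for v in rows:
            if len(heap) < k:
                heapq.heappush(heap, -v)
            elif v < -heap[0]:
                heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap), scanned

parts = [
    (0, 9, [0, 4, 9]),
    (10, 19, [10, 12, 19]),  # skipped: its minimum exceeds the k-th best
    (20, 29, [20, 25, 29]),  # skipped as well
]
result, scanned = top_k_with_skipping(parts, 3)
print(result, scanned)  # [0, 4, 9] 1
```

On a well-clustered table, most partitions fall into the "skipped" case, which is what makes the metadata-only decision described next so valuable.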
Architectural difference: Our first version still required one I/O per partition to retrieve the partition’s metadata before deciding whether it could be skipped. Now, top-k utilizes Snowflake’s advanced metadata layer to embed the relevant metadata directly into the query itself, so deciding whether to skip a partition requires no additional I/Os. The benefit is especially visible on large tables. Say we have a table with 1 million partitions clustered by a timestamp column, and we run a query with an ORDER BY timestamp LIMIT 10 on it. Previously, this would require at least one million I/Os (one per partition), plus the I/Os for the partitions we actually need to scan to construct the result. Now, we only need a handful of I/Os, for the partitions we actually need to load.
Algorithmic perspective: We also implemented a way to process partitions in a smart order, which further reduces the number of I/Os. Before Snowflake starts executing the query, we inspect the metadata of the partitions to determine whether the contents of a given partition are likely to end up in the final result, and Snowflake processes those partitions first. As a result, the initial top-k elements seen shortly after the query starts are already among the largest/smallest rows and will likely remain in the final result. Combining this with the skipping technique described in the June 2023 blog post, we are able to skip even more partitions than a random scan order would allow.
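A toy sketch of why scan order matters, again with hypothetical partitions described by their metadata minimum: visiting the most promising partitions first tightens the top-k threshold early, so the skipping check described above fires far more often (names and layout are illustrative, not Snowflake's actual implementation):

```python
def scanned_partitions(partitions, k, smart_order=False):
    """partitions: list of (min_val, rows). Returns how many partitions
    an ascending ORDER BY ... LIMIT k scan actually has to load."""
    # Smart order: visit partitions with the smallest minimum first.
    order = sorted(partitions, key=lambda p: p[0]) if smart_order else partitions
    best, scanned = [], 0
    for p_min, rows in order:
        if len(best) == k and p_min >= best[-1]:
            continue                    # cannot beat the current k-th value
        scanned += 1
        best = sorted(best + rows)[:k]  # keep the k smallest seen so far
    return scanned

# Unfavorable physical order: the partition holding the smallest
# values happens to come last on disk.
parts = [(30, [30, 35]), (20, [20, 25]), (10, [10, 15]), (0, [0, 5])]
print(scanned_partitions(parts, 2))                    # 4 partitions loaded
print(scanned_partitions(parts, 2, smart_order=True))  # 1 partition loaded
```

In the naive order, every partition improves on the running result and must be loaded; in the metadata-driven order, the first partition already yields the final answer and the rest are skipped.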
With both ideas combined, we can oftentimes answer queries on clustered tables by processing just a handful of partitions, even on very large tables.
Other improvements include support for STRING/VARCHAR/BINARY data types as the ORDER BY key; these appear in roughly 10% of all top-k queries. The number of scanned partitions shown in the query overview is also reduced in cases where top-k pruning was applicable.
Impact of top-k improvements: What it means for you
These new optimizations can significantly improve performance, reduce computational costs and help provide faster insights from large data sets. During the private preview, we observed many ORDER BY LIMIT customer queries go from minutes to seconds, with a sample of impacted queries accelerated by as much as 99.9%. During the rollout of the feature, the execution time of affected queries improved by 12.5% on average.
When breaking up the queries into four groups, depending on their runtime, we can see that longer-running queries show a larger relative speedup than shorter-running queries.
Looking at the absolute reduction, many queries saw their runtime drop by only a small amount, while a small share saw reductions of 1 second or more. This is in line with expectations: many top-k queries on Snowflake already run very quickly, leaving only small margins for absolute reduction.
How to get started
The good news is you don’t need to make any changes. The query optimizer will automatically apply this enhancement to any queries that would benefit from it and will improve their performance. Read how Sygnia, a leading cybersecurity platform, migrated from Elasticsearch to Snowflake and used top-k functionality to deliver faster query results to their customers.
Conclusion
At Snowflake, we're on a continuous quest to enhance performance, with a particular focus on accelerating the core database engine, and we are proud to deliver these performance improvements through our weekly releases. In this blog post, we covered a recently released performance optimization that’s broadly applicable, highly impactful and now generally available to all customers.
To learn how Snowflake measures and prioritizes performance improvements, please read more about the Snowflake Performance Index here. For a list of key performance improvements by year and month, visit Snowflake Documentation.