Smarter, Faster and Snowflake-Native: Real-Time Text2SQL Behind Snowflake Intelligence

![Figure 1. Compared to the other models shown here, Arctic-Text2SQL-R1.5 achieves the highest accuracy for Snowflake SQL at enterprise-grade speed. Measurements based on API requests to Cortex AI or third-party providers. [1]](https://www.snowflake.com/adobe/dynamicmedia/deliver/dm-aid--d50e3a9f-592a-44f6-9c3c-85f6f7871487/sfitext2sqlfig1.png?quality=85&preferwebp=true)
Interactive analytics demand real-time responses.
When business users ask questions in Snowflake Intelligence, they expect instant answers. While leading LLMs can generate SQL with reasonable accuracy, their inference latency and API costs can make them impractical for production-scale conversational analytics. Moreover, these models struggle to keep pace with Snowflake's rapid feature releases, requiring extensive prompting to handle dialect-specific constructs like semantic views.
To address these challenges, we developed Arctic-Text2SQL-R1.5, a specialized reasoning model purpose-built for Snowflake SQL.
As shown in Figure 1, Arctic-Text2SQL-R1.5 leads in both accuracy and speed based on internal Snowflake SQL benchmarks designed around some of the most challenging Text2SQL workloads in production, outperforming general-purpose models such as GPT-5, Claude Sonnet 4.5 and Gemini 2.5 Flash. The chart highlights Arctic’s position in the upper-left quadrant — the region representing high accuracy and low latency — demonstrating its efficiency in translating complex analytical questions into executable SQL.
When deployed in Snowflake Intelligence for Text2SQL capabilities, the model matches Claude Sonnet 4.5 in accuracy while delivering significantly lower latency for verified query patterns (up to 3x faster). Combined with our high-performance serving stack, this enables Snowflake Intelligence to provide real-time, reliable answers at enterprise scale.
We achieve this by developing a reasoning recipe for Text2SQL that raises the accuracy of lightweight models, then applying in-house inference optimizations to reduce inference latency.
From SQLite to Snowflake
Our earlier model achieved state-of-the-art results on academic Text2SQL benchmarks, which typically use SQLite. Transitioning to Snowflake SQL required overcoming substantial dialect and operational differences. Snowflake's unique constructs — from VARIANT types to time-travel queries — require deep understanding beyond what general SQL training provides.
We leveraged a two-phase approach:
- Transfer learning across dialects. Foundational reasoning and schema understanding transfer effectively between SQL dialects. We initially trained Arctic on SQLite before fine-tuning it for Snowflake.
- Execution-based reinforcement learning. Using our GRPO (Group Relative Policy Optimization) recipe, we applied execution-based rewards to optimize for real query correctness rather than surface-level similarity.
Previously, running large-scale RL training with SQLite was cumbersome due to concurrency limits and locking. By leveraging Snowflake's multicluster warehouses, we simultaneously executed thousands of test queries, enabling scalable and reliable feedback loops.
This infrastructure advantage allowed us to implement an ambitious curriculum-learning pipeline: We began with a large synthetic corpus to establish broad SQL reasoning capabilities, then fine-tuned on a human-annotated Snowflake data set to capture production nuances. The result: a model that outperforms leading general-purpose LLMs on Snowflake-specific SQL execution accuracy.
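The reward side of this loop is simple to state. Below is a minimal, illustrative sketch of an execution-based reward with concurrent scoring; `run_query` is a hypothetical stand-in for a call to a Snowflake warehouse, and real training compares result sets with more care (types, NULLs, row limits) than this toy multiset comparison.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def execution_reward(run_query, predicted_sql, gold_sql):
    """Reward 1.0 when the predicted query returns the same rows as the
    gold query (order-insensitive), 0.0 on mismatch or execution error."""
    try:
        predicted_rows = run_query(predicted_sql)
    except Exception:
        return 0.0  # non-executable SQL earns no reward
    gold_rows = run_query(gold_sql)
    return 1.0 if Counter(predicted_rows) == Counter(gold_rows) else 0.0

def score_batch(run_query, pairs, max_workers=32):
    """Score many (predicted, gold) pairs concurrently, mirroring how a
    multicluster warehouse absorbs thousands of test queries at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(execution_reward, run_query, p, g)
                   for p, g in pairs]
        return [f.result() for f in futures]
```

Because each reward only needs query results, thousands of such comparisons can run in parallel against a multicluster warehouse, avoiding the concurrency limits and locking that made SQLite cumbersome at this scale.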
Benchmarking and results
We evaluated Arctic-Text2SQL-R1.5 on Snowflake-specific benchmarks designed to capture complex SQL structures and dialect-specific features that academic data sets typically miss. These benchmarks reflect real production patterns — queries grounded in evolving schemas, variant data types and user-driven analytical behaviors.
As shown in Figure 2, Arctic-Text2SQL-R1.5 achieves the highest single-turn execution accuracy on our test set, reaching 45%, compared with 44% for Claude Sonnet 4.5 and 40%–41% for other leading models, such as Gemini 2.5 Flash, Haiku 4.5 and GPT-5 Codex. This demonstrates the model’s strong generalization to Snowflake’s SQL dialect and its ability to generate executable, schema-aware queries.
Together with the latency gains shown earlier in Figure 1, these results highlight Arctic-Text2SQL-R1.5’s ability to deliver both higher accuracy and lower latency, enabling real-time analytics at scale.

Why Arctic performs better on Snowflake SQL
Academic Text2SQL benchmarks rarely include malformed temporal fields, heterogeneous data types or percentile-based summaries. In production data lakes, however, these challenges are common: timestamps arrive as inconsistent strings, numerical fields contain outliers, and schemas evolve over time. The improvements offered by Arctic-Text2SQL-R1.5 are most visible on such messy, real-world data. The temporal-aggregation task shown in Figure 3 is one representative example.

The pattern above illustrates where Arctic-Text2SQL-R1.5 stands out: It selects Snowflake’s error-tolerant conversion functions, applies the correct date-part semantics, and chooses robust aggregates suited for production analytics. Conversion functions such as TRY_TO_TIMESTAMP handle malformed input gracefully, returning NULL instead of failing on dirty rows. Correct date-part semantics, using DATEDIFF rather than manual arithmetic, ensure that time intervals reflect true business logic. And robust aggregates like MEDIAN yield stable, outlier-resistant summaries that align with analytical expectations.
In contrast, general-purpose models often over-engineer or misread the task: They rebuild timestamps from separate DATE and TIME columns, perform raw arithmetic on epoch values, or apply strict parsing that breaks on real data. These patterns lead to SQL that fails to compile or returns unreliable results.
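The contrast is easy to reproduce outside SQL. This toy Python sketch mimics the two semantics that matter here: a TRY_TO_TIMESTAMP-style parser that returns None on malformed input instead of raising, and a MEDIAN-style aggregate that stays stable under outliers (the rows and the format string are invented for illustration):

```python
from datetime import datetime
from statistics import median

def try_to_timestamp(value, fmt="%Y-%m-%d %H:%M:%S"):
    """TRY_TO_TIMESTAMP-like semantics: None on malformed input, never raises."""
    try:
        return datetime.strptime(value, fmt)
    except (ValueError, TypeError):
        return None

# Dirty production-style rows: one malformed timestamp in the middle.
raw = ["2024-05-01 10:00:00", "not-a-date", "2024-05-01 11:30:00"]
parsed = [ts for ts in (try_to_timestamp(v) for v in raw) if ts is not None]
# Strict parsing would have raised on the second row; here it is simply dropped.

durations_minutes = [12, 14, 13, 15, 9_000]  # one extreme outlier
print(len(parsed))                 # 2
print(median(durations_minutes))   # 14: stable despite the 9,000-minute outlier
```

A mean over the same durations would report roughly 1,811 minutes; the median stays at 14, which is why outlier-resistant aggregates align better with analytical expectations on messy data.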
These behaviors explain the benchmark advantage observed earlier — Arctic-Text2SQL-R1.5 consistently converts complex, messy inputs into reliable, executable Snowflake SQL, forming the foundation for real-time intelligence in production environments.
Usage in Snowflake Intelligence
The accuracy and latency improvements of Arctic-Text2SQL-R1.5 translate directly into production value. The model powers the text-to-SQL reasoning capability used by Snowflake Intelligence, where certain natural language questions must be translated into SQL and executed in real time to maintain the interactive experience that business users expect.
The agentic planning and orchestration layer of Snowflake Intelligence leverages models adaptively to balance speed and accuracy:
If a user query closely resembles a previously verified question, the system can bypass deep reasoning, allowing Arctic to deliver near-instant results. Figure 4 shows this path achieving similar accuracy at dramatically lower latency.
For novel or complex queries, the orchestration layer dynamically scales compute resources to support full reasoning depth and preserve accuracy. The use of the Arctic model for this adaptive execution feature is under active development and will be expanded in future versions.
This adaptive mechanism leads to both high throughput and low latency, critical for interactive analytics.
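A simplified picture of that routing decision, with an invented similarity measure, threshold and verified-question cache (the production matching logic is more sophisticated than this sketch):

```python
from difflib import SequenceMatcher

# Hypothetical cache of verified question -> verified SQL.
VERIFIED_QUESTIONS = {
    "total revenue by region last quarter":
        "SELECT region, SUM(revenue) FROM sales GROUP BY region",
}
THRESHOLD = 0.9  # illustrative similarity cutoff

def route(question):
    """Return ('fast_path', sql) for near-duplicates of verified questions,
    else ('full_reasoning', None) to invoke the model's full reasoning depth."""
    for verified, sql in VERIFIED_QUESTIONS.items():
        if SequenceMatcher(None, question.lower(), verified).ratio() >= THRESHOLD:
            return "fast_path", sql
    return "full_reasoning", None

print(route("Total revenue by region last quarter?")[0])  # fast_path
print(route("predict customer churn for next year")[0])   # full_reasoning
```

The fast path skips reasoning-token generation entirely, which is where most of the latency savings for verified query patterns come from.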

Inference optimization under the hood
We combine existing inference optimization techniques with two in-house innovations to serve this model at lower latency (both innovations are available in our open source Arctic Inference library). The first generates several tokens at a time instead of one to reduce latency; the second pools the resources of multiple GPUs to reduce latency without increasing cost.
Diving deeper into these innovations:
- Suffix decoding (paper) is a speculative decoding method that can quickly guess output tokens when they start to repeat and can substantially speed up output-token generation. It fits Arctic-Text2SQL-R1.5 because Text2SQL, like many other coding tasks, exhibits repetitive outputs, such as syntactic structures and standard programming patterns.
- Shift parallelism (paper) adaptively shifts between using sequence parallelism, which offers higher throughput for processing many tokens in parallel (e.g., prompt processing), and tensor parallelism, which achieves lower latency for processing fewer tokens in a batch (e.g., output generation). By doing so, shift parallelism can speed up the end-to-end request latency.
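To make the suffix-decoding intuition concrete, here is a toy version of the guessing step: when the current suffix of the output has occurred earlier in the sequence, speculate that the tokens which followed it before will follow again. The real implementation uses efficient suffix structures and verifies every guess against the model; this sketch only shows why repetitive SQL is such a good fit.

```python
def speculate(tokens, max_suffix=8, max_guess=4):
    """Guess upcoming tokens by matching the current suffix against
    earlier occurrences in the already-generated sequence."""
    for n in range(min(max_suffix, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Look for the same n-gram earlier in the sequence, newest first.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                continuation = tokens[i + n : i + n + max_guess]
                if continuation:
                    return continuation  # speculated tokens, pending verification
    return []

# SQL-like repetition: the second "SUM (" matches the first occurrence,
# so we speculate the tokens that followed it last time.
tokens = "SELECT SUM ( a ) , SUM (".split()
print(speculate(tokens))  # ['a', ')', ',', 'SUM']
```

In practice the guessed tokens are checked against the model in a single verification step, so correct guesses are accepted in bulk while wrong ones cost only one forward pass.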
Using FP8 quantization and these Arctic Inference optimizations, we delivered 1.85x faster output-token generation, resulting in 1.65x faster end-to-end request completion than vanilla vLLM.

Conclusion
Arctic-Text2SQL-R1.5 represents a fundamental shift in how we design and deploy language models for enterprise data platforms. Rather than relying solely on general-purpose LLMs — which remain valuable for broad reasoning tasks — we complement them with specialized models like Arctic that are purpose-built for Snowflake SQL. This approach demonstrates that focused, domain-trained models can achieve both superior accuracy and dramatically lower latency when grounded in execution feedback and platform-specific training.
As Snowflake's feature set evolves and conversational analytics become more sophisticated, Arctic-Text2SQL will continue to advance alongside them, maintaining the tight coupling between model capabilities and platform features that makes real-time intelligence possible.
Stay tuned for upcoming releases.
Contributors: Lukasz Borchmann, Gaurav Nuti, Krzysztof Jankowski, Julita Oltusek, Ye Wang, Aurick Qiao, Jeff Rasley, Samyam Rajbhandari, Zhewei Yao, Yuxiong He
[1] Our model with Cortex AI (4k reasoning-token limit), Together.ai for open source Qwen3-Coder-480B-A35B-Instruct-FP8, OpenAI for gpt-5-2025-08-07 and gpt-5-codex (both with high reasoning effort), Anthropic for claude-sonnet-4-5-20250929 and claude-haiku-4-5-20251001 (4k reasoning-token limit), Google for gemini-2.5-pro and gemini-2.5-flash (dynamic thinking budget), and xAI for grok-4-0709 and grok-code-fast-1.






