Dynamic Tables: Declarative Pipelines for Batch and Streaming
Batch data pipelines are nothing new or groundbreaking. But legacy streaming solutions often lead to complex and costly data processing and management. Combine that with the different skill sets needed to work with streaming data, and the highly specialized staff to handle it all, streaming has remained out of reach. At Snowflake Summit 2023, we announced the public preview launch of Snowflake Dynamic Tables, a new table type that drastically simplifies continuous data pipelines for transforming both batch and streaming data. Together with Snowpipe Streaming (GA soon), Snowflake removes the boundaries between batch and streaming systems and makes streaming pipelines easier than ever before.
But first, let’s talk about streaming for a second. Streaming as a term is often misunderstood. It is often pigeon-holed to only “low-latency” data applications, where low latency means sub-second. The problem with this approach is that it alienates a long list of use cases that are not “low-latency” for one reason or another but are still better thought of as streaming. For example, having an updated view of retail inventory every 10 minutes is a streaming use case. So is a use case where predictive manufacturing quality information needs to be delivered to quality engineers every minute. Categorizing these use cases as not streaming forces data practitioners to pick a technology first instead of focusing on their business needs. Picking a streaming technology for the sake of it, when it may be overkill for their requirements and costs, benefits no one.
At Snowflake, we always encourage customers to start from their business needs and use that to drive technology decisions. As this recent blog explains, our streaming technologies are designed to meet customers and use cases where they are, rather than pushing them to process data faster than they can use it. That's why we want to redefine streaming as a way of thinking, not a specific latency regime. Streaming is about hiding the complexity of scheduling from the practitioner, so they can focus on deriving business value from the data.
That said, streaming data has been notoriously difficult to work with in the past. There are a few reasons for this: To start, streaming architecture has been wholly separate from batch architectures. This means data engineers were having to stand up and maintain two parallel systems, which means twice the overhead and twice the failure points to deal with. In addition, the majority of enterprise use cases need to integrate batch data with streaming data, which adds an additional step in the pipelining and adds to the end-to-end latency. Second, previous generation streaming systems have a steep learning curve, putting it out of reach for a large portion of the data engineering community. Third, inefficiencies in processing means these streaming solutions are higher cost and do not work economically at scale, and thus most streaming projects end at the proof of concept stages. Finally, having multiple vendors in the critical path makes governance and security only as good as the weakest link in their stack.
If this resonates with you, you will be beyond pleased with Snowflake’s new Dynamic Tables capability, now available in public preview. Over the past couple of months, we have been working with select Snowflake customers to realize their dreams of simplified and easily manageable streaming pipelines, while gathering product feedback to make Dynamic Tables even better.
What makes Dynamic Tables magical?
- Latency is just a single tuning knob: Business requirements for latencies change over time. You may need data processing in a minute during peak season but over the rest of the time, that may be overkill and you may need a simple way to relax latencies to lower spend. With Dynamic Tables you can use the lag parameter, which sets your objective for data freshness in your complex pipelines. With a simple ALTER statement, you can switch your pipeline from delivering data freshness of say six hours to 60 seconds, with no rework required for your pipeline or its dependencies.
- Built-in incremental update support for fast moving data: Incremental updates, or processing only data that has changed (instead of full tables), has been a hard problem for a long time. Customers want built-in incremental maintenance of streaming tables without needing any additional logic, and they want this for all types of SQL functions and transformations. To simplify this, Dynamic Tables automatically applies incremental updates for both batch and streaming data.
- Streaming SQL semantics: The No. 1 pain point for most customers is how complex and effort intensive existing stream processing systems are. The need for specialized skill sets to build and the ongoing maintenance is a non-trivial activity for data engineers using Spark, Flink, or other streaming systems. A key trend we are excited about is making streaming work for the skillset most companies have: SQL. With Snowflake Dynamic Tables, customers can use simple and ubiquitous SQL with powerful stream processing capabilities to enable streaming use cases for a lot more customers without needing stream processing expertise in all the gory details.
But don’t just take our word for it. Here’s a sampling of customer feedback on their experience with Dynamic Tables
Dynamic Tables helped us in removing some of the design complexities around pulling the incremental data using Streams/Task. We see a 90% reduction in the compute usage based on the POC we performed. Looking forward to adding the Dynamic tables as part of data share/Private listing, to further simplify the design complexities. In a second use case, we were able to use dynamic tables to materialize a small subset of data and make it available to consumers in near real-time without the use of streams and complex SQL logic. The dynamic tables also proved to have a significant performance improvement over traditional streams on tables with high churn.”
Mike Siswanto, SVP
Komodo Health’s data platform uses Dynamic Tables, which automate incremental data refresh with low latency to simplify data engineering workloads, along with Snowflake’s architecture to eliminate inefficient and inflexible processes. This reduced the query run time by 350 hours, or just under 15 days per month."
Dineshkumar Munusamy
Let’s talk about use cases next. In addition to the above scenarios, Dynamic Tables are a great fit if you are trying to achieve any of the following use cases:
- Merging CDC data continuously into target tables with low latency
- Generating Slowly Changing Dimension tables
- Continuous data aggregation and summarization from multiple sources
- Filtering data by client in raw tables into separate tables
- Dimensional and data vault modeling, medallion architecture
- Materializing results of complex views for improved performance
The team has been hard at work building new capabilities into Dynamic Tables since we announced the private preview last fall. Below are just some of the improvements we have delivered.
More incrementalization: As promised, we continue to add incremental support for complex SQL features. Recent additions include Common Tables Expressions (CTE), outer joins, and window functions.
Observability: Full observability of pipelines with Dynamic Tables is a top priority for us so you can have confidence in your data operations and address any issues that may impact your businesses. We have added support for Dynamic Table statuses, detailed data freshness metrics, throughput metrics, dependency / DAG management and more! You can use Snowsight to easily keep abreast of your pipeline operations, or information_schema functions for more flexibility. You can even create alerts on refresh statuses to notify you in case there are issues that need your attention.
Developer experience: Snowflake :heart: Data engineers. We continue to invest in quality of life features for you. To start, you can create zero-copy clones of Dynamic Tables to quickly build pipelines or implement development / production deployments. You can also create Dynamic Tables in Python using Snowpark’s DataFrame APIs.
Pipeline controls: Control pipeline operations with powerful features. Suspend refreshes to make large scale changes to pipeline logic or troubleshoot pipeline failures. Trigger ad hoc refreshes to simplify creation of complex pipelines iteratively. We also introduced a new lag feature called “downstream”. With downstream lag, take the guesswork out of setting data freshness for intermediate Dynamic Tables, just declare the lag for the final Dynamic Table based on the data freshness you expect, all Dynamic Tables upstream can be downstream-dependent. That means intermediate Dynamic Tables will only refresh when something downstream requires it to, simplifying development and reducing waste.
Governance: Operate Dynamic Tables securely and confidently with Snowflake-grade governance features. Apply row based access policies or column masking policies on dynamic tables or sources to maintain a high bar for security and governance. Fine-grained privileges allow more control over who can manipulate business critical pipelines.
Expanded query support: Dynamic Tables continually and automatically materialize the results of a query you provide. We have added support for views, date and time functions, UDFs / UDTFs and SELECT * for Dynamic Table queries.
Query evolution: Auto-recover from the inevitable changes to your pipeline. If a simple change happens upstream, Dynamic Tables can absorb such changes without any action from you. We continue to expand this to absorb more upstream changes in an incremental way.
With these powerful features and a deep integration with the rest of Snowflake’s Data Cloud, Dynamic Tables can dramatically simplify data pipelines, both streaming and batch. We are by no means done, we continue to invest in making Snowflake’s Data Cloud the easiest way to create streaming pipelines, and are so thankful to our customers for the feedback that guides our development.
To learn more about Dynamic Tables and Snowflake's streaming pipelines, you can register here to watch the Summit session on-demand: What’s New: Streaming with Snowflake.