Product and Technology

Announcing New Innovations for Data Warehouse, Data Lake, and Data Lakehouse in the Data Cloud

Over the years, the technology landscape for data management has given rise to various architecture patterns, each thoughtfully designed to cater to specific use cases and requirements. These patterns include centralized storage patterns like the data warehouse, data lake, and data lakehouse, as well as distributed patterns such as data mesh. Each of these architectures has its own unique strengths and tradeoffs. And because tools and commercial platforms were historically designed to align with one specific architecture pattern, organizations struggled to adapt to changing business needs – changes that inevitably carry implications for data architecture.

At Snowflake, we don’t believe prescribing a single pattern for every customer serves their interests. Instead, we strive to help customers by providing a platform to build architectures based on what works in their organization, even if that changes over time. With our customers, we’ve seen Conway’s Law – the observation that systems tend to mirror the organizations that build them – play out more often than not. Use cases change, needs change, technology changes – and data infrastructure should be able to scale and evolve along with them. We’re committed to giving customers choice and the ability to adapt, while maintaining our core tenets of strong security and governance, excellent performance, and simplicity.

For example, customers who need a centralized store of data in large volume and variety – including JSON, text files, documents, images, and video – have built their data lake with Snowflake. Many customers with a company-wide repository of tables highly optimized for SQL and for highly concurrent business intelligence and reporting workloads have built a data warehouse on Snowflake. Customers who require a hybrid of the two to support many different tools and languages have built a data lakehouse. And many customers prefer that individual teams, rather than a central data team, own their data and infrastructure while adhering to shared standards – these customers have used Snowflake as the platform for their data mesh.

To keep up with these ever-evolving data management needs, we’re announcing new capabilities that support customers across all of these patterns.

Apache Iceberg for an open data lakehouse

The data lakehouse architecture emerged to combine the scalability and flexibility of data lakes with the governance, schema enforcement, and transactional properties of data warehouses. From the start, the Snowflake platform has been delivered as a service, consisting of optimized storage, elastic multi-cluster compute, and cloud services. Since we first launched in 2015, our table storage has in fact been a fully managed table format implemented on top of object storage – similar to what the market knows today from open source projects such as Apache Iceberg, Apache Hudi, and Delta Lake. Because Snowflake’s table format is fully managed, features like encryption, transactional consistency, versioning, and time travel are delivered automatically.

While many customers value the simplicity of fully managed storage and a single, multi-language and multi-cluster compute engine to power a variety of workloads, some customers would prefer to manage their own storage using open formats, which is why we’ve added support for Apache Iceberg. While there are other open table formats, we see Apache Iceberg as the leading open standard for table formats for many reasons, and therefore are prioritizing support of this format to best serve customers.

Iceberg Tables (in public preview soon) are a single table type that brings Snowflake’s easy management and great performance to data stored externally in an open format. Iceberg Tables also make onboarding easier and cheaper because no upfront ingestion is required. To give customers flexibility in how they fit Snowflake into their architecture, Iceberg Tables can be configured to use either Snowflake or an external service like AWS Glue as the table’s catalog for tracking metadata, with a one-line SQL command to convert the catalog to Snowflake in a metadata-only operation.
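As a concrete illustration, here is a minimal Snowpark for Python sketch of both catalog configurations and the conversion command. Because Iceberg Tables are still in preview, treat the exact DDL as illustrative; the external volume (EXT_VOL), Glue catalog integration (GLUE_CAT_INT), and all credentials are placeholders you would create separately.

    # A minimal sketch, assuming placeholder credentials and a pre-created
    # external volume (EXT_VOL) and Glue catalog integration (GLUE_CAT_INT).
    from snowflake.snowpark import Session

    session = Session.builder.configs({
        "account": "<account>", "user": "<user>", "password": "<password>",
        "warehouse": "<warehouse>", "database": "<db>", "schema": "<schema>",
    }).create()

    # Iceberg Table with Snowflake as the catalog; data files live in the
    # customer-provided bucket behind the external volume.
    session.sql("""
        CREATE ICEBERG TABLE orders_managed (order_id INT, amount NUMBER)
          CATALOG = 'SNOWFLAKE'
          EXTERNAL_VOLUME = 'EXT_VOL'
          BASE_LOCATION = 'orders/'
    """).collect()

    # Iceberg Table that keeps AWS Glue as the catalog, via a catalog integration.
    session.sql("""
        CREATE ICEBERG TABLE orders_glue
          CATALOG = 'GLUE_CAT_INT'
          EXTERNAL_VOLUME = 'EXT_VOL'
          CATALOG_TABLE_NAME = 'orders'
    """).collect()

    # The one-line, metadata-only switch to Snowflake as the catalog.
    session.sql(
        "ALTER ICEBERG TABLE orders_glue CONVERT TO MANAGED BASE_LOCATION = 'orders/'"
    ).collect()

Because the conversion touches only metadata, the Parquet data files already in the bucket are not rewritten.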

Regardless of an Iceberg Table’s catalog configuration, many things remain consistent:

  • Data is stored externally in a customer-provided storage bucket
  • Snowflake’s query performance is, on average, at least 2x better than with External Tables
  • Many other features work, including data sharing, role-based access controls, time travel, Snowpark, object tagging, row access policies, and masking policies

And when Iceberg Tables use Snowflake as the catalog for managing metadata, additional benefits include:

  • Snowflake can perform write operations like INSERT, MERGE, UPDATE and DELETE
  • Automatic storage maintenance operations like compaction, snapshot expiration, and deleting orphan files
  • Optional automatic clustering for faster queries
  • Apache Spark can use Snowflake’s Iceberg Catalog SDK to read Iceberg Tables without requiring any Snowflake compute resources (see the sketch after this list)
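Below is a hedged PySpark sketch of that last point: Spark resolving a Snowflake-cataloged Iceberg Table through the Iceberg Catalog SDK and reading its files directly from object storage, with no Snowflake warehouse involved. The package versions and the jdbc.* property names are assumptions; consult the SDK documentation for the exact configuration keys.

    # A sketch of reading Iceberg Tables from Spark through Snowflake's
    # Iceberg Catalog SDK; versions and property names are assumptions.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("read-iceberg")
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.1,"
                "net.snowflake:snowflake-jdbc:3.14.2")
        .config("spark.sql.catalog.snowflake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.snowflake.catalog-impl",
                "org.apache.iceberg.snowflake.SnowflakeCatalog")
        .config("spark.sql.catalog.snowflake.uri",
                "jdbc:snowflake://<account>.snowflakecomputing.com")
        .config("spark.sql.catalog.snowflake.jdbc.user", "<user>")
        .config("spark.sql.catalog.snowflake.jdbc.password", "<password>")
        .getOrCreate()
    )

    # Spark resolves the table through Snowflake's catalog, then reads the
    # Iceberg metadata and Parquet files straight from the storage bucket.
    spark.table("snowflake.<db>.<schema>.orders_managed").show()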

Expanded semi-structured and unstructured data support for data lakes

A data lake is an appealing architecture pattern because of an object store’s ability to store virtually any file format, of any schema, at massive scale and relatively low cost. Rather than defining schema upfront, a user can decide which data and schema they need for their use case. Snowflake has long supported semi-structured data types and file formats such as JSON, XML, and Parquet, and more recently added storage and processing of unstructured data such as PDF documents, images, videos, and audio files. Whether files are stored in Snowflake-managed storage (an internal stage) or external object storage (an external stage), we have new features to support these data types and use cases.
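For example, pointing Snowflake at an existing bucket is a matter of creating a stage; enabling a directory table then lets you browse the files with SQL. A minimal Snowpark sketch, with a placeholder bucket URL (the raw_lake stage name is reused in the sketches that follow):

    from snowflake.snowpark import Session

    # Placeholder connection config, same shape as in the earlier sketch.
    session = Session.builder.configs({
        "account": "<account>", "user": "<user>", "password": "<password>",
    }).create()

    # An external stage over a bucket of mixed files, with a directory table
    # enabled so the files can be listed and queried. (A private bucket would
    # also need a STORAGE_INTEGRATION or credentials clause.)
    session.sql("""
        CREATE OR REPLACE STAGE raw_lake
          URL = 's3://<your-bucket>/landing/'
          DIRECTORY = (ENABLE = TRUE)
    """).collect()

    # Browse PDFs, images, JSON, and more without ingesting anything upfront.
    session.sql(
        "SELECT relative_path, size, last_modified FROM DIRECTORY(@raw_lake)"
    ).show()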

We’ve expanded our support for semi-structured data with the ability to easily infer the schema of JSON and CSV files in a data lake (generally available soon). The schema of semi-structured data tends to evolve over time: systems that generate data add new columns to accommodate additional information, which requires downstream tables to evolve accordingly. To better support this, we’ve also added table schema evolution (generally available soon).
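Here is a hedged sketch of both features together, reusing the raw_lake stage from above: INFER_SCHEMA reads column names and types from JSON files in the lake, CREATE TABLE ... USING TEMPLATE materializes a table from that inferred schema, and ENABLE_SCHEMA_EVOLUTION lets the table pick up new columns as they appear. Since both features are generally available soon, the exact syntax may still shift.

    from snowflake.snowpark import Session

    session = Session.builder.configs({"account": "<account>", "user": "<user>",
                                       "password": "<password>"}).create()

    session.sql("CREATE OR REPLACE FILE FORMAT json_fmt TYPE = JSON").collect()

    # Inspect the inferred schema of JSON files already sitting in the lake.
    session.sql("""
        SELECT *
        FROM TABLE(INFER_SCHEMA(LOCATION => '@raw_lake/events/',
                                FILE_FORMAT => 'json_fmt'))
    """).show()

    # Create a table from the inferred schema...
    session.sql("""
        CREATE OR REPLACE TABLE events USING TEMPLATE (
          SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
          FROM TABLE(INFER_SCHEMA(LOCATION => '@raw_lake/events/',
                                  FILE_FORMAT => 'json_fmt')))
    """).collect()

    # ...and let it evolve as upstream systems add columns.
    session.sql("ALTER TABLE events SET ENABLE_SCHEMA_EVOLUTION = TRUE").collect()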

For use cases involving files like PDF documents, images, videos, and audio files, you can also now use Snowpark for Python and Scala (generally available) to dynamically process any type of file. Data engineers and data scientists can take advantage of Snowflake’s fast engine with secure access to open source libraries for processing images, video, audio and more.
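As a sketch of what this looks like in practice, the following Snowpark for Python UDF extracts text from the PDFs in the raw_lake stage used earlier. SnowflakeFile and BUILD_SCOPED_FILE_URL are the documented mechanisms for dynamic file access; the choice of PyPDF2, and its availability in Snowflake’s package repository, is an assumption, so substitute any packaged library.

    from snowflake.snowpark import Session
    from snowflake.snowpark.files import SnowflakeFile
    from snowflake.snowpark.functions import udf

    session = Session.builder.configs({"account": "<account>", "user": "<user>",
                                       "password": "<password>"}).create()

    # A UDF that opens any staged file by URL and extracts its text.
    @udf(name="pdf_text", packages=["snowflake-snowpark-python", "pypdf2"],
         replace=True, session=session)
    def pdf_text(file_url: str) -> str:
        from PyPDF2 import PdfReader
        with SnowflakeFile.open(file_url, "rb") as f:
            return "\n".join(page.extract_text() or "" for page in PdfReader(f).pages)

    # Apply it to every PDF the directory table knows about.
    session.sql("""
        SELECT relative_path,
               pdf_text(BUILD_SCOPED_FILE_URL(@raw_lake, relative_path)) AS text
        FROM DIRECTORY(@raw_lake)
        WHERE relative_path ILIKE '%.pdf'
    """).show()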

Faster, more advanced SQL for a data warehouse

SQL is by far the most common language for data warehouse workloads, and we continue to push the boundary of the kinds of computation that can be accomplished with SQL. For example, with new support for AS OF JOINs (in private preview soon), data analysts can write much simpler queries that combine time series data. Such queries are common in financial services, IoT, and feature engineering, where joins on timestamps aren’t exact matches but are instead approximated by the nearest preceding or following record. We’re also improving our support for advanced analytics in Snowflake by increasing the file size limit for loading (in private preview soon): you can now load large objects (up to 128 MB in size), which are often needed in use cases involving natural language processing, image analysis, and sentiment analysis.
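As an illustration, here is what an AS OF JOIN over trades and quotes might look like. Since the feature is in private preview, the keywords shown (ASOF JOIN with a MATCH_CONDITION clause) are an assumption about the eventual syntax, and the trades and quotes tables are hypothetical.

    from snowflake.snowpark import Session

    session = Session.builder.configs({"account": "<account>", "user": "<user>",
                                       "password": "<password>"}).create()

    # For each trade, pick the most recent quote at or before the trade time,
    # per symbol -- no self-join with window functions required.
    session.sql("""
        SELECT t.symbol, t.trade_time, t.price, q.bid, q.ask
        FROM trades t
        ASOF JOIN quotes q
          MATCH_CONDITION (t.trade_time >= q.quote_time)
          ON t.symbol = q.symbol
    """).show()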

We remain committed to improving performance and delivering cost savings for customers. With new and improved optimizations, customers will experience better performance and cost savings in many ways:

  • Ad hoc queries for memory-heavy ML use cases are now faster and more cost-effective with Query Acceleration Service for Snowpark-optimized warehouses (generally available)
  • SELECT statements containing ORDER BY and LIMIT clauses are faster, especially on large tables, thanks to top-k pruning (generally available soon; see the sketch after this list)
  • Materialized View maintenance costs are reduced by more than 50% with new warehouse efficiencies (generally available)
  • Queries using non-deterministic functions like ANY_VALUE(), MODE(), and more now benefit from a result cache that boosts performance (generally available). Based on our analysis, certain query patterns saw a 13% reduction in job credits for affected queries
  • INSERT statements are faster with support added in Query Acceleration Service (in private preview)
  • A new function helps estimate both the upfront and ongoing maintenance costs of automatic clustering for a specific table (in private preview; see the sketch after this list)
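Two of the items above lend themselves to a quick sketch. Top-k pruning requires no query changes: an ORDER BY ... LIMIT over a large table simply skips more micro-partitions. For the clustering cost estimate, the function name and signature shown below are assumptions, since the feature is in private preview; the orders table is hypothetical.

    from snowflake.snowpark import Session

    session = Session.builder.configs({"account": "<account>", "user": "<user>",
                                       "password": "<password>"}).create()

    # Benefits from top-k pruning automatically once available: only the
    # micro-partitions that can contain the top 10 amounts are scanned.
    session.sql("""
        SELECT order_id, amount
        FROM orders
        ORDER BY amount DESC
        LIMIT 10
    """).show()

    # Hypothetical call to estimate upfront and ongoing automatic clustering
    # costs for a candidate clustering key on a specific table.
    session.sql(
        "SELECT SYSTEM$ESTIMATE_AUTOMATIC_CLUSTERING_COSTS('orders', '(amount)')"
    ).show()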

Get started

We’re excited for customers to have these new capabilities all in a single platform, allowing them to continue to build and adapt their architecture of choice with the Data Cloud. For any features above in private preview, please contact your Snowflake account manager to apply for access. For public preview or generally available features, please read the release notes and documentation to learn more and get started.

To learn more about how Snowflake supports the architecture patterns described in this blog post, visit our pages for data warehouse, data lake, data lakehouse, and data mesh.

Want to see these features in action? Check out the session from Snowday.

Forward Looking Statements
This press release contains express and implied forward-looking statements, including statements regarding (i) Snowflake’s business strategy, (ii) Snowflake’s products, services, and technology offerings, including those that are under development or not generally available, (iii) market growth, trends, and competitive considerations, and (iv) the integration, interoperability, and availability of Snowflake’s products with and on third-party platforms. These forward-looking statements are subject to a number of risks, uncertainties and assumptions, including but not limited to risks detailed in our filings with the Securities and Exchange Commission. In light of these risks, uncertainties, and assumptions, actual results could differ materially and adversely from those anticipated or implied in the forward-looking statements. These statements speak only as of the date the statements are first made. Except as required by law, Snowflake undertakes no obligation to update the statements in this press release. As a result, you should not rely on any forward-looking statements as predictions of future events.  

Any future product information in this press release is intended to outline general product direction. The actual timing of any product, feature, or functionality that is ultimately made available may be different from what is presented in this press release.  
