Table formats such as Apache Iceberg are part of what makes data lakes and data mesh strategies fast and effective solutions for querying data at scale. Choosing the right table format allows organizations to realize the full potential of their data by providing performance, interoperability, and ease of use. The Iceberg table format stands out among open-source alternatives: it is agnostic to both the processing engine and the file format, and it is developed as a highly collaborative, transparent open-source project. In this post, we cover the benefits of using an open table format for analyzing large data sets and why Iceberg has quickly become one of the most popular open-source table formats.
Data lakes are ideal for storing massive amounts of semi-structured and unstructured data in native file formats. They provide organizations with a comprehensive way to explore, refine, and analyze petabytes of information constantly arriving from multiple data sources.
But the individual files in a data lake don’t contain the information that query engines and other applications need for effective pruning, time travel, schema evolution, and more. As a result, these management tasks are difficult and time-consuming to perform. Table formats solve these issues by providing metadata that enables capabilities similar to those offered by SQL tables in a traditional relational database. They explicitly define a table, its schema, its history, and each file that makes up the table. In addition, table formats such as Iceberg ensure ACID compliance, allowing multiple applications to safely work on the same data simultaneously.
Iceberg is an open-source table format that was originally developed by Netflix to address various challenges encountered with Apache Hive. After its initial development, Netflix donated Iceberg to the Apache Software Foundation in 2018 as a completely open-source, openly managed project. It remedies many of the shortcomings of its predecessor and has quickly become one of the most popular open-source table formats.
The Iceberg table format offers many features to help power your data lake architecture.
Expressive SQL
Iceberg supports flexible SQL commands, making it possible to update existing rows, merge new data, and perform targeted deletes. Iceberg can also rewrite data files to improve read performance and use delete deltas to speed up updates.
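To make the trade-off between delete deltas and rewritten data files concrete, here is a minimal sketch in plain Python. It is a toy model, not the Iceberg API: the function names, the row values, and the in-memory "file" are all assumptions made for illustration.

```python
# Toy model of delete deltas; NOT the Iceberg API. All names are hypothetical.
data_file = ["order-1", "order-2", "order-3", "order-4"]  # treated as immutable
no_deletes = frozenset()

def delete_where(rows, deletes, predicate):
    """Record matching row positions in a delta instead of rewriting the file."""
    matched = {pos for pos, row in enumerate(rows) if predicate(row)}
    return frozenset(deletes | matched)

def read(rows, deletes):
    """Readers merge the data file with the delete delta on the fly."""
    return [row for pos, row in enumerate(rows) if pos not in deletes]

def compact(rows, deletes):
    """Rewrite the file without deleted rows so later reads skip the merge."""
    return read(rows, deletes), no_deletes

# Cheap write: only the small delta changes, the data file is untouched.
delta = delete_where(data_file, no_deletes, lambda row: row == "order-2")
rows_before_compaction = read(data_file, delta)

# More expensive rewrite that pays off on subsequent reads.
compacted_file, compacted_delta = compact(data_file, delta)
rows_after_compaction = read(compacted_file, compacted_delta)
```

Both reads return the same three rows. The difference is where the work happens: the delete is fast because it writes only a delta, while compaction moves that cost into a one-time rewrite.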
Schema evolution
Iceberg supports full schema evolution. Schema updates in Iceberg tables change only the metadata, leaving the data files themselves unaffected. Supported changes include adding, dropping, renaming, and reordering columns, as well as promoting column types.
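One way to see why a schema change can be metadata-only is to track columns by a stable ID rather than by name. The sketch below is a toy model under that assumption, with hypothetical column names and data; it does not use any Iceberg library.

```python
# Toy model (NOT the Iceberg API): stored rows key their values by field ID,
# so schema changes touch only the metadata that maps IDs to column names.
data_file = [{1: "a-100", 2: 19.99}, {1: "a-101", 2: 5.00}]  # written once

schema_v1 = {1: "order_id", 2: "amount"}

def read(rows, schema):
    """Project stored rows through a schema; fields absent from a file read as None."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in rows
    ]

# Rename "amount" to "total" and add a "currency" column with a fresh ID.
# Only the schema mapping changes; data_file is never rewritten.
schema_v2 = {**schema_v1, 2: "total", 3: "currency"}

rows_v1 = read(data_file, schema_v1)
rows_v2 = read(data_file, schema_v2)
```

Reading the same stored rows through `schema_v2` yields the renamed `total` column with its original values and `None` for the new `currency` column, without touching the stored data.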
Partition evolution
Partitioning divides a large table into smaller parts by grouping similar rows together, speeding up read and load times for queries that only need to access a portion of the data. In Iceberg, a partition spec can evolve without rewriting data that was written under an earlier spec. The metadata associated with each version of the partition spec is stored separately.
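As a rough illustration of partition evolution, the sketch below keeps two partition specs side by side and lets each data file remember which spec it was written with. The spec IDs, file paths, and helper function are hypothetical, and this is a simplified model rather than Iceberg's actual metadata layout.

```python
from datetime import date

# Hypothetical partition transforms, keyed by spec ID (toy model, not the Iceberg API).
partition_specs = {
    0: lambda d: (d.year, d.month),         # original spec: partition by month
    1: lambda d: (d.year, d.month, d.day),  # evolved spec: partition by day
}

# Each data file records the spec it was written with and its partition value.
# Files written under spec 0 are left as they are after the table moves to spec 1.
data_files = [
    {"path": "f1.parquet", "spec_id": 0, "partition": (2023, 1)},
    {"path": "f2.parquet", "spec_id": 0, "partition": (2023, 2)},
    {"path": "f3.parquet", "spec_id": 1, "partition": (2023, 3, 14)},
    {"path": "f4.parquet", "spec_id": 1, "partition": (2023, 3, 15)},
]

def files_for_day(day):
    """Return paths of files that may hold rows for `day`, using each file's own spec."""
    return [
        f["path"]
        for f in data_files
        if partition_specs[f["spec_id"]](day) == f["partition"]
    ]

old_layout_hit = files_for_day(date(2023, 1, 20))  # matched through the monthly spec
new_layout_hit = files_for_day(date(2023, 3, 15))  # matched through the daily spec
```

The same query function works across both layouts because the filter is evaluated against whichever spec each file was written with.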
Time travel and rollback
Iceberg’s time travel feature makes it possible to run reproducible queries against the same table snapshot and lets users inspect previous changes. Its rollback capability lets users quickly correct errors by resetting a table to a previous state.
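Time travel and rollback both follow from keeping a log of immutable snapshots plus a pointer to the current one. The class below is a minimal sketch of that idea; `ToyTable` and its methods are hypothetical names, not part of any Iceberg library.

```python
class ToyTable:
    """Toy snapshot log illustrating time travel and rollback (NOT the Iceberg API)."""

    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table
        self.current = 0

    def append(self, *files):
        # Each commit records a new immutable file list and advances the pointer.
        self.snapshots.append(self.snapshots[self.current] + files)
        self.current = len(self.snapshots) - 1
        return self.current

    def scan(self, snapshot_id=None):
        # Time travel: read any retained snapshot, defaulting to the current one.
        chosen = self.current if snapshot_id is None else snapshot_id
        return self.snapshots[chosen]

    def rollback(self, snapshot_id):
        # Rollback only moves the pointer; no data files are rewritten.
        self.current = snapshot_id


table = ToyTable()
good = table.append("a.parquet")
bad = table.append("bad.parquet")  # a mistaken write
table.rollback(good)               # walk the table back to the earlier state

current_files = table.scan()       # state after rollback
inspected = table.scan(bad)        # the bad snapshot can still be inspected
```

After the rollback, a default scan sees only `a.parquet`, while the mistaken snapshot remains available for inspection by its ID.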
Transactional consistency
Data stored in a data lake or data mesh architecture is available to multiple independent applications across an organization simultaneously. While this is a significant benefit, it can also come with substantial risks, especially if multiple users are writing to the same data at the same time. However, Iceberg enables ACID transactions at scale, allowing concurrent writers to work in tandem. Support for ACID ensures readers are not affected by partial or uncommitted changes from writers. When a writer commits a change, Iceberg creates a new, immutable version of the table’s data files and metadata.
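A common way to let concurrent writers commit safely is optimistic concurrency: each writer prepares its change against the version it read, and the commit succeeds only if that version is still current. The sketch below models this with an atomic compare-and-swap on a version counter. `ToyCatalog` and `append_with_retry` are hypothetical names, and the model is deliberately much simpler than a real catalog.

```python
import threading

class ToyCatalog:
    """Holds the pointer to a table's current version (toy model, NOT the Iceberg API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._files = ()

    def load(self):
        # Readers always see a complete, committed (version, files) pair.
        with self._lock:
            return self._version, self._files

    def commit(self, base_version, new_files):
        # Atomic compare-and-swap: succeed only if nobody committed since base_version.
        with self._lock:
            if base_version != self._version:
                return False
            self._files = new_files
            self._version += 1
            return True


def append_with_retry(catalog, path, max_attempts=5):
    """Re-read the latest state and retry when another writer committed first."""
    for _ in range(max_attempts):
        base, files = catalog.load()
        if catalog.commit(base, files + (path,)):
            return True
    return False


catalog = ToyCatalog()
base, files = catalog.load()                            # two writers read version 0
first = catalog.commit(base, files + ("w1.parquet",))   # writer 1 commits first
second = catalog.commit(base, files + ("w2.parquet",))  # writer 2 has a stale base
retried = append_with_retry(catalog, "w2.parquet")      # writer 2 retries on fresh state
final_version, final_files = catalog.load()
```

The stale commit is rejected rather than silently overwriting the first writer's change, and the retry lands on top of it, so both files end up in the table.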
Faster querying
Iceberg is designed for use with huge analytical data sets. It offers multiple features designed to increase querying speed and efficiency, including fast scan planning, pruning of metadata files that aren’t needed, and filtering out of data files that don’t contain matching data.
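Filtering out data files that cannot contain matching rows is typically done with per-file column statistics. The sketch below assumes each file's metadata carries a minimum and maximum value for a hypothetical `ts` column; the manifest structure and the `plan_scan` function are illustrative only.

```python
# Toy manifest with per-file min/max statistics for a hypothetical "ts" column.
# This is a simplified illustration, NOT Iceberg's actual manifest format.
manifest = [
    {"path": "f1.parquet", "min_ts": 100, "max_ts": 199},
    {"path": "f2.parquet", "min_ts": 200, "max_ts": 299},
    {"path": "f3.parquet", "min_ts": 300, "max_ts": 399},
    {"path": "f4.parquet", "min_ts": 400, "max_ts": 499},
]

def plan_scan(entries, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return [
        entry["path"]
        for entry in entries
        if entry["max_ts"] >= lo and entry["min_ts"] <= hi
    ]

# A query for 250 <= ts <= 320 only needs to open two of the four files.
files_to_read = plan_scan(manifest, 250, 320)
```

The planner decides which files to skip from metadata alone, so the skipped files are never opened.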
Vibrant community of active users and contributors
Iceberg is one of the Apache Software Foundation’s flagship projects. Its support for multiple processing engines and file formats including Apache Parquet, Apache Avro, and Apache ORC has attracted a diverse group of talented commercial users eager to contribute to its ongoing success.
The Snowflake Data Cloud makes it easy to execute big data workloads using numerous file formats, including Parquet, Avro, ORC, JSON, and XML. While Snowflake’s internal, fully managed table format greatly simplifies storage maintenance tasks such as encryption, transactional consistency, versioning, fail-safe, and time travel, some organizations with regulatory or other constraints either are not able to store all of their data in Snowflake or prefer to store data externally in open formats. Apache Iceberg is currently supported in public preview by the Snowflake Data Cloud with Iceberg Tables. Iceberg Tables combine the performance and familiar query semantics of Snowflake tables with customer-managed cloud storage.
Snowflake users don’t have to contend with common barriers that stand in the way of realizing the true value of their data. Snowflake makes it possible to eliminate siloed data, securely share complex data sets internally and with outside data partners, and run large-scale analytics tasks on massive data sets quickly and efficiently.
See Snowflake’s capabilities for yourself. To give it a test drive, sign up for a free trial.