Data Lake: A Definition
A data lake is a repository of raw, unprocessed data, stored without a predetermined organization or hierarchy. It allows for the general storage of all types of data, from all sources.
Data lakes typically store massive amounts of raw data in its native format. This data is made available on demand: when a data lake is queried, a subset of the data is selected based on the search criteria and presented for analysis.
Purpose
A data lake gives business users a comprehensive way to explore, refine, and analyze petabytes of information constantly arriving from multiple data sources. One petabyte of data is equivalent to 1 million gigabytes: about 500 billion pages of standard printed text or 58,333 high-definition, two-hour movies.
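To make the scale concrete, the short Python sketch below reproduces the petabyte arithmetic. It assumes decimal (SI) units, and the per-page and per-movie sizes it prints are only what the quoted figures imply, not values stated in this article.

```python
# Decimal (SI) units are assumed: 1 PB = 10**15 bytes, 1 GB = 10**9 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_PB = 10**15

gigabytes_per_petabyte = BYTES_PER_PB // BYTES_PER_GB

# Figures quoted in the text above.
PAGES_PER_PB = 500 * 10**9
MOVIES_PER_PB = 58_333

# Sizes implied by those figures (derived here, not stated in the article).
bytes_per_page = BYTES_PER_PB / PAGES_PER_PB
gb_per_movie = gigabytes_per_petabyte / MOVIES_PER_PB

print(f"{gigabytes_per_petabyte:,} GB per PB")
print(f"{bytes_per_page:,.0f} bytes per page")
print(f"{gb_per_movie:.1f} GB per movie")
```

Under these assumptions, the comparison works out to roughly 2,000 bytes per printed page and about 17 GB per two-hour movie.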
Features
The characteristics of data lakes that distinguish them from other types of big data storage are:
- Open to all data, regardless of type or source
- Data is stored in its original raw, untransformed state
- Data is transformed only when provided for analysis based on matching query criteria
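The three characteristics above can be illustrated with a minimal Python sketch. It is a toy in-memory model, not how any particular data lake product works; the `lake` list, the `read_records` and `query` helpers, and all sample values are hypothetical. Raw payloads are kept exactly as they arrived, and parsing happens only when a query asks for matching records.

```python
import csv
import io
import json

# A toy "lake": raw payloads kept exactly as they arrived, with minimal
# metadata. All names and sample values here are hypothetical.
lake = [
    {"source": "web", "format": "json", "raw": b'{"user": "ana", "amount": "12.50"}'},
    {"source": "pos", "format": "csv", "raw": b"user,amount\nben,7.25\nana,3.00\n"},
    {"source": "iot", "format": "text", "raw": b"sensor-17 temp=21.4"},
]


def read_records(entry):
    """Parse one raw payload into dicts at read time."""
    text = entry["raw"].decode("utf-8")
    if entry["format"] == "json":
        return [json.loads(text)]
    if entry["format"] == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    return []  # formats this query cannot parse are skipped, never deleted


def query(lake, user):
    """Select matching records and transform them only for this query."""
    results = []
    for entry in lake:
        for record in read_records(entry):
            if record.get("user") == user:
                results.append(
                    {"source": entry["source"], "amount": float(record["amount"])}
                )
    return results


ana_purchases = query(lake, "ana")
print(ana_purchases)
```

The stored payloads are never modified: the conversion of `amount` from text to a number exists only in the query result, so a different query is free to interpret the same raw data differently.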
Benefits
The source- and format-agnostic nature of data stored in a data lake offers several benefits for businesses, including:
- Flexibility, as data scientists can quickly and easily configure queries
- Accessibility, as all users can access all data
- Affordability, as many data lake technologies are open source
- Compatibility with most data analytics methods
- Comprehensiveness, as data is combined from all of an enterprise’s data sources, including IoT
Data Lake versus Data Warehouse
Both data lakes and data warehouses are big data repositories. The primary difference between them is in how data is stored. A data warehouse typically stores data in a predetermined organization defined by a schema. A data lake does not have a predetermined schema. Also, whereas a data warehouse usually stores structured data, a data lake stores both structured and unstructured data.
Comparison Chart: Data Lake and Data Warehouse
Attribute | Data Lake | Data Warehouse |
Type of data | Structured and unstructured from any source, raw | Structured, curated |
Schema | Not predetermined | Predetermined |
Typical users | Data scientists, developers, and data analysts | Data analysts |
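The schema difference can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not a description of any real warehouse or lake engine: the `Warehouse` and `Lake` classes, the `WAREHOUSE_SCHEMA` mapping, and the sample items are all hypothetical. The warehouse validates each row against its schema before storing it, while the lake accepts whatever arrives.

```python
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}  # hypothetical schema


class Warehouse:
    """Predetermined schema: rows are validated before they are stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def insert(self, row):
        if not isinstance(row, dict) or set(row) != set(self.schema):
            raise ValueError("row does not match the schema's columns")
        for column, expected_type in self.schema.items():
            if not isinstance(row[column], expected_type):
                raise TypeError(f"{column} must be {expected_type.__name__}")
        self.rows.append(row)


class Lake:
    """No predetermined schema: anything is accepted as-is."""

    def __init__(self):
        self.objects = []

    def put(self, obj):
        self.objects.append(obj)


warehouse = Warehouse(WAREHOUSE_SCHEMA)
data_lake = Lake()

incoming = [
    {"order_id": 1, "amount": 9.99},              # structured, fits the schema
    {"order_id": "2", "note": "free-form text"},  # wrong columns
    b"\x89PNG...raw image bytes",                 # unstructured payload
]

rejected = 0
for item in incoming:
    data_lake.put(item)  # the lake stores everything
    try:
        warehouse.insert(item)  # the warehouse enforces its schema
    except (ValueError, TypeError):
        rejected += 1

print(len(data_lake.objects), len(warehouse.rows), rejected)
```

In this sketch the lake ends up holding all three items, while the warehouse keeps only the one row that matches its schema.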
Data Lakes in the Cloud
The sheer volume of big data, particularly the unfiltered data of a data lake, can make on-premises data storage impractical. Cloud storage services such as Amazon S3 and Microsoft Azure Data Lake, along with open-source frameworks such as Apache Hadoop, enable data storage of varying sizes and speeds for processing and analysis.
Snowflake as Data Lake
Snowflake’s platform provides the benefits of data lakes and the advantages of data warehousing and cloud storage. With Snowflake as your central data repository, your business gains best-in-class performance, relational querying, security, and governance. Alternatively, store your data in cloud storage from Amazon S3 or Azure Data Lake and use Snowflake to accelerate data transformations and analytics.