Snowflake Launches Unstructured Data Support in Public Preview
From Day 1, Snowflake has supported structured and semi-structured data. Snowflake has provided exceptional performance for those data types and has been a pioneer in processing them. Today, Snowflake is adding support for unstructured data to allow customers to deliver more use cases with a single platform. The support for unstructured data management includes built-in capabilities to store, access, process, manage, govern, and share unstructured data in Snowflake. Now you can get all the benefits of the Snowflake Data Cloud with performance, concurrency, and scale for unstructured data.
To see demos of the functionality described below, watch our webinar on-demand, 7 Ways to Start Using Unstructured Data in Snowflake.
Customer Use Cases
Using unstructured data such as images, videos, audio, and documents has proliferated in the past few years. Many organizations have already adopted Snowflake as their data lake or their central repository for analytics data. As the use of unstructured data in the industry increased, these organizations felt the need to take advantage of unstructured data to remain competitive. They requested that Snowflake build support for unstructured data so they can derive insights from unstructured data and benefit fully from its potential.
Some of the use cases are described below. This is not an exhaustive list by any means, but it is provided to give you an idea of different ways organizations derive value from unstructured data.
- Store PDF documents to extract key/value pairs for the purpose of analytics.
- Store call center recordings (audio files) to derive insights such as sentiment analysis.
- Run machine learning on medical images (DICOM) to derive insights from them for research purposes.
- Store screenshots of documents such as insurance cards or prescription pills and run an optical character recognition process on them to extract text for analytics.
As you can see above, there are a variety of use cases related to unstructured data, and support for them would enable customers to continue using Snowflake as the central repository of data in their data architecture.
“Our data team analyzes images of our products to identify features and suggest similar products. With Snowflake’s support for unstructured data, we were able to simplify our architecture and create a central place to easily query any product data, including product images, orders, web tracking, and more. We are excited about the new use cases this enables for our product image files, and we plan to utilize Snowflake’s data sharing capabilities on this data in the near future.”
— Laurent Vandenbeusch, Data Tech Leader, Tape à l'oeil
“Leveraging semi-structured and unstructured data, and harnessing complex data interrelationships between them, has been a challenge for us for a few years. Using Snowflake’s recent feature release supporting unstructured data, my team has been effectively able to mash up images with structured relational data within Snowflake and Microsoft Azure to address several unique business use cases and build richer insights using data visualization tools such as Tableau.”
—Richie Bachala, Global Data Engineering Manager, Fortune 200 Global Manufacturing Company
Storage and Governance of Unstructured Data
Storing and governing unstructured data is one of the most important tasks for a data architecture administrator. In Snowflake, you can store the unstructured data files in an internal stage or an external stage. Once the files are stored, you can access them easily using a simple GET REST API.
The URL format for the file is https://<account>.snowflakecomputing.com/api/files/<db_name>/<schema_name>/<stage_name>/<file_path>.
Since the Snowflake stages are cloud-agnostic, data engineers can store files and retrieve them in a cloud-agnostic way.
Governance for unstructured files is also performed using simple cloud-agnostic commands. There is no need to manage complex security or governance policies in each cloud provider’s identity and access management (IAM) system. Securing the files can be done using simple GRANT and REVOKE statements, for example:
grant read on @my_stage to role data_analyst;
grant write on @my_stage to role data_engineer;
Accessing Unstructured Data Using File URLs
The unstructured data stored in Snowflake stages can be accessed via file URLs. There are three types of URLs that can be generated for unstructured data files stored in stages. These URLs are generated using file functions.
Pre-signed URLs: As the name suggests, pre-signed URLs are already authenticated. Users can simply download the files using pre-signed URLs.
Stage file URLs: A stage file URL requires a user to be authenticated with Snowflake and requires the user to have read privileges on the stage.
Scoped file URLs: A scoped file URL can be generated for a user to give the user short-lived, scoped access to the file without giving privileges on the stage.
More information about the different types of file URLs can be found in the documentation.
Directory Tables
Blob storage services provided by the public cloud providers have been the default storage for unstructured data files for many years. However, they have many limitations for analytics use cases. Listing files in the blob storage location takes a really long time, and there is only prefix-based search. Customers can't search for *.pdf files in their buckets.
With unstructured data management in Snowflake, we are launching built-in directory tables that provide an up-to-date tabular view on the file catalog in stages. Since the file catalog is now available as a table, users can perform very powerful tasks with exceptional performance. Listing files becomes a simple select * command on the directory table. Users can run queries with complex regex expressions for more-advanced searches of files.
Lastly, users can build a table stream on top of a directory table, which is a powerful building block to create pipelines for processing unstructured data. Table streams will keep track of all new files since the last processing job was executed and, hence, allow users to build a continuous data pipeline to process unstructured data files.
Processing Unstructured Data
As mentioned above, customers can build continuous data pipelines to process unstructured data. The processing of files can be done using external functions. Snowflake supports external functions in AWS, Azure, and GCP clouds. The documentation about external functions can be found here.
Customers can author external functions for various use cases to perform tasks such as the following:
- Extract text from images
- Process images and recognize labels using machine learning services
- Process PDF files and extract key-value pairs
Customers can easily implement a continuous data pipeline to process unstructured data files using directory tables and external functions. Imagine a customer has an external function in Snowflake called extract_pdf_data(file url). This function parses a PDF file and returns a variant that contains key-value pairs extracted from a PDF document.
Now the customer can run a pipeline like the following to continuously execute and process the PDF documents.
------ SETUP ------
-- Create directory table and select files from directory table
create stage my_documents directory = (enable = true)
encryption = (type = 'snowflake_sse');
-- put files in the stage
put file:///<path> @my_documents;
select * from directory(@my_documents);
-- Create a stream on directory table
create stream documents_stream on stage my_documents;
------ CONTINUOUS PROCESS ------
-- Add new files to a stage
put file:///<path> @my_documents;
-- Select new files added to the stage
select file_url from documents_stream;
-- Process new PDF files using a UDF and store variants in document_data table.
create task process_documents
as insert into document_data
select extract_pdf_data(file_url), relative_file_path
from documents_stream;
Snowflake enables data scientists to work with their tool of choice to process unstructured data, which also reduces the complexity of making results easily available for others to use. Here are a few examples:
- Leverage machine learning to extract semi-structured data from PDF documents such as purchase orders, invoices, financial statements, and forms with Impira
- Assigning keywords to images, videos, or documents using Clarifai’s AI-powered automated metadata generation
- Automatically locating and extracting data from audio, video, text, and other unstructured sources with Veritone’s aiWARE ecosystem of ready-to-deploy AI models
- Replicate files and file metadata into Snowflake with Hammerspace
Very soon, we will roll out an improvement to do file processing using Java UDFs in Snowflake. Please stay tuned for any updates related to that!
Securely Sharing Unstructured Data
Data sharing is one of the core value propositions of Snowflake’s Data Cloud. That's why we prioritized supporting the sharing of unstructured data above many other functionalities. Users can create secure views on directory tables and share those secure views with other accounts. This is the best way to share unstructured data files without making physical copies of them or sharing credentials with consumers.
create secure view files_v as
select build_scoped_file_url(@my_stage, relative_file_path)
from directory(@my_stage);
grant select on view files_v to share s;
Key Takeaways
Using the newly launched unstructured data management functionality, customers can now store unstructured data files in Snowflake stages and govern the data using simple RBAC commands. Customers can enable a built-in directory table to do quick listing, searching for the files in their stages. Customers can build continuous data pipelines to process unstructured data files using streams on directory tables and external functions. Lastly, they can share unstructured data using data sharing and Snowflake Data Marketplace.
The documentation for the functionality described in this blog post can be found here.
To see demos of the functionality mentioned in this blog post, watch our webinar on-demand, 7 Ways to Start Using Unstructured Data in Snowflake.