Now in Public Preview: Processing Files and Unstructured Data with Snowpark for Python
As announced at Summit, we’ve recently added to Snowpark the ability to process files programmatically, with Python in public preview and Java generally available. With this new Snowpark capability, data engineers and data scientists can process any type of file directly in Snowflake, regardless of whether the files are stored in Snowflake-managed storage or externally, and take advantage of Snowflake’s fast engine with secure access to open source libraries for processing images, video, audio, and more.
“California Air Resources Board has been exploring processing atmospheric data delivered from four different remote locations via instruments that produce netCDF files. Previously, working with these large and complex files would require a unique set of tools, creating data silos. Now, with unstructured data processing natively supported in Snowflake, we can process netCDF file types, thereby unifying our data pipeline. Processing files in a Python UDF and Stored Procedure has piqued the interest of our data scientists and paves the way for automation of new, complex data pipelines.”
Mike Tuck
Why unstructured data?
Historically, organizations have relied heavily on structured tables and semi-structured data like JSON and XML for analytical workloads and applications. However, because legacy technology limitations made processing solutions complex and expensive, the potential of unstructured data for analytics and applications has remained largely untapped. Customers are looking for ways to better utilize unstructured data, which could be in any file format such as doc, pdf, images (jpeg, png), audio (mp3), video, e-mails, or industry-specific formats such as raster data for geospatial, .las files for well log data, DICOM for medical imaging, and many more.
Secure, unified processing with Snowpark
To address these challenges, we’ve added the ability to process unstructured files in Java (generally available), Python (public preview), and Scala (public preview) natively in Snowflake with Snowpark. Using Snowpark runtimes and libraries, you can securely deploy and run Python, Java, and Scala code to build pipelines, ML models, and applications in Snowflake. This enables customers to consolidate their overall data processing and analytics needs across unstructured, semi-structured, and structured data within Snowflake, and to benefit from unified security, governance, and collaboration.
“Our team worked on a use case where sensitive data was submitted in PDF format and manually converted into tabular data for analytics. To improve this process, we knew we could scrape data from these PDFs using Python locally, but we wanted to centralize the processing and storage of this data in Snowflake. Using a Python UDF, we were able to accomplish this task. Snowpark allowed us to reference the unique URL of each PDF that was added to the stream connected to our data lake, which unlocked the ability to process that PDF natively within Snowflake. From there, we were able to easily pull out the relevant data for our work and insert it into a Snowflake table for continued processing and analysis. Not only is this process conveniently centralized in Snowflake, but the sensitive data is kept secure both in motion and at rest.”
U.S. Statewide Public Health Organization
How to process unstructured data with Snowpark
With Snowpark, you can define processing logic in Java, Python, or Scala using a user-defined function (UDF), user-defined table function (UDTF), or stored procedure. Within the Python code, data engineers and data scientists can easily take advantage of the Python ecosystem of open source packages, such as PyPDF2 for PDF processing or OpenPyXL for reading Excel files. These functions can then be called when reading files in an internal stage, an external stage, or S3-compatible storage on-premises or in a private cloud. Files can be streamed as they are read, which helps when ingesting and processing larger files, including files that are terabytes in size.
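As an illustration of the OpenPyXL case, here is a minimal sketch of a UDF that reads the first worksheet of an Excel file and returns its rows. The function name is hypothetical, and it assumes the openpyxl package is available in your environment; cell values are converted to strings purely to keep the sketch simple.
create function excel_read(file_url string)
returns variant
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python', 'openpyxl')
handler = 'excelparser'
AS
$$
from snowflake.snowpark.files import SnowflakeFile
from openpyxl import load_workbook
import io

def excelparser(file_url):
    # Buffer the staged file in memory and open it as a workbook
    file = io.BytesIO(SnowflakeFile.open(file_url, 'rb').read())
    workbook = load_workbook(file, read_only=True)
    sheet = workbook.worksheets[0]
    # Return the cell values of the first worksheet as a list of rows
    return [[None if cell.value is None else str(cell.value) for cell in row]
            for row in sheet.iter_rows()]
$$;
The same pattern applies to other file types and libraries.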
Below is an example that illustrates how to read a PDF file in Python by creating a UDF that imports the SnowflakeFile class, which is required to open the file.
create function pdf_read(file_url string)
returns variant
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python', 'pypdf2')
handler = 'pdfparser'
AS
$$
from snowflake.snowpark.files import SnowflakeFile
from PyPDF2 import PdfReader
import io

def pdfparser(file_url):
    text = ""
    # Open the staged file by its scoped URL and buffer it in memory
    file = io.BytesIO(SnowflakeFile.open(file_url, 'rb').read())
    reader = PdfReader(file, strict=False)
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text
$$;
Once the above function is registered, other users, such as an analyst, can call the function as part of a SQL query (like below) to extract the contents of a PDF file.
select pdf_read(build_scoped_file_url('@stage', 'sample.pdf'));
As you can see in the above example, files can be passed by reference dynamically at run time; you don’t need to statically declare the file reference up front. This makes it easy to parameterize your data engineering pipeline, where any new files added to your storage account (or stage) can trigger a task that processes the files and writes the results into tables, as sketched below.
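One possible sketch of such a pipeline is shown here, assuming the stage has a directory table enabled and a target table already exists; the stream, task, warehouse, and table names are placeholders.
-- A stream on the stage's directory table tracks newly added files
create stream pdf_stream on stage my_stage;

-- A task periodically parses any new files and writes the results to a table
create task parse_new_pdfs
    warehouse = my_warehouse
    schedule = '5 MINUTE'
    when system$stream_has_data('PDF_STREAM')
as
    insert into parsed_pdfs (relative_path, parsed_text)
    select relative_path
        , pdf_read(build_scoped_file_url('@my_stage', relative_path))
    from pdf_stream
    where metadata$action = 'INSERT';

alter task parse_new_pdfs resume;
Note that the directory table needs to be refreshed (automatically or manually) for new files to become visible to the stream.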
To process all the files in the stage, you can run a SQL query like below, where multiple files can be processed at once using the power and elasticity of Snowflake’s engine.
select
relative_path
, file_url
, pdf_read(build_scoped_file_url('@my_stage', relative_path)) as parsed_text
from directory(@my_stage);
If you have defined a UDTF, then you can run a query like below to process multiple files at once.
select
t.*
from directory(@my_stage) d
, table(pdf_read_udtf(build_scoped_file_url('@my_stage', relative_path))) t;
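For reference, a UDTF like pdf_read_udtf could be defined along the following lines. This is a minimal sketch that emits one row per PDF page; the handler class name and output column are illustrative.
create function pdf_read_udtf(file_url string)
returns table (page_text string)
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python', 'pypdf2')
handler = 'PdfPages'
AS
$$
from snowflake.snowpark.files import SnowflakeFile
from PyPDF2 import PdfReader
import io

class PdfPages:
    def process(self, file_url):
        # Open the scoped file URL and emit one row per page of the PDF
        file = io.BytesIO(SnowflakeFile.open(file_url, 'rb').read())
        reader = PdfReader(file, strict=False)
        for page in reader.pages:
            yield (page.extract_text(),)
$$;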
Both UDFs and UDTFs take advantage of parallelism: multiple files can be read in parallel by the underlying warehouse. Depending on the number of files, you can size your warehouse accordingly to process your files concurrently.
The output of UDFs and UDTFs can be stored as columns in a table for further analysis and file retrieval. The output can be flattened or nested, which can be done easily with Snowflake's native support for semi-structured data types.
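For example, here is a minimal sketch of persisting the parsed output into a table, and of flattening it in the case where your function returns a nested variant (such as an object of attribute/value pairs) rather than plain text; the table and column names are placeholders.
-- Persist the parsed output alongside the file path for later retrieval
create table pdf_output as
select
    relative_path
    , file_url
    , pdf_read(build_scoped_file_url('@my_stage', relative_path)) as parsed_text
from directory(@my_stage);

-- If the function returns a nested variant rather than plain text,
-- flatten it into rows with Snowflake's semi-structured functions
select p.relative_path, f.key, f.value
from pdf_output p
    , lateral flatten(input => p.parsed_text) f;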
Loading custom ML models and inferencing
Perhaps you have use cases where you’d like to use a custom machine learning model to perform inference on your batch of files. For example, you have trained a model to perform image classification or object detection, and you’d like to use that model to perform inference, extract attributes, and generate metadata in pipelines. For use cases like this, you can dynamically load model files as part of the function or procedure logic for model inferencing like below.
with SnowflakeFile.open(
    model_file,
    'rb',
    require_scoped_url = False
) as f:
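Putting this together, an inference UDF might look roughly like the sketch below. It assumes a scikit-learn model that the UDF owner saved with joblib to a stage they can read; the function name, stage path, package list, and featurization step are all placeholders for your own logic.
create function classify_file(file_url string)
returns string
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python', 'scikit-learn', 'joblib', 'numpy')
handler = 'classify'
AS
$$
from snowflake.snowpark.files import SnowflakeFile
import joblib
import io
import numpy as np

# Hypothetical stage path to a model the UDF owner saved with joblib
MODEL_FILE = '@models/classifier.joblib'
_model = None

def classify(file_url):
    global _model
    if _model is None:
        # The owner reads their own model file, so no scoped URL is required
        with SnowflakeFile.open(MODEL_FILE, 'rb', require_scoped_url=False) as f:
            _model = joblib.load(io.BytesIO(f.read()))
    # Read the input file passed by the caller via a scoped URL
    raw = SnowflakeFile.open(file_url, 'rb').read()
    # Placeholder featurization for illustration only; replace with real preprocessing
    features = np.frombuffer(raw[:64], dtype=np.uint8).reshape(1, -1)
    return str(_model.predict(features)[0])
$$;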
Security and governance during file processing
Because file processing builds on Snowflake’s existing support for unstructured data, it relies on the same file access mechanisms, thereby maintaining consistent security and governance for both unstructured data storage and processing. Files read with a function or procedure are resolved in the function or procedure owner’s context. Since the owner and the caller of the function can be two different personas, we require the file to be referenced through a scoped URL, which is only valid within the same query.
For example, a developer or data engineer creates a UDF, dicom_process (making them the UDF owner), to process DICOM image files, extracting the patient name, doctor’s name, manufacturer, and other information. An analyst needs to use that UDF (making them the caller) to analyze 100,000 or more DICOM images on a daily basis, extracting measurements, calculating statistics, and pulling out relevant metadata such as patient demographics, acquisition parameters, and study information. The analyst has read access to the DICOM image files and can create a scoped URL for each file using build_scoped_file_url. The scoped URL can then be passed as an argument to the UDF, as shown in the query below.
select dicom_process(build_scoped_file_url('@stage', 'sample.dcm'));
The scoped URL is only valid within the same query, and the file is accessible to the owner of the function. As such, the Python code below will run in the UDF owner’s context to open the file, and it will work because the scoped URL is used within the same query.
with SnowflakeFile.open(file_url, 'rb') as f:
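To make the example concrete, a minimal sketch of what dicom_process could look like is shown below. It assumes the pydicom package is available to the function, and the extracted attributes are illustrative.
create function dicom_process(file_url string)
returns variant
language python
runtime_version = '3.8'
packages = ('snowflake-snowpark-python', 'pydicom')
handler = 'process_dicom'
AS
$$
from snowflake.snowpark.files import SnowflakeFile
from pydicom import dcmread
import io

def process_dicom(file_url):
    # Open the scoped file URL passed by the caller and parse the DICOM dataset
    file = io.BytesIO(SnowflakeFile.open(file_url, 'rb').read())
    ds = dcmread(file)
    # Return a few illustrative attributes as a variant object
    return {
        'patient_name': str(ds.get('PatientName', '')),
        'referring_physician': str(ds.get('ReferringPhysicianName', '')),
        'manufacturer': str(ds.get('Manufacturer', '')),
    }
$$;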
There could be a scenario where the UDF owner needs to access their own files, for example to load a configuration file or a machine learning model as part of the function. In that case, the UDF owner can simply pass the keyword argument require_scoped_url = False to SnowflakeFile.open to read their own file.
config_file = '@stage/config_file'
with SnowflakeFile.open(
    config_file,
    'rb',
    require_scoped_url = False
) as f:
The same concept applies to UDTFs and stored procedures as well. Please refer to the documentation for more information.
How to get started
You can get started with unstructured data processing by following the usage instructions in our documentation and quickstart guides, which include step-by-step setup instructions:
- Getting Started with Unstructured Data
- Analyze PDF Invoices using Java UDF and Snowsight
- Extract Attributes from DICOM Files using a Java UDF
- Processing HL7 V2 Messages with Snowflake
We’re continuously looking for ways to improve, so if you have any questions or feedback about the product, make sure to let us know in the Snowflake Forums.