File writes made simple — or so it would seem
Dynamic file reads from Snowpark user-defined functions (UDFs) have been generally available since 2023. File writes were the obvious next step for our engineering team: they unblock common unstructured-data scenarios such as model generation, unstructured data transformation and data transfer. The question was… how?
At the surface level, dynamic file writes might seem trivial. We’ve already got a generally available (GA) read API.
data = SnowflakeFile.open(stage_uri, "r").read()
Adjust that API, add a new mode, and you’ve got writes: Voilà!
SnowflakeFile.open(stage_uri, "w").write(data)
… right?
Unfortunately, nothing in life is easy, and file writes are no exception. This simple approach breaks on several levels.
The first problem is that SELECT statements (of which UDFs may be a part) are read-only! Snowflake stages are considered part of the Snowflake ecosystem; having a read-only operation make changes to a Snowflake stage breaks a basic assumption about SELECT statements.
The second problem is side effects. SELECT statements are intended to be idempotent: Snowflake can retry them freely. But how do you retry a file write to a stage, and how do you roll back those changes if the query doesn’t complete? Side effects are a no-go for SELECT statements.
The third problem: security! Snowpark UDFs run in the owner’s context, meaning that files accessed within the UDF use the permissions of the UDF author’s role. Suppose a malicious UDF author wants to steal data from the caller: inside the UDF, they could simply write every input value to a secret stage that they own. The caller would have no idea!
This sounds bad — do we give up now?
No, fortunately! We have found a simple solution that works for everyone — the UDF author, the UDF caller and SELECT statements. We simply treat files as query data.
What does it mean to treat files as query data? Basically, when you write a file in a Snowpark UDF it gets treated as a query result — living right beside other query results. The UDF returns a reference (specifically, a scoped URL) to the file, so it can be located later. These file results are available for 24 hours, just like normal query results. And just like normal query results, Snowflake foots the bill for storage. However, due to the encryption type (and like normal query results), these files are only accessible via Snowflake — and not via a pre-signed URL.
Does this solve our problem? Well, results are not part of Snowflake state, so SELECT statements are now read-only. There are no side-effects — if the query fails, no files are copied anywhere. And lastly, the result stage is always owned by the query’s owner — so the UDF has no power to steal customer data! Perfect!
You might be wondering: How do you get this file reference to actually access the file? It’s returned by the UDF — or more specifically, the UDF returns a file, and Snowflake converts it to a reference, which is then returned as a normal old-fashioned query result. The result of your query is a reference to a file stored alongside the query results.
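To make that concrete, here is a minimal sketch of a scalar write UDF (the function name and contents are illustrative). It uses the same open_new_result API that appears in the full example later in this post: the handler opens a brand-new file in the query result area, writes to it, and returns the file object, which Snowflake converts into a scoped URL in the query result.

CREATE OR REPLACE FUNCTION write_hello()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER = 'write_hello'
PACKAGES = ('snowflake-snowpark-python')
AS $$
from snowflake.snowpark.files import SnowflakeFile

def write_hello():
    # Open a new file that lives alongside the query results.
    f = SnowflakeFile.open_new_result("w")
    f.write("hello from a UDF")
    # Returning the file causes Snowflake to emit a scoped URL
    # to it as the query result.
    return f
$$;

-- The result of this SELECT is a scoped URL pointing at the new file:
SELECT write_hello();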
OK, sure, but what if 24 hours is not long enough for you? We have a solution for that too. We’ve extended a recent feature, COPY FILES INTO, which allows you to copy files from one location to another; the list of files to copy is generated by an underlying SELECT statement. You can even provide a new file location for each file, if you so desire, or combine it directly with a UDF so that everything happens in one single query.
COPY FILES INTO @my_stage FROM (
  SELECT my_write_udf(input) FROM my_table
);
As an added bonus, COPY FILES INTO can actually copy from any Snowflake location, not just result files.
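For instance (stage names are hypothetical), copying between two regular stages needs no UDF at all, something like:

-- Copy every file from one stage to another:
COPY FILES INTO @backup_stage FROM @source_stage;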
What does this look like in real life?
While this all may sound complicated, it’s fairly straightforward and flexible. Here is an example where we generate a PDF per partition using a UDTF (user-defined table function).
This UDTF takes a partition of data and writes it to a PDF. Then it returns the PDF.
CREATE OR REPLACE FUNCTION create_report_pdf(data string)
RETURNS TABLE (file string)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER='CreateReport'
PACKAGES = ('snowflake-snowpark-python', 'fpdf')
IMPORTS = ('@fonts/DejaVuSans.ttf')
AS $$
from snowflake.snowpark.files import SnowflakeFile
from fpdf import FPDF
import shutil
import sys
import uuid

import_dir = sys._xoptions["snowflake_import_directory"]

class CreateReport:
    def __init__(self):
        self.pdf = None

    def process(self, data):
        if self.pdf is None:
            # The PDF library edits this file, so make sure it's unique.
            font_file = f'/tmp/DejaVuSans-{uuid.uuid4()}.ttf'
            shutil.copy(f'{import_dir}/DejaVuSans.ttf', font_file)
            self.pdf = FPDF()
            self.pdf.add_page()
            self.pdf.add_font('DejaVu', '', font_file, uni=True)
            self.pdf.set_font('DejaVu', '', 14)
        # Append this row's data to the partition's PDF.
        self.pdf.write(8, data)
        self.pdf.ln(8)

    def end_partition(self):
        # Write the finished PDF to a new result file and return it.
        f = SnowflakeFile.open_new_result("wb")
        f.write(self.pdf.output(dest='S').encode('latin-1'))
        yield f,
$$;
We can partition our data by partitionKey, use our UDTF to write the partitioned data to a PDF, and then copy it to a final location: @output_stage/{partitionKey}.pdf
COPY FILES INTO @output_stage
FROM (SELECT reports.file, location || '.pdf'
      FROM reportData, TABLE(create_report_pdf(item)
        OVER (PARTITION BY location)) AS reports);
For the full end-to-end example, you can check out our user guide here.
Conclusion
File writes are much more complex than file reads because of the limitations of SELECT statements. However, we’ve come up with a unique, flexible, and secure solution to support all those great use cases. We just treat files as query data!
This is now available for Python. You can find our documentation here.