The Power of Secure User-Defined Functions for Protecting Shared Data
Many organizations would like to securely link and join their data with data generated by their partners, customers, and industry peers, but they have concerns about protecting personally identifiable information (PII), protected health information (PHI), and other forms of fine-grained data.
Snowflake recently launched a feature called Secure User-Defined Functions (Secure UDFs). Secure UDFs allow Snowflake users to link, join, and analyze fine-grained data with data from other Snowflake users while preventing the other parties from viewing or exporting the raw data.
This form of secure linking and joining moves towards the vision of enabling a global data economy, where the world’s data assets are leveraged together with artificial intelligence (AI) and business intelligence (BI) technologies to improve corporate and industry performance, increase transparency, and address common problems.
Secure Views and Their Limitations
Today, most data sharing in Snowflake uses secure views. Secure views are a great way for a data owner to grant other Snowflake users secure access to select subsets of their data. Secure views are effective for enforcing cell-level security in multi-tenant situations. This includes software-as-a-service (SaaS) providers granting access to each of their customers, while allowing each customer to see only their specific rows of data from each table. However, there is nothing preventing another user from running a SELECT * query against the secure view and then exporting all the data that’s visible to them.
In many situations, allowing a data consumer to see and export the raw data is completely acceptable. However, in other situations, such as when monetizing data, the most valuable analyses are often run against low-level and raw data, and allowing a data consumer to export the raw data is not desirable. Furthermore, when PII and PHI are involved, privacy policies and government regulations often do not permit providing data access to other parties.
The Power of Secure UDFs
Secure UDFs are small pieces of SQL or JavaScript code that securely operate against raw data, but provide only a constrained set of outputs in response to specific inputs. For example, imagine a retailer that wants to allow its suppliers to see which items from other suppliers are commonly sold together with theirs. This is known as market basket analysis.
Using the TCP-DS sample data set that’s available to all users from the Shares tab within Snowflake, we can run the following SQL commands to create a test data set and perform a market basket analysis:
create database if not exists UDF_DEMO;
create schema if not exists UDF_DEMO.PUBLIC;
create or replace table udf_demo.public.sales as
(select * from snowflake_sample_data.TPCDS_SF10TCL.store_sales
sample block (1));
select 6139 as input_item, ss_item_sk as basket_Item,
count(distinct ss_ticket_number) baskets
from udf_demo.public.sales
where ss_ticket_number in (select ss_ticket_number from udf_demo.public.sales where ss_item_sk = 6139)
group by ss_item_sk
order by 3 desc, 2;
INPUT_ITEM BASKET_ITEM BASKETS
6139 6139 1048
6139 7115 405
6139 7114 189
6139 9257 128
6139 9256 102
6139 7116 95
6139 9258 40
6139 23492 37
6139 40008 10
This example returns the items that sold together with item #6139. This example outputs only aggregated data, which is the number of times various other products are sold together, in the same transaction, with item #6139. This SQL statement needs to operate across all of the raw data to find the right subset of transactions. To enable this type of analysis while preventing the user who is performing the analysis from seeing the raw data, we wrap this SQL statement in a secure UDF and add an input parameter to specify the item number we are selecting for market basket analysis, as follows:
create or replace secure function
UDF_DEMO.PUBLIC.get_market_basket(input_item_sk number(38))
returns table (input_item NUMBER(38,0), basket_item_sk NUMBER(38,0),
num_baskets NUMBER(38,0))
as
'select input_item_sk, ss_item_sk basket_Item, count(distinct
ss_ticket_number) baskets
from udf_demo.public.sales
where ss_ticket_number in (select ss_ticket_number from udf_demo.public.sales where ss_item_sk = input_item_sk)
group by ss_item_sk
order by 3 desc, 2';
select * from table(UDF_DEMO.PUBLIC.get_market_basket(6139));
<returns same results as previous example>
We can then call this function and specify any item number as an input, and we will get the same results we received when running the SQL statement directly. Now, we can grant a specified user access to this function while preventing the user from accessing the underlying transactional data.
How to Share Secure UDFs
To share a secure UDF, we can then grant usage rights on the secure UDF to a Snowflake share. This gives other specified Snowflake accounts the ability to run the secure UDF, but does not grant any access rights to the data in the underlying tables.
use database UDF_DEMO;
create share if not exists UDF_DEMO_SHARE;
grant usage on database UDF_DEMO to share UDF_DEMO_SHARE;
grant usage on schema UDF_DEMO.PUBLIC to share UDF_DEMO_SHARE;
grant usage on function UDF_DEMO.PUBLIC.get_market_basket(number) to share UDF_DEMO_SHARE;
alter share UDF_DEMO_SHARE add accounts=<consumer account id>;
If we then log in to the other Snowflake account, we can run the secure UDF from the share using the second account’s virtual warehouse. However, from the second account, we cannot select any data from the underlying tables, determine anything about the names or structures of the underlying tables, or see the code behind the secure UDF.
use role accountadmin;
create database UDF_TEST from share <provider_account>.UDF_DEMO_SHARE;
grant imported privileges on database UDF_TEST to role PUBLIC;
use database UDF_TEST;
select * from table(UDF_TEST.PUBLIC.get_market_basket(6139));
INPUT_ITEM BASKET_ITEM_SK NUM_BASKETS
6139 6139 1048
6139 7115 405
6139 7114 189
6139 9257 128
6139 9256 102
6139 7116 95
6139 9258 40
6139 23492 37
6139 40008 10
describe share jtest2.UDF_DEMO_SHARE;
kind name
DATABASE UDF_TEST
SCHEMA UDF_TEST.PUBLIC
FUNCTION UDF_TEST.PUBLIC."GET_MARKET_BASKET(...)"
The secure UDF is essentially using the data access rights of its creator, but allowing itself to be run by another Snowflake account that has access rights to run it. With Snowflake Data Sharing, the compute processing for secure UDFs runs in the context of, and is paid for by, the data consumer using the consumer’s virtual warehouse, against the function provider’s single encrypted copy of the underlying data.
This ability to share a secure UDF enables a myriad of secure data sharing and data monetization use cases, including the ability to share raw and aggregated data and powerful analytical functions, while also protecting the secure UDF’s code. It also prevents other parties from directly viewing or exporting the underlying encrypted data.
Shared JavaScript Secure UDFs
The previous example is a SQL secure UDF. JavaScript secure UDFs, which are also shareable, operate in a slightly different manner. A shared JavaScript secure UDF is incapable of running any queries, and it cannot perform any other form of external I/O. However, it can perform standard JavaScript operations (without any external libraries), coded by the function’s provider, against data passed in by the function’s consumer. A SQL secure UDF, on the other hand, can run queries, but only against the provider’s data.
JavaScript secure UDFs are useful for data cleansing, address matching, and other data manipulation operations. In the next installment of this multi-part blog post, we’ll include an example of a JavaScript secure UDF for matching up data across two accounts.
More Uses for Secure UDFs
Without sharing the raw data, Secure UDFs are an incredibly powerful tool for providing multiple Snowflake users with the ability to ask certain types of questions by querying data in specific ways.
We look forward to seeing how you use the basic examples provided here as a guide for using secure UDFs in new and creative ways and playing an active role enabling the global data economy.