Share But Don’t Show: Secure Data Collaboration with Data Clean Rooms
We’ve all seen the memes flying around the internet that ask you to reveal your age without actually saying it: “Without saying your age, what is something you remember from your childhood that a younger person would not understand?” Or the one that asks what the connection is between a pencil and a cassette tape (assuming, of course, that people even know what a cassette tape is).
For me, there are certain old advertisements that I remember. I can sing every word of hot dog ads, mayonnaise ads, ads for Shake and Bake, and other flashback foods from the 1970s. Or music that takes me back to a time and place that only those of my generation can remember—mostly sappy love songs that I listened to on repeat in my adolescent years and can still sing word for word many, many years later. It just happened to me on a recent trip to Dubai, when I broke into song in a colleague’s car.
These memes illustrate an age-old dilemma: How do we share information without actually revealing the underlying source? Can you demonstrate that you are older than others without actually revealing your age? Or can you determine which of two people is wealthier without knowing exactly how much each has? That’s Yao’s Millionaires’ problem.
In 1982, Andrew Yao, a computer scientist and computational theorist, posited the millionaires’ dilemma and demonstrated a mathematical solution. Several other academics offered alternative solutions. Yet all of these solutions ultimately require the millionaires to have an extensive knowledge of theoretical mathematics.
While the intelligence of the millionaires is not in doubt, most would not have the inclination to solve these equations. Fortunately, advances in computation and computer programming have delivered tools to vastly facilitate the work. Secure multi-party computation can be done in Snowflake.
Yao’s Millionaires Meet Modern Computation Methods
Let’s take a quick look at how the two millionaires, Bob and Alice, might be able to answer their question. In this example, Bob uses Snowflake access policies and predetermined queries to allow Alice to compare her wealth with his without revealing the underlying personal information (i.e., his actual wealth). Here are the steps:
Alice creates a table with her wealth value (alice.wealth).
Bob also creates a table with his wealth value (bob.wealth). But he then creates an access policy that specifies exactly which question Alice can ask and what answer she can receive. In other words, Alice can’t ask what the wealth value is. She can only ask about how the two relate to each other, and get the answers that Bob has specified:
when bob.wealth > alice.wealth then 'bob is richer'
when bob.wealth = alice.wealth then 'neither is richer'
else 'alice is richer' end
This enables Alice’s data to be compared to Bob’s but without her sharing her wealth value and without Bob sharing his. If Alice tries to ask anything other than the allowed questions, she will get no results.
Snowflake Data Clean Room Functionality Enables Sharing Without Showing
Of course, Snowflake customers are not comparing their wealth. But there are other use cases for which secure multi-party computation is extremely valuable. Examples extend across industries with use cases such as optimizing ad placement and marketing campaigns, identifying common transaction patterns to improve fraud detection, or tracking the patient lifecycle to improve medical outcomes.
Let’s take a look at an increasingly common use case. With the phase out of third-party cookies, companies are looking for alternative means of understanding their customers’ online habits. They are no longer able to track customers directly, and to adhere to privacy requirements and industry regulations, companies cannot just share their lists of customers. However, they might want to identify which customers they have in common. This use case is common in media and advertising. A brand wants to place an advertisement about a new product on a media platform, but wants to ensure they are reaching the right audience.
Imagine a sports brand that wants to let customers know about the release of new equipment or new shoes. Working with a media provider, the brand wants to target subscribers who also purchase their sports equipment or apparel. It wants its advertisements to run during shows that its customers are most likely to be watching. To do that, they need to determine overlapping customers and subscribers without sharing or exposing any customer’s data to each other.
Both parties have customer information. The brand knows its customers’ sporting habits, and often other demographic information each customer has shared. The media provider knows which shows the subscribers watch. The overlap reveals where to place the advertisements. In the fictional example below, for example, High Mileage Runners, a persona created based on frequency of shoe purchases, are most likely to be found watching Friendliest Catch. None of the sporty viewers are likely to be watching Cakemakers, a category which could be redacted to prevent reidentification due to the small sample size.
In this advertising scenario, there are often more participants, including third-party data providers whose data enriches the information known about customer segments, and agencies who develop and execute advertising campaigns. The functionality provided by the Snowflake platform—access policies, allowed statements, and data sharing—enables multiple parties to get the information they need to increase the value of these marketing investments.
Putting Data Clean Room Functionality Into Practice
This scenario is not just theoretical or fictional. The advertising arm of a large media company uses Snowflake data clean room capabilities to increase the value of ad placement for its advertisers by revealing insights that are not immediately obvious. For example, the media company recently acquired the rights to the programming of a major sports league, whose avid fans are an important target market for brands. But games are not all they watch. They over-index to specific comedy shows and series. Data clean rooms help brands target customer segments both inside and outside the game window.
In the new cookie-less advertising world, with data clean room capabilities enabled, multiple parties can share information without showing their underlying data. For example:
- Brands can leverage their growing first-party customer data with third-party enrichment to refine targets for marketing campaigns.
- Media companies can securely match their first-party subscriber data, across their own platforms, with brands’ customer data to improve ad placement.
- Agencies, or the advertisers or media companies themselves, can combine campaign logs, identity, attribution, and sales data to measure the performance of an ad campaign.
Data Clean Room Use Cases Across Other Industries
Advertising is certainly not the only use case. The benefits of data clean rooms extend across use cases and industries. Imagine medical researchers who want to understand if patients receiving certain diagnoses and prescribed treatments are actually filling prescriptions at the pharmacy. Or manufacturing companies sharing data with parts suppliers to reduce unscheduled downtime by predicting defects and improving product development.
Several examples of cross-industry collaboration also illustrate the innovative power of secure data collaboration. Retailers share consumer trends with consumer packaged goods manufacturers to help improve product development and forecast demand. Banks and retailers collaborate to offer brand-specific credit cards. Companies suffering from security attacks might want to compare the originating IP addresses to identify and collaborate to defend against common threats. The opportunities for data collaboration—and secure data collaboration—are endless.
We’ll be discussing this vast World of Data Collaboration at the upcoming Snowflake Summit 2022, June 13-16 in Las Vegas. In the meantime, take a look at this recent Snowflake webinar, Enabling The Future of Data Collaboration With Data Clean Rooms.