Welcome to Snowpark: New Data Programmability for the Data Cloud

At Snowflake Summit 2021, we announced that Snowpark and Java functions were starting to roll out to customers. Today, we're happy to announce that these features are available in preview to all customers on AWS.

These features represent a major new foray into data programmability, making it easier to get more out of Snowflake's platform.

Basecamp

Snowflake started its journey to the Data Cloud by completely rethinking the world of data warehousing to accommodate big data. This was no small feat, but a tip-to-toe reworking of how a reliable, secure, high-performance, and scalable data-processing system should be architected for the cloud.

As Snowflake grew the Data Cloud, we naturally needed to expand the ways users interact with the system. In the data warehousing world, SQL is the lingua franca, but not every developer wants to write in SQL, nor does SQL naturally handle every data programmability problem. In addition, data warehousing systems limit the kinds of operations people can perform, leading users to pull their data into other systems for these tasks, adding cost, time, and complexity while hurting security and governance.

But what to do about this? One option would be to create a new system to tackle these new scenarios. But that would mean a new system to manage. It would mean that users need to choose which system to use for each task—or part of a task. And it would mean that different users of different systems would need to integrate those systems to work together.

It would mean complexity, and that’s not the Snowflake way.

Instead, we thought deeply about how to maintain the simplicity and power of Snowflake for data warehousing, while building extensibility into the engine for broader data programmability. And we thought about the right libraries to enable deep, streamlined language integration, allowing more people to work natively with Snowflake to accomplish their tasks.

Snowflake’s Data Cloud blows the data warehousing system wide open. One simple, seamless system with reliability, security, performance, and scale: that’s the Snowflake way.

Read on for details about these new features—and what’s to come.

Snowpark

Snowpark is a new developer experience that we’re using to bring deeply integrated, DataFrame-style programming to the languages developers like to use, starting with Scala. Snowpark is designed to make building complex data pipelines a breeze and to allow developers to interact with Snowflake directly without moving data.

Let’s take a look at some of what makes Snowpark special.

With Snowpark, developers can build queries using DataFrames right in their code, without having to create and pass along SQL strings:

import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

// Connect to Snowflake; connProps is assumed to be a Map of
// connection parameters (account, user, role, warehouse, etc.).
val sess = Session.builder.configs(connProps).create

val sales: DataFrame = sess.table("sales")
val line_items: DataFrame = sess.table("sales_details")

val query = sales.join(line_items, sales("id") === line_items("sid"))
                 .groupBy(line_items("product_id"))
                 .count()

Because Snowpark uses first-class language constructs, you get first-class support from your development environment: type checking, IntelliSense, and error reporting. Under the covers, Snowpark converts these operations into SQL that runs right inside Snowflake using the same high-performance, scalable engine you already know.
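
Note that DataFrame operations are lazily evaluated: building the query above sends nothing to Snowflake. Only when you invoke an action does Snowpark generate and execute the SQL. A minimal example:

// No SQL runs until an action is invoked; show() generates the
// query, executes it in Snowflake, and prints the first rows.
query.show()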

But Snowpark is a lot more than just a nicer way to write queries: You can bring along your custom logic as well. Let’s say that you have some custom code to mask personally identifiable information (PII):

val maskPii = (s: String) => {
  // Custom PII detection logic; as a simple placeholder here,
  // redact anything that looks like an email address.
  s.replaceAll("""[\w.+-]+@[\w-]+\.[\w.-]+""", "*****")
}

With Snowpark, you can very simply declare that this is a user-defined function (UDF), and then make use of it in your DataFrame operations:

val maskPiiUdf = udf(maskPii)
sess.table("emails")
    .withColumn("body", maskPiiUdf(col("body")))
    .show()
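
If you also want to call the same logic by name from SQL in your session, Snowpark can register the lambda as a named temporary UDF. A minimal sketch (the function name is illustrative):

// Register a named temporary UDF; SQL in this session can now
// call MASK_PII(...) directly.
sess.udf.registerTemporary("mask_pii", maskPii)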

Snowpark takes care of pushing all of your logic to Snowflake, so it runs right next to your data. To host that code, we’ve built a secure, sandboxed JVM right into Snowflake’s warehouses—more on that in a bit.

But wait; there’s more! Let’s say that you wanted to apply your PII detection logic to all of the string columns in a table. With SQL, you’d have to hand-code a query for each table—or write code to generate the query. With Snowpark, you can easily write a generic routine:

import com.snowflake.snowpark.types.StringType

val maskTable = (df: DataFrame) => {
  df.select(df.schema.map(field =>
    // Mask string columns; pass others through, keeping original names.
    if (field.dataType == StringType) maskPiiUdf(col(field.name)).as(field.name)
    else col(field.name))
  )
}

And with this generic routine in hand, you can mask all of the PII in any table with ease:

val maskedEmails = maskTable(sess.table("emails"))

Snowpark takes care of dynamically generating the correct query in a robust, schema-driven way.
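
And because the masked result is just another DataFrame, you can keep working with it or persist it back to Snowflake without the data ever leaving the platform. A small sketch, with a target table name of our choosing:

// Save the masked rows to a new table (SaveMode comes from
// com.snowflake.snowpark); the table name is illustrative.
maskedEmails.write.mode(SaveMode.Overwrite).saveAsTable("emails_masked")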

Language integration, pushdown of custom logic, and flexible query generation make Snowpark a powerful data-programmability tool, letting you write complex data pipelines with ease.

Java Functions

As you saw above, Snowpark has the ability to push your custom logic into Snowflake, where it can run right next to your data. This is done by running the code in a secure, sandboxed JVM hosted right inside Snowflake’s warehouses.

But why let Snowpark developers have all the fun? SQL is still Snowflake’s bread and butter, so we made sure that SQL users can get the full benefit of the platform’s new capabilities through a feature we’ve creatively named Java functions.

With Java functions, developers can build complex logic that exposes a simple function interface:

public class Sentiment
{
    public float score(String text)
    {
        // Your sentiment analysis logic here; return a neutral
        // placeholder score so the example compiles.
        return 0.0f;
    }
}

In building these functions, developers can make full use of their existing toolsets—source control, development environments, debugging tools—and they can bring along libraries as well. Find some useful code on GitHub? Use it in Snowflake!

To get it into SQL, all you need to do is build a JAR (or JARs), load it into Snowflake, and register a function:

create function sentiment(txt string) returns float
language java
imports = ('@jars/Sentiment.jar')
handler = 'Sentiment.score';
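
Getting the JAR into the @jars stage is a standard file upload, for example with a PUT command from SnowSQL (the local path here is illustrative):

-- Upload the JAR to the stage without compressing it.
put file:///tmp/Sentiment.jar @jars auto_compress=false;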

And with this in hand, any SQL user can use the logic you’ve built just like any other function:

select id, sentiment(body)
from emails;

We think this is pretty easy, and letting developers use their existing tooling is great for complex cases. But sometimes you have something basic to do, so we added simple, inline definitions as well:

create or replace function reverse(s string) returns string
language java
handler = 'Reverse.reverse'
target_path = '@jars/Reverse.jar'
as
$$
public class Reverse
{ 
    public String reverse(String s)
    {
        return new StringBuilder(s).reverse().toString();
    }
}
$$;
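
Once defined, it's callable like any other function:

select reverse('Snowflake');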

Powerful Java functions with the simplicity of Snowflake—but we’re still not done.

Snowpark Accelerated

You've seen some examples of the powerful things you can do with Snowpark and Java functions. These features open up exciting opportunities, and we've been equally encouraged by the interest we've seen in our partner community.

As part of this launch, we've created the Snowpark Accelerated program to help highlight and support the incredible products our partners are creating using these features. So far, we have nearly 50 partners enrolled.

It's been amazing to see what our partners have built so far. They're building seamless machine learning pipelines for fast, in-database model scoring. They're bringing natural-language processing, data quality, and profiling routines right to the data. They're creating analytic dashboards with visual query flows that would be hard to achieve in SQL, and they're powering the next generation of ETL. And that's just the beginning.

This Is a Journey

To get started, check out our documentation on Snowpark and Java functions, and follow the step-by-step lab guide for a hands-on introduction to Snowpark. We're eager to see the amazing things you'll build with them.

And while you explore these new features, we're already working on the next round. At Summit we discussed some of the enhancements in the works, including logging support, table functions, and support for files.

We’re also working on Snowpark stored procedures, which will let you host your Snowpark pipelines in Snowflake for scheduling and orchestration. And we have some other tricks up our sleeves, too.

We’re just beginning this journey. Thanks for coming along with us.
