Product and Technology

Introducing Snowpark pandas API: Run Distributed pandas at Scale in Snowflake


Python’s popularity has grown significantly, quickly becoming the preferred language for development across machine learning, application development, pipelines and more. At Snowflake we are deeply committed to delivering a best-in-class platform for Python developers. In line with this commitment, we're thrilled to announce the public preview support of Snowpark pandas API, enabling seamless execution of distributed pandas at scale in Snowflake.

Snowflake customers are already harnessing the power of Python through Snowpark, a set of libraries and code execution environments that run Python and other programming languages next to your data in Snowflake. With Snowpark's existing DataFrame API, users have access to a robust framework for lazily evaluated, relational operations on data, closely resembling Spark's conventions. In April 2024, Snowflake customers ran approximately 55 million queries in Snowpark on average each day for a spectrum of large-scale data processing tasks in data engineering and data science. Now that Snowpark has expanded to provide a pandas-compatible API layer, users can get the same pandas-native experience they know and love, with minimal code changes, backed by Snowflake's performance, scale and governance.

Why introduce a distributed pandas API?

pandas is the go-to data processing library for millions worldwide, including countless Snowflake users. However, pandas was never built to handle data at the scale organizations are operating today. Running pandas code requires transferring and loading all of the data into a single in-memory process. It becomes unwieldy on moderate-to-large data sets and breaks down completely on data sets that grow beyond what a single node can handle. We know organizations work with this volume of data today, and Snowpark pandas enables you to execute that same pandas code, with all the pandas processing pushed down to run in a distributed fashion in Snowflake. Your data never leaves Snowflake, and your pandas workflows run far more efficiently on Snowflake's elastic engine. This brings the power of Snowflake to pandas developers everywhere.

Benefits of Snowpark pandas API

  • Accelerated and seamless development: Snowpark pandas overcomes the single-node memory limitation of traditional pandas, enabling developers to move effortlessly from prototype to production without encountering out-of-memory errors or having to rewrite pandas code to other frameworks (e.g. Spark, Snowpark DataFrames API or SQL), providing smooth and accelerated development cycles.
  • Meeting Python developers where they are: Snowpark pandas API preserves the same pandas API signatures and dataframe semantics that make pandas so easy to use and popular. No new syntax to learn or heavy amounts of code to rewrite.
  • Security and governance: Data does not leave Snowflake’s secure platform. The Snowpark pandas API pushes down the compute to where the data lives and brings uniformity within data organizations to how data is accessed, allowing for easier auditing and governance.
  • No additional compute infrastructure to manage and tune: The solution runs on the Snowflake compute engine and benefits from Snowflake's pre-existing query optimization techniques. End users need not spin up, manage or tune any additional compute infrastructure.
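To illustrate the "no new syntax" point: the operations below are standard pandas and would run unchanged under Snowpark pandas. This is a minimal sketch, assuming the Snowpark pandas package is installed; only the import lines (shown in the comment) would differ from stock pandas.

```python
# With Snowpark pandas, the only change from stock pandas is the import:
#   import modin.pandas as pd
#   import snowflake.snowpark.modin.plugin  # registers the Snowflake backend
# The code below is plain pandas and runs identically either way.
import pandas as pd

df = pd.DataFrame({"region": ["EU", "US", "EU", "US"],
                   "sales": [100, 250, 75, 300]})

# A familiar group-by aggregation; under Snowpark pandas this would be
# pushed down to Snowflake rather than computed in local memory.
totals = df.groupby("region")["sales"].sum()
print(totals.to_dict())  # {'EU': 175, 'US': 550}
```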

Try it for yourself! Get started in less than 2 minutes by following this quickstart.

How does Snowpark pandas API work?

Snowpark pandas leverages the open source Modin API as the frontend client layer to maintain the exact pandas API signatures and preserve the dataframe semantics that have made pandas popular and easy to use. Behind the scenes, however, Snowpark pandas operates differently: instead of materializing an in-memory pandas dataframe, DataFrame operations are transparently converted into SQL queries that are pushed down to Snowflake's robust and powerful compute engine. This means you can continue using pandas syntax while benefiting from Snowflake's battle-tested, scalable and heavily optimized data infrastructure to execute your pandas code in a distributed fashion.
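As a rough sketch of the transpilation idea: a chained filter-and-aggregate in pandas corresponds to a single SQL query when pushed down. The pandas code below runs locally as written; the SQL in the comment is an illustrative approximation of the kind of query Snowpark pandas would generate, not the exact output.

```python
import pandas as pd

df = pd.DataFrame({"amount": [10, 200, 35, 500]})

# In Snowpark pandas, this filter + mean would be transpiled into one
# pushed-down query, roughly:
#   SELECT AVG(amount) FROM t WHERE amount > 50
result = df[df["amount"] > 50]["amount"].mean()
print(result)  # 350.0
```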

Furthermore, you have the flexibility to incorporate custom Python logic as User Defined Functions (UDFs) and leverage popular open source packages already preinstalled in Snowflake. This allows you to utilize pandas' versatile apply() function to process data along DataFrame or Series axes with ease, whether it be applying built-in Python functions, lambda functions or custom user-defined functions.
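A quick sketch of the apply() pattern described above, using plain pandas so it runs anywhere. Under Snowpark pandas, the lambda would be serialized and executed as a UDF inside Snowflake's sandboxed Python runtime rather than locally; the column names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"price": [9.99, 24.50, 3.25]})

# apply() with a lambda over a Series; in Snowpark pandas this custom
# logic runs server-side as a UDF instead of in the local process.
df["with_tax"] = df["price"].apply(lambda p: round(p * 1.08, 2))
print(df["with_tax"].tolist())
```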

Figure 3: Native pandas operations are transpiled and pushed down to run as SQL queries, while custom Python code is serialized and pushed down to run in a secure, sandboxed Python environment.

As of this blog’s writing, the Snowpark pandas API covers most popular pandas API functionality, with ongoing efforts to expand support. Furthermore, we will be looking into integrating with downstream third-party OSS libraries and more. Give it a try and let us know your feedback by emailing us at [email protected].

