How to Configure AWS Glue with Snowflake for Data Integration
PLEASE NOTE: This post was originally published in 2018. It has been updated to reflect currently available products, features, and/or functionality.
The process of extraction, transformation, and load (ETL) is central to any data warehousing initiative. With advances in cloud data warehouse architectures, customers are also benefiting from the alternative approach of extraction, load, and transformation (ELT), where data processing is pushed to the database.
With either approach, the debate continues: Should you use a hand-coded method or leverage any number of the available ETL or ELT data integration tools? While there are advantages to both, and some will choose a “one or the other” approach, many organizations elect to use a combination of a data integration tool along with hand-coding.
Code provides developers with the flexibility to build using preferred languages while maintaining a high level of control over integration processes and structures. The challenge has been that hand-coding options are traditionally more complex and costly to maintain. However, with AWS Glue, developers now have an option to more easily build and manage their data preparation and loading processes with generated code that is customizable, reusable, and portable with no infrastructure to buy, set up, or manage.
In this blog post, we’ll cover how to leverage the power of AWS Glue with Snowflake and how processing is optimized through the use of query pushdown for ELT.
Why AWS Glue with Snowflake?
Snowflake customers now have a simple option to manage their programmatic data integration processes without worrying about servers, Spark clusters, or the ongoing maintenance traditionally associated with these systems. AWS Glue provides a fully managed environment that integrates easily with Snowflake’s data warehouse as a service. Together, these two solutions enable customers to manage their data ingestion and transformation pipelines with more ease and flexibility than ever before.
With AWS Glue and Snowflake, customers get the added benefit of Snowflake’s query pushdown, which automatically pushes Spark workloads, translated to SQL, into Snowflake. Customers can focus on writing their code and instrumenting their pipelines without having to worry about optimizing Spark performance. With AWS Glue and Snowflake, customers can reap the benefits of optimized ELT processing that is low cost and easy to use and maintain.
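To make the pushdown benefit concrete, here is a minimal sketch of the kind of Spark code that gets translated to SQL and executed inside Snowflake. It assumes the sfOptions connection dictionary and the spark session defined in the sample script later in this post; the ORDERS table and its columns are hypothetical placeholders.
# Minimal sketch; assumes pushdown is enabled and sfOptions is defined as in the sample script below.
# The ORDERS table and its columns are hypothetical placeholders.
orders = spark.read.format("net.snowflake.spark.snowflake").options(**sfOptions).option("dbtable", "ORDERS").load()
# With pushdown enabled, the filter and aggregation below run as a single SQL query in Snowflake.
orders.filter("O_TOTALPRICE > 1000").groupBy("O_ORDERSTATUS").count().show()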
AWS Glue and Snowflake in action
Prerequisites
- The latest Snowflake Spark Connector
- The latest Snowflake JDBC Driver (verify the supported JDBC Driver version for the Spark Connector version you are using; see the release notes)
- An S3 bucket in the same region as AWS Glue
NOTE: AWS Glue 3.0 requires Spark 3.1.1. Snowflake Spark Connector 2.10.0-spark_3.1 or higher and Snowflake JDBC Driver 3.13.14 can be used with it.
Setup
1. Log in to AWS.
2. Search for and click on the S3 link.
   a. Create an S3 bucket and folder.
   b. Add the Spark Connector and JDBC .jar files to the folder.
   c. Create another folder in the same bucket to be used as the Glue temporary directory in later steps (see below).
3. Switch to the AWS Glue service.
4. Click on Jobs in the left panel under ETL.
5. Add a job by selecting the Spark script editor option and clicking Create, then click on the Job Details tab.
   a. Provide a name for the job.
   b. Select an IAM role. Create a new IAM role if one doesn't already exist, and be sure to add all Glue policies to this role.
   c. Select Spark as the type.
   d. Select the Glue version (see the note above for Glue version 3.0).
   e. Select Python 3 as the language.
   f. Click on Advanced properties to expand that section.
   g. Give the script a name.
   h. Set the temporary directory to the one you created in step 2c.
   i. Under Libraries, in the Dependent jars path, add entries for both .jar files from step 2b.
   NOTE: You must provide the full path to the actual .jar files. Example: s3://[bucket_name]/GlueJars/spark-snowflake_2.12-2.10.0-spark_3.1.jar,s3://[bucket_name]/GlueJars/snowflake-jdbc-3.13.14.jar
6. Under Job parameters, enter the following keys with your Snowflake account information: --URL, --WAREHOUSE, --DB, --SCHEMA, --USERNAME, and --PASSWORD. Make sure to include the two dashes before each key. (For a programmatic way to create the same job, see the sketch after these steps.)
   NOTE: Storing your account information and credentials this way exposes them to anyone with access to this job. This can be useful for testing, but it is recommended that you store your credentials securely as outlined in the "Securing credentials" section below.
   NOTE: URL is your Snowflake account URL without the https:// prefix; for example, abcd123.snowflakecomputing.com
7. Click Save at the top right.
8. Click on the Script tab at the top to enter your script.
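As an alternative to the console steps above, the same job can be created programmatically. The following is a minimal, illustrative sketch using boto3; the job name, role, bucket, and script location are placeholder assumptions, and the job parameters mirror the keys used by the sample script below.
import boto3

# Hypothetical names; replace the job name, role, bucket, and script location with your own.
glue = boto3.client("glue")
glue.create_job(
    Name="snowflake-glue-job",
    Role="MyGlueServiceRole",
    GlueVersion="3.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://[bucket_name]/scripts/snowflake_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Glue special parameters: temporary directory and dependent .jar files
        "--TempDir": "s3://[bucket_name]/GlueTemp/",
        "--extra-jars": "s3://[bucket_name]/GlueJars/spark-snowflake_2.12-2.10.0-spark_3.1.jar,s3://[bucket_name]/GlueJars/snowflake-jdbc-3.13.14.jar",
        # Snowflake connection parameters consumed by the sample script
        "--URL": "abcd123.snowflakecomputing.com",
        "--WAREHOUSE": "[warehouse_name]",
        "--DB": "[database_name]",
        "--SCHEMA": "[schema_name]",
        "--USERNAME": "[username]",
        "--PASSWORD": "[password]",
    },
)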
Sample script
Use the following sample script to test the integration between AWS Glue and your Snowflake account. This script assumes you have stored your account information and credentials using the job parameters described in the setup steps above.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import
## @params: [JOB_NAME, URL, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'WAREHOUSE', 'DB', 'SCHEMA', 'USERNAME', 'PASSWORD'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
java_import(spark._jvm, SNOWFLAKE_SOURCE_NAME)
## uj = sc._jvm.net.snowflake.spark.snowflake
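## Enable query pushdown so that eligible Spark operations are translated to SQL and executed inside Snowflake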
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
"sfURL" : args['URL'],
"sfUser" : args['USERNAME'],
"sfPassword" : args['PASSWORD'],
"sfDatabase" : args['DB'],
"sfSchema" : args['SCHEMA'],
"sfWarehouse" : args['WAREHOUSE'],
"application" : "AWSGlue"
}
## Read from a Snowflake table into a Spark Data Frame
df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[table_name]").load()
## Perform any kind of transformations on your data and save as a new Data Frame: df1 = df.[Insert any filter, transformation, or other operation]
## Write the Data Frame contents back to Snowflake in a new table
df1.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[new_table_name]").mode("overwrite").save()
job.commit()
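As one illustration of the placeholder lines above (the table and column names here are hypothetical), the transformation step could be a simple filter whose result is written back to a new Snowflake table:
## Example only: keep rows matching a condition, then write them to a new table
df1 = df.filter(df["STATUS"] == "COMPLETED")
df1.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "COMPLETED_ORDERS").mode("overwrite").save()
job.commit()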
Securing credentials
To store your account information and credentials securely, see the following article, which describes how this is accomplished with EC2: How to Securely Store Credentials with EC2.
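The linked article focuses on EC2. As one commonly used alternative, shown here only as a hedged sketch, the sample script could read the username and password at runtime from AWS Secrets Manager instead of job parameters. This assumes a secret (the name snowflake/glue_job below is hypothetical) holding a JSON object with username and password fields, and an IAM role on the Glue job that is allowed to read it.
import json
import boto3

# Hypothetical secret name and structure; adjust to your environment.
secrets_client = boto3.client("secretsmanager")
secret = json.loads(secrets_client.get_secret_value(SecretId="snowflake/glue_job")["SecretString"])

# Use the retrieved values in place of the USERNAME and PASSWORD job parameters.
sfOptions["sfUser"] = secret["username"]
sfOptions["sfPassword"] = secret["password"]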
Conclusion
AWS Glue and Snowflake make it easy to get started and manage your programmatic data integration processes. AWS Glue can be used standalone or in conjunction with a data integration tool without adding significant overhead. With native query pushdown through the Snowflake Spark connector, this approach optimizes both processing and cost for true ELT processing. With AWS Glue and Snowflake, customers get a fully managed, fully optimized platform to support a wide range of custom data integration requirements.