Apache Spark is an open-source, general-purpose cluster-computing system for large-scale data processing.
But what use cases are a good fit for the Spark framework? Below are a few scenarios where using Spark makes sense:
Projects that involve massive sets of disparate data types (for example, multi-terabyte structured data sets mixed with semi-structured JSON)
Projects that have massive data volumes and also require quick (even in-stream) analysis
Projects without a budget for proprietary third-party tools
Why is Spark well-suited for the conditions mentioned above?
For those with budget concerns, it is an open-source framework that can run on commodity hardware
It processes data mainly in memory (leveraging RDDs, or Resilient Distributed Datasets), speeding data access and reducing disk I/O latency (see the sketch after this list)
It features an extensive API that can drastically reduce application development time
Spark is easy to program, and users can write simple, object-oriented queries within a distributed computing environment.
It includes eighty high-level operators that simplify parallel application development
It supports graph processing for advanced machine learning, data science, and data mining applications
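To make the in-memory and high-level-operator points concrete, here is a minimal sketch using Spark's DataFrame API; the S3 path and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("EventSummary")
  .getOrCreate()

// Read semi-structured JSON directly into a DataFrame (path is hypothetical)
val events = spark.read.json("s3://my-bucket/events/*.json")

// Cache the dataset in memory so repeated queries avoid disk I/O
events.cache()

// A few of Spark's high-level operators: filter, groupBy, and aggregate
val summary = events
  .filter(col("status") === "active")
  .groupBy("country")
  .agg(count("*").as("event_count"))

summary.show()
```

Because the cached DataFrame lives in executor memory, subsequent actions against it avoid re-reading the source files.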
Snowflake and Spark
The Snowflake Connector for Spark enables connectivity between Spark and Snowflake in both directions. It provides the Spark ecosystem with access to Snowflake as a fully managed and governed repository for all data types, including JSON, Avro, CSV, XML, machine-born data, and more. The connector also enables powerful integration use cases, including:
Complex ETL: Using Spark, you can easily build complex, functionally rich, and highly scalable data ingestion pipelines for Snowflake. With a large set of readily available connectors to diverse data sources, Spark facilitates data extraction, which is typically the first step in any complex ETL pipeline (a minimal connector sketch follows this list).
Machine Learning: With Spark integration, Snowflake provides users with an elastic, scalable repository for all the data underlying algorithm training and testing. Processing capacity requirements for machine learning pipelines often fluctuate heavily. Snowflake can easily expand its compute capacity so that machine learning workloads in Spark can process large amounts of data.
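As an illustrative sketch of both use cases, the snippet below reads a Snowflake table into a Spark DataFrame via the Snowflake Connector for Spark, applies a transformation, and writes the result back to Snowflake. The account URL, credentials, warehouse, and table names are all placeholders; in practice, credentials would come from a secrets manager rather than inline strings.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SnowflakeEtl")
  .getOrCreate()

// Connection options for the Snowflake Connector for Spark
// (all values here are placeholders)
val sfOptions = Map(
  "sfURL"       -> "myaccount.snowflakecomputing.com",
  "sfUser"      -> "etl_user",
  "sfPassword"  -> "********",
  "sfDatabase"  -> "ANALYTICS",
  "sfSchema"    -> "PUBLIC",
  "sfWarehouse" -> "ETL_WH"
)

// Read a Snowflake table into a Spark DataFrame
val rawEvents = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "RAW_EVENTS")
  .load()

// Transform with Spark, then write the result back to Snowflake
rawEvents
  .filter("status = 'active'")
  .write
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "CLEAN_EVENTS")
  .mode("append")
  .save()
```

The same read path is what a machine learning pipeline would use to pull training and test data out of Snowflake into Spark before fitting a model.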