Apache Spark was designed to function as a simple API for distributed data processing in general-purpose programming languages. It enabled tasks that would otherwise require thousands of lines of code to be expressed in just dozens.
Spark Components
The Spark ecosystem includes a combination of Spark's built-in components and various libraries that support SQL, Python, Java, and other languages, making it possible to integrate Spark with multiple workflows.
1. Apache Spark Core API
The underlying execution engine for the Spark platform. It provides in-memory computing and the ability to reference data sets in external storage systems.
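A minimal Scala sketch of the core API, assuming a working Spark installation: it creates a SparkSession, distributes a local collection as an RDD, and aggregates it in parallel.

import org.apache.spark.sql.SparkSession

// Create (or reuse) the entry point to Spark; the application name is illustrative.
val spark = SparkSession.builder().appName("core-example").getOrCreate()
val sc = spark.sparkContext

// Distribute a local collection across the cluster and operate on it in parallel.
val numbers = sc.parallelize(1 to 1000000)
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
println(s"Sum of squares: $sumOfSquares")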
2. Spark SQL
The interface for processing structured and semi-structured data. It enables users to query databases, import relational data, run SQL queries, and scale quickly, extending Spark's data processing and analytics capabilities while optimizing performance. However, Spark SQL is not ANSI SQL, and it requires users to learn a different SQL dialect.
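For example, a minimal Spark SQL sketch in Scala, assuming an existing SparkSession named spark and a hypothetical people.json file:

// Read semi-structured data, register it as a view, and query it with SQL.
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()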
3. Spark Streaming
This allows Spark to process real-time streaming data ingested from sources such as Kafka, Flume, and the Hadoop Distributed File System (HDFS) and to push results out to file systems, databases, and live dashboards.
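A minimal sketch using the newer Structured Streaming API in Scala, assuming an existing SparkSession named spark, the Kafka integration package on the classpath, and a placeholder broker and topic:

// Read a stream of events from a Kafka topic.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker address
  .option("subscribe", "events")                          // placeholder topic name
  .load()

// Decode the message payload and continuously write results to the console.
val query = events.selectExpr("CAST(value AS STRING) AS message")
  .writeStream
  .format("console")
  .start()
query.awaitTermination()

The query runs continuously until it is stopped, emitting each micro-batch of results as new messages arrive.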
4. MLlib
A collection of machine learning (ML) algorithms for classification, regression, clustering, and collaborative filtering. It also includes other tools for constructing, evaluating, and tuning ML pipelines.
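A minimal MLlib pipeline sketch in Scala; the column names and the trainingData and newData DataFrames are assumptions made for illustration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Combine raw columns into a single feature vector (column names are hypothetical).
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")

// Chain feature preparation and model training into a single pipeline.
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(assembler, lr))

// trainingData and newData are assumed DataFrames containing the columns above.
val model = pipeline.fit(trainingData)
model.transform(newData).select("prediction").show()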
5. GraphX
A library for manipulating graphs and performing graph-parallel computation, unifying the extract, transform, and load (ETL) process, exploratory analysis, and iterative graph computation.
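A minimal GraphX sketch in Scala, assuming an existing SparkContext named sc; the vertices and edges are made up for illustration:

import org.apache.spark.graphx.{Edge, Graph}

// Build a small graph from vertex and edge RDDs.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph(vertices, edges)

// Iterative graph computation: run PageRank until it converges to a tolerance of 0.001.
graph.pageRank(0.001).vertices.collect().foreach(println)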
6. SparkR
The R package for using Spark from R. Its key element is the SparkR DataFrame, a distributed data structure for data processing in R built on the data frame concept familiar from R itself and from libraries such as Pandas in other languages.
Spark Architecture
Spark distributes data across storage clusters and processes it concurrently. It uses a master/agent architecture: the driver program (the master) plans the work and communicates with executors (the agents), which carry out tasks on the cluster's worker nodes. Spark can be used for both batch processing and real-time processing.
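A sketch of what this looks like from the driver's side, with a placeholder master URL and illustrative resource settings:

import org.apache.spark.sql.SparkSession

// The driver program: it connects to a cluster manager (the master URL is a placeholder)
// and requests executors with the given resources.
val spark = SparkSession.builder()
  .appName("architecture-example")
  .master("spark://cluster-manager-host:7077")
  .config("spark.executor.memory", "4g")
  .config("spark.executor.cores", "2")
  .getOrCreate()

// Work defined on the driver is split into tasks that run in parallel on the executors.
spark.read.parquet("events.parquet").groupBy("status").count().show()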
What Spark Does
At the time of creation, Apache Spark was considered versatile, scalable, and fast, making the most of big data platforms in the Hadoop ecosystem.
Processing
Spark is based on the concept of the resilient distributed dataset (RDD), a collection of elements that are independent of each other and can be operated on in parallel, saving time in read and write operations. In 2015, the developers of Spark created the Spark DataFrames API to support modern big data and data science applications, modeling it after data frames in R and Python (Pandas).

Conceptually similar to a table in a relational database or a data frame in R or Python, a DataFrame is a distributed collection of data organized into named columns. DataFrames can be constructed from a variety of sources, including structured data files, external databases, and existing RDDs. The DataFrame construct offers a domain-specific language for distributed data manipulation and also allows the use of SQL through Spark SQL. At the time of its creation, Apache Spark provided a revolutionary framework for big data engineering, machine learning, and AI.
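To illustrate, the following Scala sketch (with hypothetical data, file path, and column names) builds a DataFrame from an existing RDD and from a structured file, then manipulates it through the DataFrame domain-specific language:

import org.apache.spark.sql.functions.col

// Hypothetical case class to give the data a schema; spark and sc are assumed to exist.
case class Person(name: String, age: Int)
import spark.implicits._

// A DataFrame built from an existing RDD...
val fromRdd = sc.parallelize(Seq(Person("Ana", 34), Person("Ben", 19))).toDF()

// ...or from a structured data file (the path is a placeholder).
val fromFile = spark.read.option("header", "true").csv("people.csv")

// Named columns enable a domain-specific language for distributed manipulation.
fromRdd.filter(col("age") > 21).select("name").show()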
Flexibility
Spark code can be written in Java, Python, R, and Scala.
In-memory computing
Spark stores the data in RAM, allowing relatively quick access and analytics.
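For example, marking a dataset as cached keeps it in executor memory after the first action, so later queries avoid rereading it from disk (the file path and column name below are placeholders):

import org.apache.spark.sql.functions.col

// Cache the dataset in executor memory.
val events = spark.read.parquet("events.parquet").cache()
events.count()                                    // the first action materializes the cache
events.filter(col("status") === "error").count()  // later queries are served from RAM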
Real-time processing
Spark can process real-time streaming data, producing instant outputs.
Analytics
Spark comes with a set of SQL queries, machine learning algorithms, and other analytical functionalities.
Drawbacks of Spark
When it was created, the Spark architecture provided a scalable and versatile processing system that met complex big data needs, and it allowed developers to speed up data processing while improving performance across the Spark ecosystem. However, technology has evolved since then, and concerns such as security and governance have grown. In addition, Spark was born in the on-premises era, making it more difficult to manage and tune than later cloud-built options. As a result, Spark has a few drawbacks to keep in mind.
Creates Silos
Because the Spark framework was often used for big data processing alongside a traditional data warehouse, data must be moved between systems for different uses. This creates a siloed approach with lots of pipeline complexity. With multiple data locations, organizations end up with multiple versions of “truth” and must deal with unnecessary data pipelines and a complex architecture. And since Spark does not have integrated data storage and is used primarily by parallel processing experts (for example, data engineers and data scientists), silos also form between these platforms and the tools used by analysts and other business users.
Users of Snowflake can address this issue easily, however. Snowflake’s Snowpark framework simplifies architecture and data pipelines by processing all data within the Snowflake Data Cloud—without moving it around. Different data users, from analysts to data scientists and data engineers, can collaborate on the same data in a single platform, which streamlines architecture by natively supporting everyone’s programming language of choice.
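A minimal Snowpark sketch in Scala, assuming the Snowpark library is available and using placeholder connection parameters and a hypothetical orders table:

import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

// Connection parameters are placeholders; in practice they come from a config file or secrets manager.
val session = Session.builder.configs(Map(
  "URL" -> "https://<account_identifier>.snowflakecomputing.com",
  "USER" -> "<user>",
  "PASSWORD" -> "<password>",
  "WAREHOUSE" -> "<warehouse>",
  "DATABASE" -> "<database>",
  "SCHEMA" -> "<schema>"
)).create

// The DataFrame is evaluated lazily and the processing is pushed down into Snowflake,
// so the data never has to leave the platform. Table and column names are hypothetical.
val orders = session.table("orders")
orders.filter(col("amount") > 100).groupBy(col("region")).count().show()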
High Complexity of Managing Spark Clusters
Traditional data architecture is complex and costly to maintain. Organizations using Spark often pay for duplicated storage, redundant or unnecessary pipelines and processing, and long maintenance hours. Additionally, they face hidden costs from infrastructure and talent resources.
Snowflake’s Snowpark addresses this challenge as well by eliminating maintenance and overhead. Snowflake’s managed services have near-zero maintenance requirements, allowing teams to focus more on building and less on managing.
Inconsistent Governance and Security Policies
Platforms such as Apache Spark are insecure by default. For example, Spark allows developers to install libraries from third parties or from anywhere on the internet while leaving key security concerns, such as unwanted network access, unaddressed. That leaves the door wide open for unwanted data exfiltration over the internet unless teams spend considerable time manually adjusting security configurations between these platforms and cloud providers.
The traditional data architecture often used with Spark also creates significant security risks and governance issues due to the fact that data is being moved around and stored in siloed locations. Data silos bring inconsistent governance and security policies across different systems.
With Snowpark, administrators have full control over which libraries are allowed to execute inside the Java/Scala runtimes for Snowpark. In addition, Java/Scala runtimes on Snowflake’s virtual warehouses do not have access to the network and therefore avoid problems such as unwanted network access and data exfiltration by default, without any additional configuration.
With a streamlined architecture, organizations can implement a unified governance framework and set of security policies with one single platform.
Spark and Snowflake
Snowflake’s Snowpark Delivers the Benefits of Spark with None of the Complexities
Snowflake’s Snowpark framework brings integrated, DataFrame-style programming to the languages developers like to use and performs large-scale data processing, all executed inside of Snowflake. Here are just a few of the things that organizations are accomplishing using Snowpark.
Improve collaboration: Bring all teams to collaborate on the same data in a single platform that natively supports everyone’s programming language and constructs of choice, including Spark DataFrames.
Accelerate time to market: Enable technical talent to increase the pace of innovation on top of existing data investments with native support for cutting-edge open-source software and APIs.
Lower total cost of ownership: Streamline architecture to reduce infrastructure and operational costs from unnecessary data pipelines and Spark-based environments.
Reduce security risks: Exert full control over libraries being used. Provide teams with a single and governed source of truth to access and process data to simplify data security and compliance risk management across projects.
Thanks to Snowflake’s Snowpark, organizations can achieve lightning-fast data processing using their developers’ favorite programming languages and coding constructs. At the same time, they can enjoy all the advantages that the Snowflake Data Cloud offers.
In addition, Snowflake's platform can also connect with Spark: the Snowflake Connector for Spark keeps Snowflake open to complex Spark workloads by letting Spark jobs read data from and write data to Snowflake.
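As an illustration, a Spark job can read a Snowflake table through the connector; the sketch below follows the connector's sfOptions pattern with placeholder values and a hypothetical ORDERS table:

// Connection options for the Snowflake Connector for Spark (all values are placeholders).
val sfOptions = Map(
  "sfURL" -> "<account_identifier>.snowflakecomputing.com",
  "sfUser" -> "<user>",
  "sfPassword" -> "<password>",
  "sfDatabase" -> "<database>",
  "sfSchema" -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

// Read a Snowflake table into a Spark DataFrame through the connector.
val ordersDf = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "ORDERS")
  .load()
ordersDf.show()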