Spark and Spark SQL

LiveRunGrow
10 min read · Oct 2, 2020

What is Spark? What are the differences between Spark and Cassandra? How was Spark derived from Hadoop? What are some important notes from the documentation about Spark SQL (A module for using Spark)?

(1) Spark

Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances. On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application.

How a Spark application runs on a cluster:

  • A Spark application runs as independent processes, coordinated by the SparkSession object in the driver program.
  • The resource or cluster manager assigns tasks to workers, one task per partition.
  • A task applies its unit of work to the dataset in its partition and outputs a new dataset partition. Because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets across iterations (see the sketch after this list).
  • Results are sent back to the driver application or can be saved to disk.
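As a rough sketch of the caching point above (a minimal example; the input path and the iteration logic are hypothetical, and the session runs in local mode purely for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("caching-sketch").master("local[*]").getOrCreate()

// Hypothetical line-oriented input file.
val lines   = spark.sparkContext.textFile("data/points.txt")
val lengths = lines.map(_.length).cache()   // keep the computed partitions in memory

// An iterative algorithm re-reads the same dataset many times;
// without cache() each pass would recompute the lineage from scratch.
var total = 0L
for (_ <- 1 to 10) {
  total += lengths.reduce(_ + _)
}
println(s"total = $total")
```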

Spark supports several resource/cluster managers:

  • Spark Standalone — a simple cluster manager included with Spark
  • Apache Mesos — a general cluster manager that can also run Hadoop applications
  • Apache Hadoop YARN — the resource manager in Hadoop 2
  • Kubernetes — an open source system for automating deployment, scaling, and management of containerized applications
How are queries executed? The RDDs that Spark SQL ends up computing are different from a naively written, inefficient pipeline (explained below): the optimiser produces an optimised result, for example by pushing the filter down so that the join operates on a smaller, already-filtered dataset.
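A minimal sketch of this behaviour (the column names and rows are made up): even when the filter is written after the join, the optimised plan printed by explain() shows the filter pushed below the join.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical datasets: users and their orders.
val users  = Seq((1, "SG"), (2, "US"), (3, "SG")).toDF("userId", "country")
val orders = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("userId", "amount")

// The filter is written last, but Catalyst pushes it below the join
// so the join runs on the smaller, already-filtered input.
val q = users.join(orders, "userId").filter($"country" === "SG")
q.explain(true)   // prints the parsed, analysed, optimised and physical plans
```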

Use Cases:

Spark is often used with distributed data stores such as HPE Ezmeral Data Fabric, Hadoop’s HDFS, and Amazon’s S3, with popular NoSQL databases such as HPE Ezmeral Data Fabric, Apache HBase, Apache Cassandra, and MongoDB, and with distributed messaging stores such as HPE Ezmeral Data Fabric and Apache Kafka.

Stream processing: From log files to sensor data, application developers are increasingly having to cope with “streams” of data. This data arrives in a steady stream, often from multiple sources simultaneously. While it is certainly feasible to store these data streams on disk and analyze them retrospectively, it can sometimes be sensible or important to process and act upon the data as it arrives. Streams of data related to financial transactions, for example, can be processed in real time to identify, and refuse, potentially fraudulent transactions.

Machine learning: As data volumes grow, machine learning approaches become more feasible and increasingly accurate. Software can be trained to identify and act upon triggers within well-understood data sets before applying the same solutions to new and unknown data. Spark’s ability to store data in memory and rapidly run repeated queries makes it a good choice for training machine learning algorithms. Running broadly similar queries again and again, at scale, significantly reduces the time required to go through a set of possible solutions in order to find the most efficient algorithms.

Interactive analytics: Rather than running pre-defined queries to create static dashboards of sales or production line productivity or stock prices, business analysts and data scientists want to explore their data by asking a question, viewing the result, and then either altering the initial question slightly or drilling deeper into results. This interactive query process requires systems such as Spark that are able to respond and adapt quickly.

Data integration: Data produced by different systems across a business is rarely clean or consistent enough to simply and easily be combined for reporting or analysis. Extract, transform, and load (ETL) processes are often used to pull data from different systems, clean and standardize it, and then load it into a separate system for analysis. Spark (and Hadoop) are increasingly being used to reduce the cost and time required for this ETL process.

(2) What are the differences between Spark and Cassandra?

https://stackshare.io/stackups/cassandra-vs-spark

Cassandra vs Apache Spark: What are the differences?

What is Cassandra? A partitioned row store. Rows are organized into tables with a required primary key. Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent manner. Cassandra will automatically repartition as machines are added and removed from the cluster. Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.

What is Apache Spark? Fast and general engine for large-scale data processing. Spark is a fast and general processing engine compatible with Hadoop data. Programming languages supported by Spark include: Java, Python, Scala, and R. Application developers and data scientists incorporate Spark into their applications to rapidly query, analyze, and transform data at scale. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Cassandra belongs to “Databases” category of the tech stack, while Apache Spark can be primarily classified under “Big Data Tools”.

(3) How was Spark derived from Hadoop?

This is a great article detailing the history of Spark. Highly recommended for understanding the relation between Spark and Hadoop.

The goal of the Spark project was to keep the benefits of MapReduce’s scalable, distributed, fault-tolerant processing framework, while making it more efficient and easier to use. The advantages of Spark over MapReduce are:

  • Spark executes much faster by caching data in memory across multiple parallel operations, whereas MapReduce involves more reading and writing from disk.
  • Spark runs multi-threaded tasks inside of JVM processes, whereas MapReduce runs as heavier weight JVM processes. This gives Spark faster startup, better parallelism, and better CPU utilization.
  • Spark provides a richer functional programming model than MapReduce.
  • Spark is especially useful for parallel processing of distributed data with iterative algorithms.

(4) Spark SQL

This section of the article focuses on Spark SQL. It is written based on the official Spark SQL documentation.

What is Spark SQL?

  • Spark SQL is Spark’s module for working with structured data, either within Spark programs or through standard JDBC and ODBC connectors.
  • Recall the Spark stack described earlier: Spark SQL is simply one of the four modules built on the core engine (alongside the libraries for machine learning, graph computation, and stream processing).
  • It is used for structured data processing. Unlike the basic Spark RDD API, which works with opaque Java objects, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed.

How to use Spark SQL?

  • We can interact with Spark and manipulate data using SQL queries or the Dataset/DataFrame API, from the command line or over JDBC/ODBC.
  • Spark SQL is used to execute SQL queries. It can also be used to read data from an existing Hive installation.
  • The results of a SQL query are returned as a Dataset/DataFrame, as in the sketch below.
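A minimal sketch of running a SQL query against a temporary view and getting a DataFrame back (the view name and rows are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")
people.createOrReplaceTempView("people")   // register the DataFrame as a temporary view

// The result of spark.sql(...) is itself a DataFrame.
val adults = spark.sql("SELECT name FROM people WHERE age > 30")
adults.show()
```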

What are Resilient Distributed Dataset RDDs?

An RDD is a collection of objects held in memory, partitioned into smaller chunks. Internally, Spark distributes the data in an RDD to different nodes across the cluster to achieve parallelism.

Characteristics:

  1. Distributed Data Abstraction. Imagine you have a lot of data stored in partitions in different places. You can write queries that execute in parallel across these partitions.
  2. Resilient & Immutable. Each transformation takes you from one RDD to the next, and these steps get recorded as a lineage of where you came from. If something goes wrong, you can trace back your steps. RDDs are immutable because each transformation leaves the original RDD unchanged, so at any point we can recreate an RDD from its lineage.
  3. Compile-time Type-Safe.
  4. Unstructured/Structured Data: text (logs, tweets, articles, social media).
  5. Lazy. Transformations don’t materialise until an action is performed; only then do all the transformations get executed.

Some Transformations:
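A minimal sketch of a few common transformations (flatMap, filter, map, reduceByKey); nothing runs until the collect() action at the end (sample data made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val lines = sc.parallelize(Seq("to be or not to be", "that is the question"))

// Transformations are lazy: they only build up the lineage.
val words   = lines.flatMap(_.split(" "))
val longish = words.filter(_.length > 2)
val pairs   = longish.map(w => (w, 1))
val counts  = pairs.reduceByKey(_ + _)

// The action triggers execution of the whole lineage.
counts.collect().foreach(println)
```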

Use RDDs when….

  • You want to control how to manipulate the dataset
  • You have unstructured data
  • Manipulate data with lambda functions
  • You are willing to sacrifice optimisation and performance: Spark does not optimise RDD code for you, so efficiency depends on how well you write it
  • Problem: execution can be slow (e.g. Java serialisation overhead), and hand-written RDD code is often inefficient

What are Datasets and DataFrames?

  • Since there could be some problems with RDD code, an alternative is to use DataFrame/Dataset APIs instead.
  • A Dataset is a distributed collection of data. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
  • A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Getting Started

SparkSession class: Entry point to all functionality in Spark. An application can have multiple SparkSessions.

It allows programming with DataFrame and Dataset APIs.

To create a SparkSession instance, we use SparkSession.builder().

SparkSession in Spark 2.0 provides builtin support for Hive features including the ability to write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. To use these features, you do not need to have an existing Hive setup.
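A minimal sketch of building a session (the app name and warehouse path are arbitrary; enableHiveSupport() assumes the spark-hive module is on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-example")
  .master("local[*]")                                          // local mode, for illustration
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")   // arbitrary warehouse location
  .enableHiveSupport()                                         // HiveQL, Hive UDFs, Hive tables
  .getOrCreate()

import spark.implicits._   // enables $"col" syntax, toDF/toDS, etc.
```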

Creating DataFrames: With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
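A short sketch of these three routes, reusing a local session (the JSON path and Hive table name are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-sources-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// 1. From a Spark data source (hypothetical JSON file).
val fromFile = spark.read.json("examples/people.json")

// 2. From an existing RDD.
val rdd     = spark.sparkContext.parallelize(Seq(("Alice", 29), ("Bob", 35)))
val fromRdd = rdd.toDF("name", "age")

// 3. From a Hive table (only if Hive support is enabled).
// val fromHive = spark.table("my_database.my_table")
```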

Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.
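A sketch of both approaches (the Person case class and its columns are made up):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("schema-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Method 1: reflection — the schema is inferred from the case class fields.
case class Person(name: String, age: Int)
val byReflection = spark.sparkContext
  .parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))
  .toDF()

// Method 2: programmatic — build the schema at runtime and apply it to an RDD[Row].
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)
))
val rowRdd        = spark.sparkContext.parallelize(Seq(Row("Alice", 29), Row("Bob", 35)))
val byProgramming = spark.createDataFrame(rowRdd, schema)
```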

DataFrame Operations

In Spark 2.0, DataFrames are just Datasets of Rows in the Scala and Java APIs. These operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.

With the DataFrame API you tell Spark what you want to do, rather than how to do it as you would with RDDs. The same query can be expressed in SQL instead; the effect is the same.
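A minimal sketch of the same query in both styles (rows made up); both produce the same result:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-vs-sql-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("Alice", 29), ("Bob", 35), ("Cate", 35)).toDF("name", "age")

// Declarative DataFrame API: say what you want, not how to compute it.
df.groupBy($"age").count().filter($"age" > 30).show()

// The same query expressed in SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT age, COUNT(*) AS count FROM people WHERE age > 30 GROUP BY age").show()
```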

Comparing RDD with DataFrame API:

  • DataFrames are faster than RDDs.
  • Datasets use less memory than in-memory RDDs.

Creating Datasets

DataFrame -> Dataset: a DataFrame can be converted into a typed Dataset.

Datasets are similar to RDDs; however, instead of using Java serialisation, they use a specialised Encoder to serialise the objects for processing or transmitting over the network.

While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.
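A minimal sketch of creating a Dataset from objects and converting a DataFrame into one with as[...] (the Person case class and rows are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dataset-sketch").master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Long)

// Dataset from objects: the case-class Encoder is picked up implicitly.
val ds = Seq(Person("Alice", 29L), Person("Bob", 35L)).toDS()

// DataFrame -> Dataset: attach a type with as[...].
val df    = Seq(("Alice", 29L), ("Bob", 35L)).toDF("name", "age")
val typed = df.as[Person]

// Typed transformation, checked at compile time.
typed.filter(_.age > 30).show()
```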

When to use DataFrames & Datasets

  • When you want a high-level API
  • Strong type safety
  • Ease of use & readability
  • Express what to do, not how to do it
  • When you have a structured data schema
  • Let Spark do the code optimisation & performance tuning

In Conclusion


Scala Syntax, Coding Time!

Spark RDD APIs — RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records and the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a fault-tolerant manner, thus speeding up the task.

Spark DataFrame APIs — Unlike an RDD, data is organized into named columns, like a table in a relational database. A DataFrame is an immutable distributed collection of data that lets developers impose a structure onto a distributed collection, allowing higher-level abstraction.

Spark Dataset APIs — Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. A Dataset takes advantage of Spark’s Catalyst optimizer by exposing expressions and data fields to the query planner.
