Getting Started With Apache Spark

In the Hadoop world of Big Data, components such as Hive (a SQL construct), Pig (a scripting construct), and MapReduce (Java programming) are used to perform data transformations and aggregations. Apache Spark achieves the same with additional advantages: a unified API, support for multiple languages, and performance that can be 10x-100x faster than MapReduce. Spark provides a single platform with SQL, scripting, and programming constructs.

Big Data (Setting up the context)

The amount of data has grown considerably in recent years due to the growth of social networking, education, surveillance cameras, healthcare, business, satellite imagery, manufacturing, online purchasing, research, banking, bioinformatics, the Internet of Things, criminal investigation, media, information technology, and so on. This huge volume of data has given rise to a new field of data processing called Big Data.

Data can be private or public

  • Private data includes
    • Surveys/Questionnaires
    • Clicks
    • Messages
    • Transactions
    • Page views
    • Purchases
  • Public data includes
    • Tweets
    • Blogs
    • Reports
    • Comments
    • Reviews

So, to do something meaningful with the data, we have to convert unstructured data, which is messy and semantically complex, into structured data, which is clean and easy to consume. This is called Data Processing.

Data Processing Tasks

  • Parsing fields from the text (see the sketch after this list).
  • Accounting for missing values.
  • Identifying and investigating anomalies.
  • Summarizing using tables and charts.
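To make the first two tasks concrete, here is a minimal Scala sketch (Scala being one of the Spark APIs used later in this article) that parses fields from raw text and accounts for missing values. The record layout ("name,age,city") and the sample lines are made-up placeholders, not data from the article.

    // Parse comma-separated text into typed records, keeping missing values explicit.
    case class Person(name: String, age: Option[Int], city: String)

    def parseLine(line: String): Person = {
      val fields = line.split(",", -1).map(_.trim)       // -1 keeps trailing empty fields
      Person(
        name = fields(0),
        age  = scala.util.Try(fields(1).toInt).toOption, // empty or bad age becomes None
        city = if (fields(2).isEmpty) "unknown" else fields(2)
      )
    }

    val rawLines = Seq("Alice,34,Delhi", "Bob,,Mumbai", "Carol,29,")
    val people   = rawLines.map(parseLine)

    // A tiny summary: how many records are missing an age?
    println(people.count(_.age.isEmpty))   // prints 1

The same idea carries over to Spark: once parsing and missing-value handling are expressed as plain functions, they can be applied to billions of rows instead of three.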

The complexity of data can be measured by how messy it is and how quickly it scales, as outlined in the points below.

  • Spreadsheets

    • Low data collection frequency.
    • 10-100s of rows per day.
    • Sometimes involves manual data collection.
    • Many files.

  • Database

    • High frequency of collection.
    • ~100k rows per day.
    • Programmatically collected.
    • ACID transactions.

  • Distributed Computing

    • A very high frequency of data collection.
    • Millions to billions of rows per day.
    • Files stored across a cluster of machines.
    • Many, many files (web pages, log files).

Tools for Data Processing


Apache Spark

Apache Spark is an open-source, lightning-fast cluster-computing framework. It is an engine for data processing and analytics.

Features

  • Speed.
  • Support for multiple languages.
  • Advanced analytics.

Characteristics

  • General Purpose

    • Exploring data.
    • Cleaning and preparing data.
    • Applying machine learning.
    • Building data applications.

  • Interactive Environment

    • Provides a REPL (Read-Evaluate-Print Loop); see the spark-shell sketch after this list.
    • Fast feedback on every expression.

  • Distributed Computing

    • Processes data across a cluster of machines.
    • Integrates with Hadoop.
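To illustrate the interactive environment, here is roughly what a spark-shell session (the Scala REPL for Spark) looks like. It assumes Spark is installed and that the shell has already created a SparkContext named sc, which it does by default.

    scala> val nums = sc.parallelize(1 to 100)
    nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

    scala> nums.filter(_ % 2 == 0).count()
    res0: Long = 50

Each expression is read, evaluated, and its result printed immediately, which is what makes exploring and cleaning data fast.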

To work with Spark, we use one of its language APIs:

  • Scala
  • Python
  • Java

Almost all data in Spark is processed using a core data structure called the RDD (Resilient Distributed Dataset); a small word-count sketch follows the list below.

  • RDDs are the main programming abstraction in Spark.
  • With RDDs, you can interact with billions of rows of data.
  • RDDs are in-memory collections of objects distributed across the cluster.
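As a quick illustration of the RDD API, here is the classic word count. It is a sketch assuming a spark-shell session where sc already exists, and the input path is a placeholder.

    // Build an RDD from a text file and count word occurrences.
    val lines = sc.textFile("data/sample.txt")   // RDD[String], one element per line

    val wordCounts = lines
      .flatMap(_.split("\\s+"))                  // split each line into words
      .map(word => (word, 1))                    // pair each word with a count of 1
      .reduceByKey(_ + _)                        // sum the counts per word across the cluster

    wordCounts.take(5).foreach(println)          // pull a small sample back to the driver

The transformations (flatMap, map, reduceByKey) are lazy; nothing runs on the cluster until an action such as take or count is called.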
Besides the RDD abstraction, Spark also needs a storage system that stores the data to be processed and a cluster manager that helps Spark run tasks across a cluster of machines; both are covered in the next section.

Components of Spark


  • Spark Core - the computing engine.
  • Storage system - stores the data to be processed. You can use the local file system or HDFS.
  • Cluster manager - helps Spark run tasks across a cluster of machines. You can use Spark's built-in cluster manager or YARN (Yet Another Resource Negotiator).

Both the storage system and the cluster manager are plug-and-play components.
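The following sketch shows how these plug-and-play choices surface in code: the master URL selects the cluster manager (local threads for testing, Spark's built-in standalone manager, or YARN), and the input URI selects the storage system (local file system or HDFS). The application name, host names, and paths are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ComponentsDemo")
      .setMaster("local[*]")   // or "spark://master-host:7077" (standalone), or "yarn"

    val sc = new SparkContext(conf)

    val localData = sc.textFile("file:///tmp/input.txt")                  // local file system
    val hdfsData  = sc.textFile("hdfs://namenode:9000/data/input.txt")    // HDFS

    println(localData.count())
    sc.stop()

Swapping the storage system or the cluster manager changes only these settings, not the transformation logic itself.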

Apache Spark Ecosystem


In the next article, we will discuss RDDs in more detail and learn how to load a dataset.
