In Big Data, Hadoop components such as Hive (SQL construct), Pig (scripting construct), and MapReduce (Java programming) are used to perform all the data transformations and aggregations. Now, with Apache Spark, the same can be achieved with many more advantages: a unified API, support for multiple languages, and performance that is 10x-100x faster than MapReduce. Spark provides a single platform with SQL, scripting, and programming constructs.
Big Data (Setting up the context)
The amount of data has grown considerably in recent years due to the growth of social networking, education, surveillance cameras, healthcare, business, satellite imagery, manufacturing, online purchasing, research analysis, banking, bioinformatics, the Internet of Things, criminal investigation, media, information technology, and more. This huge volume of data has created a new field of data processing called Big Data.
Data can be private or public
- Private data includes
- Surveys/Questionnaire
- Clicks
- Messages
- Transactions
- Page views
- Purchases
- Public data includes
- Tweets
- Blogs
- Reports
- Comments
- Reviews
So, to do something meaningful with the data, we have to convert unstructured data, which is messy and semantically complex, into structured data, which is clean and easy to consume. This is called data processing; a short code sketch of its typical tasks follows the list below.
Data Processing Tasks
- Parsing fields from the text.
- Accounting for missing values.
- Identifying and investigating anomalies.
- Summarizing using tables and charts.
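As a small illustration of the first two tasks, here is a minimal PySpark sketch. The file name `users.txt` and its comma-separated layout are assumptions for this example, not part of any particular dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataProcessingTasks").getOrCreate()

# Parsing fields from the text: split each comma-separated line into fields.
lines = spark.sparkContext.textFile("users.txt")  # hypothetical input file
rows = lines.map(lambda line: line.split(","))

# Accounting for missing values: keep only rows where every field is non-empty.
complete = rows.filter(lambda fields: all(f.strip() for f in fields))

print(complete.take(5))  # inspect a small sample of the cleaned rows

spark.stop()
```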
The complexity of data can be measured by its messiness and by the speed at which it scales, as explained in the points below.
- Spreadsheets
- Low data collection frequency.
- 10-100s of rows per day.
- Sometimes involves manual data collection.
- Many files.
- Database
- High frequency of collection.
- 100k rows per day.
- Programmatically corrected.
- ACID transactions (Atomicity, Consistency, Isolation, Durability).
- Distributed Computing
- A very high frequency of data collection.
- Millions to billions of rows per day.
- Files stored across a cluster of machines.
- Many many files (Web pages, log files).
Tools for Data Processing
Apache Spark
Apache Spark is an open-source, lightning-fast, cluster-computing framework. It is an engine for data processing and analytics.
Features
- Speed
- Support for Multiple Languages.
- Advanced Analytics.
Characteristics
- General Purpose
- Exploring data.
- Cleaning and preparing data.
- Applying machine learning.
- Building data applications.
- Interactive Environment
- Called a REPL (Read-Evaluate-Print Loop).
- Fast Feedback.
- Distributed Computing
- Processes data across a cluster of machines.
- Integrates with Hadoop.
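To see the interactive environment in action, you can start the PySpark shell with the `pyspark` command, which creates a SparkContext named `sc` for you. A minimal session, with illustrative values, might look like this:

```python
# Inside the pyspark shell, `sc` (the SparkContext) is already defined.
>>> data = sc.parallelize(range(10))        # distribute a small collection
>>> data.filter(lambda x: x % 2 == 0).collect()
[0, 2, 4, 6, 8]                             # fast feedback on each expression
```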
In order to work with Spark, we have to use the Spark APIs. Almost all data is processed using specific data structures called RDDs (Resilient Distributed Datasets).
- RDDs are the main programming abstraction in Spark.
- With RDDs, you can interact with billions of rows of data.
- RDDs are in-memory collections of objects.
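Here is a minimal, self-contained sketch of creating and transforming an RDD; the numbers are purely illustrative.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDExample")

# An RDD is an in-memory, distributed collection of objects.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations such as map() are lazy; collect() triggers execution.
squares = numbers.map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9, 16, 25]

sc.stop()
```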
Components of Spark
- Spark Core - the computing engine.
- Storage System - stores the data to be processed. For the storage system, you can use a local file system or HDFS.
- Cluster Manager - helps Spark run tasks across a cluster of machines. For the cluster manager, you can use the built-in standalone cluster manager or YARN (Yet Another Resource Negotiator).
The storage system and the cluster manager are both plug-and-play components.
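Because these components are pluggable, the same application code can target different cluster managers. Below is a minimal sketch assuming a local standalone setup; switching to YARN changes only the master URL, not the processing logic.

```python
from pyspark.sql import SparkSession

# Built-in cluster manager, running locally on all available cores:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("PlugAndPlay") \
    .getOrCreate()

# To run the same job on a YARN cluster instead, only the master changes,
# e.g. .master("yarn"); the rest of the application stays untouched.
print(spark.range(5).count())  # 5

spark.stop()
```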
Apache Spark Ecosystem
In the next article, we will discuss RDDs in more detail and learn how to load a dataset.