Introducing Hadoop / Big Data For Beginners

The distributed system is the core concept behind Hadoop.

Why do we need a Distributed System?

Companies like Facebook, NASA, and Google have huge amounts of data (hundreds of petabytes or even exabytes). Every like, click, comment, review, update, search, call, visit, piece of financial and health information, transaction, photo upload, etc., gets stored somewhere (in a database). This very large collection of data is called Big Data.

The amount of data has grown considerably in recent years due to the growth of social networking, education, surveillance cameras, health care, business, satellite images, manufacturing, online purchasing, research analysis, banking, bioinformatics, the Internet of Things, criminal investigation, media, information technology, etc. This huge volume of data has created a new field of data processing called Big Data.

When the amount of data is small, a system with limited resources (hardware, software, etc.) can handle it, but as the data grows, that same system can no longer cope. One idea is to scale up the system (bigger hardware, bigger software), but this approach has serious drawbacks:

  • Expensive hardware costs.
  • Expensive software costs.
  • High risk of failure.

But the data keeps increasing, so Hadoop's distributed computing concepts came to the rescue.

Instead of a single powerful machine, the task is distributed among a cluster of machines.

Advantages

  • Commodity hardware.
  • The software is free (no license costs).
  • Reduced risk of failure; no single point of failure (SPOF).
  • About 10 times the processing power at 1/10th of the cost.

Big Data

Big Data refers to data sets that are so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.

Big Data poses several challenges, including data capture, storage, querying, analysis, search, sharing, transfer, visualization, updating, and information privacy.

There are three dimensions to Big Data, known as Volume (the sheer amount of data), Variety (the different formats it comes in), and Velocity (the speed at which it arrives).

Big Data System Requirements

Traditional data technologies are not able to meet these requirements. A single system simply does not have the resources (memory, storage, processing speed, time) that Big Data demands, so we need a distributed system. Distributed computing frameworks like Hadoop were developed for exactly this.

There are two ways to build a system: Monolithic or Distributed.

Monolithic

  • One powerful player.
  • A single powerful server.
  • 2x the expense.
  • Less than 2x the performance.

Distributed

  • A team of good players who know how to pass.
  • A cluster of decent machines that know how to parallelize.

Distributed System

Many small, cheap computers come together and act as a single entity, and such a system can scale linearly:

  • 2x nodes
  • 2x storage
  • 2x speed

Server Farms

Companies like Facebook, Google, and Amazon are building vast server farms. These farms have thousands of servers working in tandem to process complex data.

All these servers need to be co-ordinated by a single piece of software.

Single co-ordinating software must:

  • Partition the data.
  • Co-ordinate computing tasks.
  • Handle fault tolerance and recovery.
  • Allocate capacity to processes.

Introducing Hadoop

  • Google developed software to run on these distributed systems.
  • It stores millions of records on multiple machines.
  • It runs processes on all these machines to crunch the data.

Hadoop

There are two core components of Hadoop:

HDFS: Hadoop Distributed File System

MR: MapReduce

HDFS/MR

A framework to define data processing tasks.

YARN (Yet Another Resource Negotiator)

A framework to run the data processing tasks. (We will discuss this more in a later blog.)

Co-ordination between Hadoop Blocks

Fig. a: The user defines map and reduce tasks using the MapReduce API.

Fig. b: A job is triggered on the cluster.

Fig. c: YARN figures out where and how to run the job, and the results are stored in HDFS.
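
To make this flow concrete, here is a minimal word-count sketch showing how map and reduce tasks can be defined with the Hadoop MapReduce Java API. The class name, input/output paths, and whitespace tokenization are illustrative assumptions, not details from the figures above.

```java
// A minimal word-count sketch using the Hadoop MapReduce Java API.
// Class name and paths are illustrative placeholders.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: emit (word, 1) for every word in the input split (Fig. a).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce task: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count"); // Fig. b: trigger a job on the cluster
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input lives in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // Fig. c: results written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job is typically packaged as a jar and submitted to the cluster, where YARN decides on which nodes the map and reduce tasks actually run.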

Other technologies in the Hadoop Ecosystem

An ecosystem of tools has sprung up around this core piece of software.

Hive

Provides an SQL interface to Hadoop.

It is the bridge to Hadoop for folks who don't have exposure to OOP in Java.
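
As a rough illustration of that SQL interface, the sketch below queries Hive from Java over JDBC. The HiveServer2 URL, the page_views table, and the credentials are assumptions made up for this example.

```java
// A minimal sketch of querying Hive over JDBC.
// URL, table name, and credentials are placeholders; the Hive JDBC driver
// (hive-jdbc) must be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 usually listens on port 10000; adjust host and credentials for your cluster.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         // Plain SQL, even though the data lives in HDFS and the query runs as distributed jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT country, COUNT(*) AS visits FROM page_views GROUP BY country")) {
      while (rs.next()) {
        System.out.println(rs.getString("country") + " -> " + rs.getLong("visits"));
      }
    }
  }
}
```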

HBase

A database management system on top of Hadoop.

It integrates with your application just like a traditional database.
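
Here is a hedged sketch of what that integration can look like with the HBase Java client API. The table name (users), column family (info), row key, and column are illustrative assumptions.

```java
// A minimal sketch of writing and reading a row with the HBase Java client.
// Table name, column family, and row key are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one row keyed by "user123", with a column in the "info" family.
      Put put = new Put(Bytes.toBytes("user123"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Chennai"));
      table.put(put);

      // Read the row back, much like fetching a record from a traditional database.
      Result result = table.get(new Get(Bytes.toBytes("user123")));
      String city = Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city")));
      System.out.println("city = " + city);
    }
  }
}
```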

Pig

A data manipulation language.

It transforms unstructured data into a structured format.

You can then query this structured data using interfaces like Hive.

Spark

A distributed computing engine used along with Hadoop.

It provides an interactive shell to quickly process datasets.

It has a number of built-in libraries for machine learning, stream processing, graph processing, etc.
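
For a small taste, here is a minimal sketch of a word count written against Spark's Java API (Spark 2.x or later), reading from and writing to HDFS. The HDFS paths and application name are assumptions for illustration.

```java
// A minimal word-count sketch using Spark's Java API over files in HDFS.
// The HDFS paths and app name are placeholders; the master URL is normally
// supplied by spark-submit.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read input from HDFS; Spark parallelizes the work across the cluster.
    JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    counts.saveAsTextFile("hdfs:///data/output"); // results written back to HDFS
    sc.stop();
  }
}
```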

Oozie

A tool to schedule workflows on all the Hadoop ecosystem technologies.

Flume/ Sqoop

Tools to transfer data between other systems and Hadoop: Sqoop moves bulk data between Hadoop and relational databases, while Flume ingests streaming data (such as log files) into Hadoop.

This is just a basic introduction to Hadoop. In the next blog, we will discuss the Hadoop Distributed File System (HDFS) commands and the HDFS architecture.
