The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines, with each machine offering local computation and storage. Rather than relying on hardware to deliver high availability, the Hadoop library itself is designed to detect and handle failures at the application layer, so it delivers a highly available service on top of a cluster of computers.
 
The project includes these modules:

- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout: A scalable machine learning and data mining library.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
In the Microsoft Azure portal, HDInsight is the name of the cloud-based Hadoop service; it is Microsoft's managed Big Data stack in the cloud. With Azure you can provision clusters running Storm, HBase, and Hive that can process thousands of events per second, store petabytes of data, and give you a SQL-like interface for queries.
 
Big data is often characterized by three V's: Volume, Variety, and Velocity.
 
The IT industry is full of data: the world's population is currently about 7.2 billion, while connected devices number around 15 billion. Estimates suggest that by 2020 the number of devices will double to 30 billion, and each device will produce data. So suppose you have a large unstructured data set and want to run a Hive query on it to extract meaningful information; the raw data has to be converted into something meaningful. Azure HDInsight uses Azure Storage to store data: when you create an HDInsight cluster you specify a storage account, and a specific blob container in that account serves as the file system alongside HDFS. Hadoop offers a distributed platform to store and manage big data. You can query the unstructured data using Hive, which enables querying and managing large amounts of data with a SQL-like query language. You can then export the results and import them into Microsoft Excel or any other BI tool.
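In miniature, the workflow described above, turning raw text into meaningful information through a SQL-like query, can be sketched with Python's standard library, using sqlite3 as a stand-in for Hive's query engine. The log lines, table name, and columns here are all hypothetical, not part of any HDInsight sample:

```python
import sqlite3

# Hypothetical unstructured input: raw web-server log lines.
raw_logs = [
    "2020-01-01 GET /index.html 200",
    "2020-01-01 GET /missing 404",
    "2020-01-02 GET /index.html 200",
    "2020-01-02 GET /index.html 200",
]

# Step 1: impose structure on the raw text
# (Hive does this by projecting a table schema onto files).
rows = [tuple(line.split()) for line in raw_logs]

# Step 2: load the rows into a queryable table and run a SQL-style
# aggregation, much as a HiveQL GROUP BY would summarize the data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (day TEXT, verb TEXT, path TEXT, status TEXT)")
con.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", rows)

hits = con.execute(
    "SELECT path, COUNT(*) AS n FROM logs "
    "WHERE status = '200' GROUP BY path ORDER BY n DESC"
).fetchall()
print(hits)  # successful requests per page: [('/index.html', 3)]
```

In practice Hive runs this kind of GROUP BY across terabytes of blob-stored data in parallel, and the results can then be pulled into Excel or another BI tool.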
 
 Apache Spark cluster on HDInsight
 
You first create a storage account in Azure, then create a Spark cluster on Azure, and you can run Spark SQL statements using notebooks. Jupyter notebooks are popular for writing Spark SQL queries; by default a Jupyter notebook comes with a Python 2 kernel, and HDInsight Spark clusters provide two additional kernels that you can use with Jupyter. These are:
- PySpark (for applications written in Python)
- Spark (for applications written in Scala)
A couple of key benefits of the PySpark kernel:

- You do not need to set the contexts for Spark, SQL, and Hive; they are set for you automatically.
- You can use cell magics (such as %%sql or %%hive) to run your SQL or Hive queries directly, without any preceding code snippets.
- The output of SQL or Hive queries is automatically visualized.
 
HBase implements a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data.
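The Bigtable-style data model can be pictured as a sparse, nested map: row key → column family → column qualifier → value. A minimal sketch in plain Python (the table contents, row keys, and column names are made up for illustration) shows how random access by row key works:

```python
# A toy wide-column store: {row_key: {family: {qualifier: value}}}.
# Real HBase additionally versions each cell by timestamp and keeps
# row keys sorted to support range scans; both are omitted here.
table = {
    "user#1001": {
        "info": {"name": "Ada", "city": "London"},
        "stats": {"logins": "42"},
    },
    "user#1002": {
        "info": {"name": "Alan"},
    },
}

def get(row_key, family, qualifier):
    """Quick random access to a single cell, addressed by row key."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)

print(get("user#1001", "info", "city"))     # London
print(get("user#1002", "stats", "logins"))  # None: the model is sparse
```

Rows need not share the same columns, which is what makes the model a good fit for large, sparsely populated tables.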
 
 Apache Storm is a scalable, fault-tolerant, distributed, real-time computation  system for processing streams of data. With Storm on Azure HDInsight, you can  create a cloud-based Storm cluster that performs big data analytics in real  time.
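Conceptually, Storm pushes an unbounded stream of tuples through a topology of spouts (sources) and bolts (processing steps). A rough stdlib-Python analogy (the event stream is invented; real Storm topologies are typically written in Java) is a bolt that keeps a running count as events arrive:

```python
from collections import Counter

# Hypothetical stream of events; in Storm these would arrive
# continuously from a spout reading, e.g., an event queue.
events = ["click", "view", "click", "click", "view"]

def counting_bolt(stream):
    """Like a Storm bolt: process one tuple at a time, emit downstream."""
    counts = Counter()
    for event in stream:        # tuples are handled as they arrive
        counts[event] += 1
        yield dict(counts)      # emit the updated running counts

final = None
for snapshot in counting_bolt(events):
    final = snapshot            # a downstream bolt would consume each emit
print(final)  # {'click': 3, 'view': 2}
```

The point of the real system is that this per-tuple processing is distributed and fault-tolerant, so the counts keep flowing even if individual workers fail.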
 
To create an HDInsight cluster, you should first create a storage account in the Azure portal.
 
 ![create HD Insight cluster]()
 
After the storage account is created successfully, click the New button, then Data Services, then HDInsight, and you will find a list of Hadoop services: Hadoop, HBase, Storm, Spark, Linux, and custom create options. Click Hadoop, enter a unique custom name (*.azurehdinsight.net), select the cluster size (1-node, 2-node, or 4-node clusters); the HTTP username is fixed as admin, so you need to enter only a password, and then select the storage account name.
 
 ![storage account]()