Introduction
This blog post introduces the Hadoop framework.
Hadoop
- Hadoop is an open-source framework.
- Hadoop is used to store and process large amounts of application data across clusters of machines.
- Hadoop can store any kind of data, structured or unstructured, because its storage capacity grows as nodes are added.
- Hadoop can handle a very large number of concurrent tasks or jobs.
Hadoop History
- 2003 Google publishes the GFS (Google File System) paper.
- 2004 Google publishes the MapReduce paper.
- 2005 Nutch uses MapReduce.
- 2006 Hadoop is split out of Nutch as its own Apache project.
- 2008 Hadoop becomes an Apache top-level project.
- 2009 Yahoo uses Hadoop.
- 2013 Hadoop is used in many companies.
We need to understand Big Data before learning about the Hadoop Framework.
Big Data
- Big Data is a collection of datasets so large that they cannot be processed with traditional computing techniques. Big Data technologies make more accurate analysis possible, which makes the results more useful.
- Big Data brings some challenges. They include capturing, curating, storing, searching, sharing, transferring, analyzing, and presenting the data.
Why Hadoop is important
- It stores and processes large amounts of data quickly.
- Computing power.
- Fault tolerance.
- Flexibility & low cost.
- Scalability.
Challenges in Hadoop
- MapReduce programming is not a good fit for every problem.
- Data security issues.
- Hadoop is not easy to use.
Hadoop Data Gathering
Here, we will learn how to add our data to Hadoop.
- Third-party vendor connectors (SAS/ACCESS or SAS Data Loader for Hadoop) are used to load our data into Hadoop.
- Apache Flume is used to load the data to Hadoop.
- Some simple Java commands are used to transfer the data from files into Hadoop.
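As a rough illustration of the file-transfer route, the sketch below builds the `hdfs dfs -put` command that copies a local file into HDFS. It only assembles the command and does not run it. The function names and the paths `sales.csv` and `/data/raw` are hypothetical, chosen for this example.

```python
import shlex


def build_put_command(local_path, hdfs_dir):
    """Build the argument list for `hdfs dfs -put <local> <destination>`."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


def as_shell_string(args):
    """Render the argument list as a safely quoted shell command line."""
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical paths, used only for illustration.
command = build_put_command("sales.csv", "/data/raw")
print(as_shell_string(command))
```

On a machine with the Hadoop client installed, the same argument list could be passed to `subprocess.run(command, check=True)` to perform the copy.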
Hadoop Components
- HDFS (Hadoop Distributed File System): the distributed storage layer.
- MapReduce: the distributed data processing model.
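To make the MapReduce model more concrete, here is a minimal in-memory sketch of a word count. It runs on one machine and does not use Hadoop itself; it only imitates the three steps (map, shuffle, reduce) that the framework distributes across a cluster. All function names are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

In a real cluster, the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves the grouped pairs over the network to the reducers.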
Hadoop Architecture
- HDFS operates on top of an existing file system.
- Files are stored as blocks.
- Reliability is provided through replication of the blocks.
- The NameNode stores the metadata and manages access.
- Data is not cached, because the datasets are too large.
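The block and replication ideas above can be sketched in a few lines. The example assumes a block size of 128 MB and a replication factor of 3, and it places replicas with a simple round-robin rule. This is a toy model: real HDFS uses a rack-aware placement policy, and the function and node names here are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: map each block id to the nodes holding a copy.

    The returned mapping plays the role of the metadata kept by the NameNode.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block_id: [
            datanodes[(block_id + offset) % len(datanodes)]
            for offset in range(replication)
        ]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), block_map)
```

Because every block lives on three different nodes in this model, losing any single node still leaves two copies of each block, which is the reliability that replication is meant to provide.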