As we saw in the prior article, every machine works on its own portion of the data. Let's talk about data storage strategies and the key design goals and assumptions behind HDFS.
HDFS Design Goals
- Reliability
  - We should not lose data in any scenario.
  - We use many hardware devices, and inevitably something will fail (a hard disk, a network card, a server rack, and so on) at some point or another.
- Scalability
  - We should be able to add more hardware (workers) to get the job done.
- Cost Effectiveness
  - The system needs to be cheap; after all, we are building a poor man's supercomputer, and we do not have a budget for fancy hardware.
Data Assumptions
- Process large files (gigabytes and up), both horizontally and vertically.
- Data is append-only.
- Access to data is large and sequential.
- Write once, read many times.
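The HDFS shell reflects this append-only, write-once model: you can add data to the end of an existing file, but there is no command to overwrite bytes in the middle of one. A minimal sketch, assuming Hadoop 2.x and illustrative file names and paths:
$ # create the file once, then only ever append to it
$ hdfs dfs -put day1.log /logs/app.log
$ hdfs dfs -appendToFile day2.log /logs/app.log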
Throughput vs. Random Seek
Since we are working with large datasets and sequential reads, Hadoop and MapReduce are optimized for throughput, not random seeks.
In other words, if one must move a big house from one city to another, a big, slow truck will do a better job than a fancy, small, fast car.
Data Locality
It is cheaper to move the compute logic than the data. In Hadoop, we move the computation code to where the data is present, instead of moving the data back and forth to the compute servers, as typically happens in a client/server model.
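This works because the Name Node (described below) knows which data nodes hold each block of a file, and the scheduler uses those locations to place the computation. As an illustrative sketch (the path is hypothetical), fsck will print the block-to-node mapping for a file:
$ # list every block of the file and the data nodes that store it
$ hdfs fsck /logs/app.log -files -blocks -locations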
Data Storage in HDFS
Let's say we need to move a 1 GB text file into HDFS.
- HDFS will split the file into 64 MB blocks (16 blocks for our 1 GB file).
  - The block size is configurable (see the example after this list).
  - An entire block of data is used in the computation.
  - Think of a block as a sector on a hard disk.
- Each block will be sent to 3 machines (data nodes) for storage.
  - This provides reliability and efficient data processing.
  - The replication factor of 3 is configurable (see the example after this list).
  - A RAID configuration for storing the data is not required.
  - Since data is replicated 3 times, the usable storage space is reduced to a third: our 1 GB file occupies 3 GB of raw disk across the cluster.
- The accounting of each block is stored on a central server called the Name Node.
  - The Name Node is a master node that keeps track of each file, its corresponding blocks, and their data node locations.
  - MapReduce will talk to the Name Node and send the computation to the corresponding data nodes.
  - The Name Node is the key to all the data; hence, a Secondary Name Node periodically checkpoints its metadata to improve the reliability of the cluster.
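Both knobs can be adjusted from the shell. A minimal sketch, assuming Hadoop 2.x property names and illustrative paths: dfs.blocksize overrides the block size when a file is written, and -setrep changes a file's replication factor after the fact.
$ # write a file with 128 MB blocks instead of the default
$ hdfs dfs -D dfs.blocksize=134217728 -put bigfile.txt /data/bigfile.txt
$ # drop the replication factor of that file from 3 to 2
$ hdfs dfs -setrep 2 /data/bigfile.txt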
HDFS in Picture
Examples of HDFS
HDFS: list files
$ hdfs dfs -ls /
HDFS: copy a file to the cluster
$ hdfs dfs -put googlebooks-eng-all-1gram-20120701-b /apps/hive/warehouse/menish.db/ngram/.
HDFS: view a file in the cluster
$ hdfs dfs -cat /apps/hive/warehouse/menish.db/ngram/googlebooks-eng-all-1gram-20120701-a
HDFS is a virtual distributed file system. It exposes file system access similar to a traditional file system. However, each file is split into many parts in the background and distributed across the cluster for reliability and scalability.
In the next article, we will discuss the MapReduce programming model and see how to leverage this data structure and storage paradigm.