MongoDB - Day 16 (Sharding)

Article

Before reading this article, I highly recommend reading the following previous parts of the series:

MongoDB has the main advantage that it can automatically spread data across servers without affecting the performance of the application. Any server can be added or removed without application downtime. A well-established and configured database never becomes offline. In other words, it provides 24x365 services.

Introduction

Sharding is the process of storing data records across multiple machines. As the size of the data increases, a single machine can’t store all the data and is unable to provide an acceptable read and write throughput. To overcome this issue, database systems have two basic approaches, vertical scaling and Sharding.

vertical scaling and Sharding

Vertical Scaling

In vertical scaling the database requires some extra storage device to hold the data, like hard disk, memeory and so on. This technique is known as vertical scaling but this approach increases the load on a single machine and increases the chance of system failover. It is also known as a scaling up approach.

Horizontal Scaling

This is also known as scaling out. In this approach we add a new node (server) to the system such that the entire load is distributed over all the servers. MongoDB uses the Scaling Out approach. MongoDB uses a simple approach to do scaling out. It starts with a single or multiple nodes. If 10,000 new users connect with the application it adds another server.

Sharding or horizontal scaling divides the data set and distributes the data over multiple servers. These multiple server are known as shards. Each shard is an independent database and collectively, the shards make up a single logical database. Sharding distributes data over multiple shards so it reduces the number of operations for each shard. Each shard processes fewer operations since the size of the cluster is increased and as a result the cluster can increase capacity and throughput. In the case of selection of a specific record the application doesn’t access the entire database system, it only accesses the desired shard responsible for that record. Sharding reduces the amount of data that each shard must store. For example, if a database contains 1TB of data and we have 4 shards then each shard will contain approximate 250GB of data . If we have 100 shards then each shard contains approximately 10GB data.

database Sharding

Sharding In MongoDB

MongoDB does Sharding using a sharded cluster. The following diagram shows Sharding in MongoDB using sharded clusters.

A sharded cluster contain 3 components. The following describes these components.

Shards: Shards are used to store the data. Each shard contain a replica set. Shards provide high data availability and data consistency.
Config Servers: Config servers are used to store the metadata of clusters. A sharded cluster has exactly 3 config servers. These 3 config servers contain the mapping of the cluster’s data stored in shards. Config servers help the query router to select the desired shards to do the operations.
Query Routers: Query routers are the Mongo's instance that interfaces with the client applications and does the operations of the appropriate shards. When a query comes to the query router then the query router first uses the metadata (config server) and selects the single or multiple shards and performs the desired operations to shards and then returns the results to the client. A sharded cluster may contain one or multiple query routers depending on the query load. A client can send a request to only one query router. Generally a sharded cluster contains multiple query routers.

Data Partitioning

MongoDB distributes data or shards at the collection level. In other words, the data of a collection may be distributed into several shards. Distribution of a collection’s data is done by a shard key.

Shard Key

A shard key shards a collection and determines the distribution of the collection’s documents among the cluster’s shards. A shard key may be an indexed field or an indexed compound field present in every document of the collection. The shard key is also divided into chunks and these chunks are evenly distributed across the shards. MongoDB uses the following two techniques to divide the shard key into chunks.

Range Based Partitions
Hash Based Partitions

Range Based Sharding

MongoDB uses range-based partitions for range-based Sharding of the collection’s data. In range-based sharding MongoDB divides the shard key into a number of non-overlapping ranges called chunks. Each chunk contains a range of values from minimum values to maximum values.

Range Based Sharding

Hash Based Sharding

MongoDB uses hash-based partitions for hash-based sharding. In a hash-based partition MongoDB computes a hash of the field’s value and then uses this hash to create the chunks in the shard key. A hash-based partition provides a more random distribution of collections in the clusters.

Hash Based Sharding

Balanced Data Distribution

In the process of sharding there is a large possibility that the stored data becomes imbalanced in the cluster. Data imbalance may occur due to any of the following reasons.

Addition/Removing of new data.
Addition/Removing of cluster.
A specific shard contains more chunks than other chunks.
The size of a specific chunk is greater than another chunk's size.

MongoDB always tries to balance the data distribution, for this MongoDB uses the following two approaches.

Splitting
Balancing

Splitting

Splitting is a background process that restricts the chunks from growing too large. When the size of a chunk crosses the value of a specified chunk size, MongoDB splits the chunks into half. In the split process MongoDB does not modify the shards or migrate any data. The splits provide sufficient meta-data change.

Splits data

Balancing

Balancing is a background process. Due to an uneven distribution of sharded collections, the query router runs a balancer process. The balancer process migrates chunks from the shard containing a large number of chunks to the shard that contains the least numbers of chunks. The balancer process can be initiated from any query router. After successful chunk migration the metadata regarding the location of the chunks on the config server is updated.

chunks

In the preceding image we can see that the number of chunks in shard A is just double that of shard B. So this is an imbalance condition. Now the migration of chunks from shard A to shard B will occur.

shard

Now we can see that after migration of chunks from shard A to shard B both shards are in a balanced state.

The following are the advantages of sharding:

Sharding distributes the data over multiple shards so it reduces the number of operations for each shard.
Removes the dependency from a single server.
Protects against system failover.
Increases capacity and throughput.

Today we learned the basic concept of sharding. In the next article we learn about backup in MongoDB.

Thanks for reading this article.