NoSQL
For many decades we have relied on traditional database systems, namely Relational Database Management Systems (RDBMSs). Dr. E. F. Codd is known as the father of the RDBMS. The relational model originated in Codd's 1970 paper "A Relational Model of Data for Large Shared Data Banks", which made data modeling and application programming much easier. RDBMSs are queried with the Structured Query Language (SQL). They are schema-based: the database itself enforces a schema.
Relational databases follow the ACID rules, which ensure that each transaction is atomic, consistent, isolated and durable. The four properties are as follows:
- Atomicity: All the work in a transaction is treated as a single unit. Either all of it is performed or none of it is; if a failure occurs partway through, the operations already performed are rolled back to their former state.
- Consistency: A successfully committed transaction moves the database from one valid state to another. In other words, if the transaction completes successfully then the database is in a new state that reflects the changes; otherwise the database remains in the state it was in at the start of the transaction.
- Isolation: Transactions operate independently of, and transparently to, each other. In other words, if more than one transaction is running at the same time, they do not affect each other.
- Durability: The effect of a committed transaction is saved permanently in the database and persists even after a failure such as a power outage.
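Atomicity can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The accounts table, the account names and the transfer function are assumptions made up for this illustration. The credit is applied first and the debit second, so a failed debit shows the earlier credit being rolled back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects a negative balance here, which
            # aborts the transaction and undoes the credit made above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # fails and leaves no trace
```

After the failed transfer, both balances are exactly what they were after the first, successful one: the partial credit to bob did not survive.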
RDBMSs satisfied the needs of the databases of the time they were introduced. Today, however, the structure of data and the requirements placed on databases change frequently, and data is becoming ever easier to access. Starting in the late 1990s, as the internet became widely used, companies like Google, Amazon and Yahoo found that structured, relational database methods weren't working well for several reasons.
- Amount of Data: Companies and organizations now hold very large amounts of data. Databases have become too large to fit on a single machine, so the data is distributed across many machines. Traditional RDBMSs handle data distributed at this scale inefficiently.
- Number of Users: Companies like Google, Facebook and Yahoo have very many users. At one moment millions of users may be active and at another only thousands. It is difficult to predict the number of active users at a specific time, so with relational technologies it becomes difficult or even impossible to get the dynamic scalability and level of scale they need while also maintaining the performance users demand.
- Unstructured Data: RDBMSs were mainly designed for structured data, but companies and users now generate large amounts of unstructured or semi-structured data. Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. RDBMSs handle unstructured or semi-structured data inefficiently.
- Scaling Technique: Scaling means increasing capacity, either by upgrading the existing hardware or by adding extra hardware, without changing much of the application. RDBMSs mainly rely on the scaling up (vertical) technique. Scaling up means adding resources within the same logical unit, such as adding CPUs or memory to an existing server. In the scaling up approach we mainly use a single server. If this server fails then the entire system becomes unavailable.
To overcome all these problems we require a new type of database. NoSQL emerged to satisfy this requirement.
NoSQL
NoSQL is commonly read as "Not Only SQL". It is an approach to data management and database design that's useful for very large sets of distributed data. NoSQL differs from traditional relational database management systems in some significant ways. It is designed for distributed data stores that need to store data at very large scale. For example, Google uses BigTable, Twitter uses FlockDB and Cassandra was originally developed at Facebook. BigTable, Cassandra and FlockDB are all NoSQL databases.
NoSQL databases are sometimes described as cloud databases, designed for 21st-century web applications. They generally don't require a fixed schema; in other words, they are schema-less. They work on a scaling out approach instead of a scaling up approach. NoSQL databases are designed to solve the scalability, big user and big data performance issues that we encounter in relational databases. They are useful when an organization or enterprise needs to store a massive amount of unstructured data on several remote virtual servers in the cloud. In general, NoSQL databases have become the first alternative to relational databases, with scalability, availability and fault tolerance being key deciding factors.
The following are some important points about NoSQL databases:
- Often open-source
- Does not use the relational model
- Schema-less database
- Runs well on clusters
- Supports unstructured or semi-structured data
- Designed for the cloud
Why NoSQL Database
Let us now look at 10 reasons why NoSQL can be a better choice than SQL.
Big User
Many organizations like Facebook, Google, Yahoo and Twitter have millions of users, but the number of active users is not constant. In other words, sometimes millions of users are active and sometimes only a thousand, so the number of users is constantly changing. Supporting large numbers of concurrent users is important, but because app usage requirements are hard to predict, it's just as important to dynamically support rapidly growing (or shrinking) numbers of concurrent users.
Because the number of active users is so variable, we need a database technology that scales more easily. With the relational database technique, dynamic scalability is hard to achieve. It is also important that the performance of the application is maintained while scaling. NoSQL can be used for this purpose.
Big Data
Big Data is one of the key forces driving the growth and popularity of NoSQL for businesses. Due to the explosive growth in internet usage, huge volumes of data are generated all the time by computers, mobile devices, social apps and machine-to-machine communication. As a simple example, a commercial flight generates approximately 10 GB of data per hour. According to an IDC estimate, the world's digital data amounted to 4.4 zettabytes (4.4 trillion gigabytes) in 2013 and was projected to reach 44 zettabytes by 2020.
We can characterize big data in the following terms.
- High Data Velocity: Data arrives at high velocity from multiple locations.
- High Data Volume: Data comes in gigabytes, terabytes or petabytes.
- High Complexity: Data complexity is very high because data is stored at different locations.
- Data Variety: Data comes in structured, semi-structured and unstructured forms.
Developers therefore want a highly flexible solution that can handle big data. This is hard to do with schema-based relational databases, so NoSQL is a good fit for handling big, schema-less data.
Continuous Data Availability
Today the continuous availability of data is very important. A downtime of even a few seconds can cause a huge loss of business and damage a company's reputation. The best way to avoid this is a distributed approach, in which we remove the dependency on a single machine and spread the data across several machines. If one or more database servers, or "nodes", go down, then the other nodes in the system are able to continue operations without data loss. Because NoSQL databases work on a distributed approach, they are able to provide continuous availability whether in a single location, across data centers or in the cloud.
Dynamic Schema
Relational database systems require a schema to be defined before any data is inserted. For example, if we want to insert information about an employee, such as name, salary and age, then we must first define three columns in a table along with their data types. This approach is not well suited to present-day applications, because we don't know in advance what type of data will come from the user end, or how much. If we later need to change the schema of the database, the change is difficult and creates extra work for the developer. If the database is very large, the change can also cause system downtime. There is also no effective way, using a relational database, to address data that's completely unstructured or unknown in advance.
NoSQL databases are different. They are schema-less, so we are not required to define a schema up front, and we can change the structure of the data without worrying about service interruptions. In other words, NoSQL makes development faster.
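The contrast can be sketched in a few lines of Python. This is only an in-memory illustration of the schema-less idea, not the API of any particular database; the employees collection, the insert and find helpers and the sample records are all hypothetical.

```python
# A hypothetical in-memory "collection": each record is a dict that
# carries its own set of fields, so no table definition is needed.
employees = []

def insert(collection, document):
    """Store a document as-is; nothing has to be declared beforehand."""
    collection.append(dict(document))

def find(collection, **criteria):
    """Return the documents whose fields match all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

insert(employees, {"name": "Asha", "salary": 52000, "age": 31})
# A later record adds fields the first one never had; no migration runs.
insert(employees, {
    "name": "Ravi", "salary": 61000, "age": 40,
    "skills": ["Python", "Go"], "remote": True,
})

remote_staff = find(employees, remote=True)
all_fields = sorted({field for doc in employees for field in doc})
```

Records with different shapes sit side by side, and queries simply skip documents that lack the requested field. The trade-off is that the application, rather than the database, becomes responsible for coping with missing or unexpected fields.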
Integrated Caching
Some products provide a caching tier for relational database systems, but only for reading data, so they improve only read performance and provide no caching for writes. If our application is predominantly read-only then we can use a distributed cache, but if it is predominantly write or read-write then a distributed cache does not help.
Many NoSQL databases have integrated caching for both reads and writes. They keep frequently used data in system memory as much as possible, which eliminates the need for a separate caching layer that must be maintained.
Cloud Computing
Today almost every new application uses the cloud, either directly or indirectly. The cloud may be public, private or hybrid. Cloud applications typically use a three-tier internet architecture. In this architecture the application is accessed through a web browser or a mobile application. A load balancer is responsible for incoming traffic and uses a scaling-out approach to handle it: when traffic increases, a new commodity server is added. Relational databases, however, use a scaling-up approach instead of scaling-out. This makes them a poor fit for applications that require easy and dynamic scalability.
Because NoSQL uses a scaling-out approach and relational databases use a scaling-up approach, NoSQL is a better fit with the highly distributed nature of the three-tier internet architecture.
Scaling Out Approach
Scaling means increasing capacity, either by upgrading the existing hardware or by adding extra hardware, without changing much of the application. Databases need to scale as the number of concurrent users or the volume of data increases. There are two ways to scale, scale up and scale out:
- Scale Up
This is also known as vertical scaling. In the scaling up approach we add resources within the same logical unit to increase capacity, for example by adding a CPU to a single server, increasing its memory or attaching external storage. Relational databases mainly use the scale-up approach. This approach increases the size of the server, and such big servers become highly complex and expensive. It also has the big disadvantage that if the server fails then the entire system becomes unavailable.
- Scale Out
This is also known as horizontal scaling. In this approach we add new nodes (servers) to the system so that the load is distributed over all servers. NoSQL databases use the scaling out approach, and they do so in a simple way. The system starts with one or more nodes. If, say, 10,000 new users connect to an application, another server is added. NoSQL uses a cluster of standard physical or virtual servers to store data and support database operations. When a new server (node) is connected to the cluster, data and database operations are spread across the entire cluster.
Replication
Data replication means keeping copies of data on several, preferably geographically distributed, nodes of a system through a non-interactive, reliable process. In traditional RDBMSs, implementing replication is often a struggle because these systems were not developed with horizontal scaling in mind. Most NoSQL databases support automatic replication. In other words, we get high availability of data and disaster recovery without adding any external applications.
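The idea of surviving a node failure through replication can be sketched as follows. The Node and ReplicatedStore classes and the node names are hypothetical, and the sketch deliberately ignores what real systems must also handle, such as catching up a node that was offline during a write.

```python
class Node:
    """A hypothetical storage node holding a full copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.online = True

class ReplicatedStore:
    def __init__(self, nodes):
        self.nodes = nodes

    def put(self, key, value):
        """Write to every reachable replica; return how many accepted it."""
        written = 0
        for node in self.nodes:
            if node.online:
                node.data[key] = value
                written += 1
        return written

    def get(self, key):
        """Read from the first reachable replica that has the key."""
        for node in self.nodes:
            if node.online and key in node.data:
                return node.data[key]
        raise KeyError(key)

nodes = [Node("eu-west"), Node("us-east"), Node("ap-south")]
store = ReplicatedStore(nodes)
copies = store.put("user:42", {"name": "Asha"})  # stored on all three nodes
nodes[0].online = False                          # simulate losing one node
value = store.get("user:42")                     # served by a surviving copy
```

Because every write lands on several nodes, a read still succeeds after one of them goes down. Only when all replicas are unreachable does the lookup fail.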
Auto Sharding
A main advantage of NoSQL is that data is spread across servers automatically without affecting the performance of the application. Servers can be added or removed without application downtime. A well-configured NoSQL database is designed to stay online continuously; in other words, it can provide round-the-clock service.
Sharding in relational databases can reduce the capacity to perform complex queries. NoSQL databases are designed to retain their query capabilities even when the system contains hundreds of servers.
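One common way to spread keys across servers so that nodes can be added with little data movement is consistent hashing. The sketch below is an assumption about how such routing might look, not a description of any specific product; the HashRing class, the node names and the number of virtual nodes are made up for illustration.

```python
import bisect
import hashlib

def ring_hash(value):
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual positions per node, for even spread
        self._positions = []   # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place the node at `vnodes` pseudo-random positions on the ring."""
        for i in range(self.vnodes):
            pos = ring_hash(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def node_for(self, key):
        """Return the node owning the first position at or after the key."""
        idx = bisect.bisect_left(self._positions, ring_hash(key))
        idx %= len(self._positions)  # wrap around the end of the ring
        return self._owner[self._positions[idx]]

keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")  # scale out by one server
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

When node-d joins, the only keys that change owner are the ones that move to node-d; everything else stays where it was, which is what allows a cluster to grow without reshuffling all of its data.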
The Internet of Things
Machine-generated data has a big presence in big data. The Internet of Things (IoT) extends internet connectivity beyond traditional devices like desktops, laptops, smartphones and tablets to a diverse range of devices and everyday things that use embedded technology to communicate and interact with the external environment, all via the internet. Some examples of IoT devices are fire extinguishers in buildings, connected security systems, thermostats, cars, electronic appliances and lights in household and commercial environments. All these devices are connected to the internet and generate data continuously. Approximately 20 billion devices are connected to the internet, and these devices generate data from their 50 billion sensors.
Data generated by the IoT is mainly semi-structured or unstructured, which poses a challenge for relational databases because they work on a fixed schema and structured data.
To overcome these problems, developers use NoSQL databases to store the data and improve performance.
Types of NoSQL Database
NoSQL databases can be categorized into four types, each with its own specific attributes.
Key-Value Database
This is the simplest type of NoSQL database. Each item is stored under a key, and the user can look up or delete data using that key. The key acts like a primary key: it cannot be duplicated. Examples of this type of database are Azure Table Storage (ATS), DynamoDB, Riak and Redis.
Let us see an example.
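A minimal Python sketch of the key-value model follows. The KeyValueStore class, the key format and the stored values are hypothetical; real products such as Redis or Riak expose their own APIs.

```python
class KeyValueStore:
    """A hypothetical in-memory key-value store; values are opaque to it."""
    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # Keys are unique: writing an existing key replaces its value.
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        """Remove the key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False

store = KeyValueStore()
store.put("session:9f3a", {"user": "asha", "cart": ["book", "pen"]})
store.put("session:9f3a", {"user": "asha", "cart": ["book"]})  # overwrite
current_cart = store.get("session:9f3a")["cart"]
```

Every operation goes through the key. The store never looks inside the value, which keeps the model simple and fast but means there is no way to query by the contents of a value.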
Column Type Database
In column type databases, also known as wide-column stores, data is stored as sections of columns rather than as rows. Column type databases offer very high performance and a highly scalable architecture. Examples are Cassandra and HBase.
The preceding data will be stored in the following form.
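As a minimal sketch of the idea, using made-up records (the names and values below are illustrative assumptions), the following Python pivots row-oriented records into one list per column:

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Asha", "city": "Pune", "age": 30},
    {"id": 2, "name": "Ravi", "city": "Delhi", "age": 40},
    {"id": 3, "name": "Mira", "city": "Pune", "age": 26},
]

def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

columns = to_columns(rows)
# An aggregate over one column reads only that column's values.
average_age = sum(columns["age"]) / len(columns["age"])
```

This is a simplification. Real wide-column stores such as Cassandra and HBase group columns into families and locate data by row key, but the sketch shows why reading a single attribute across many records is cheap when values are stored by column.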
Graph Database
These databases are designed for data whose relations can be represented as a graph: elements that are interconnected, with an undetermined number of relations among them. Graph databases are used to store information about networks, such as social connections. Examples are Neo4j and HyperGraphDB.
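The kind of query a graph model makes natural, such as finding friends of friends, can be sketched with an adjacency structure and a breadth-first search. The friends graph and the within_hops function are hypothetical; graph databases provide their own query languages for this.

```python
from collections import deque

# Hypothetical social graph: person -> set of direct connections.
friends = {
    "asha": {"ravi", "mira"},
    "ravi": {"asha", "john"},
    "mira": {"asha"},
    "john": {"ravi", "lena"},
    "lena": {"john"},
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable person to their distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(person, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[person] + 1
                queue.append(neighbour)
    del distance[start]  # the starting person is not their own connection
    return distance

friends_of_friends = within_hops(friends, "asha", 2)
```

The traversal follows relationships directly from node to node. In a relational schema the same question would typically need one self-join per hop.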
Document Database
In a document database the data is stored as documents. Each document has a unique key and can contain many key-value pairs, key-array pairs or even nested documents. These databases are designed for storing, retrieving and managing document-oriented information, also known as semi-structured data. Examples are MongoDB and CouchDB.
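A document of this kind can be sketched as a nested Python dict serialized to JSON. The order document, its field names and the get_path helper are made up for illustration and do not reflect the API of MongoDB or CouchDB.

```python
import json

order = {
    "_id": "order-1001",                           # unique document key
    "customer": {"name": "Asha", "city": "Pune"},  # nested document
    "items": [                                     # key-array pair
        {"sku": "BK-1", "qty": 2, "price": 12.5},
        {"sku": "PN-7", "qty": 1, "price": 3.0},
    ],
    "paid": True,                                  # plain key-value pair
}

def get_path(document, path):
    """Follow a dotted path such as 'customer.city' through nested dicts."""
    value = document
    for part in path.split("."):
        value = value[part]
    return value

serialized = json.dumps(order)     # documents are often exchanged as JSON
restored = json.loads(serialized)
city = get_path(restored, "customer.city")
total = sum(item["qty"] * item["price"] for item in restored["items"])
```

Everything about the order travels in one self-describing unit, so it can be read back without joining several tables.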
The following table provides basic information about the properties of each database type.
The following table provides some names of the various types of databases.
Disadvantages of NoSQL
NoSQL has the following disadvantages.
- Narrow Focus: NoSQL databases have a very narrow focus. They are mainly designed for storage and provide comparatively little other functionality. In the field of transaction management, relational databases are a better choice than NoSQL.
- Open-source: Many NoSQL databases are open-source. That is their greatest strength, but it also has some cons: there is no reliable standard for NoSQL yet. In other words, two database systems are unlikely to be equivalent.
- Management Challenge: The purpose of big data tools is to make management of a large amount of data as simple as possible, but it is not so easy. Data management in NoSQL is much more complex than in a relational database. NoSQL, in particular, has a reputation for being challenging to install and even harder to manage on a daily basis.
- GUI is not Available: GUI tools for accessing the database are not widely available in the market.
- Backup: Backup has been a weak point for some NoSQL databases. With MongoDB, for example, taking a consistent backup of a large, distributed deployment requires more care than with a typical relational database.
- Large Document Size: Some database systems like MongoDB and CouchDB store data in JSON or a JSON-like format. Because every document carries its own field names, documents can be quite large (which matters for big data, network bandwidth and speed), and descriptive key names make this worse, since they increase the document size.
Difference between SQL and NoSQL
Now we consider some major differences between SQL and NoSQL databases.
Database Types
- SQL: SQL databases are a single type; individual RDBMS products differ only in minor ways.
- NoSQL: There are four types: key-value, column-based, document and graph databases.
Schemas
- SQL: SQL is a schema-based database. The structure and data types must be predefined before inserting any data. If we want to change the data type of any field then the database must be altered.
- NoSQL: NoSQL is a schema-less database. In other words we are not required to define the structure. Dissimilar data can be stored together.
Examples
- SQL: SQL Server, MySQL, Oracle, PostgreSQL and SQL Server Compact.
- NoSQL: MongoDB, CouchDB, Cassandra, Neo4j and HBase.
Scaling
- SQL: SQL works on a scaling up (vertical) technique. In a scaling up approach we add resources within the same logical unit to increase the capacity. For example add a CPU to a single server or add (increase) memory or add some external storage devices to increase storage capacity.
- NoSQL: NoSQL works on a scaling out approach. In this approach we add a new node (server) to the system such that an entire load is distributed over all the servers.
Data Manipulation
- SQL: Data manipulation is done using SQL commands such as SELECT, UPDATE and DELETE, for example DELETE FROM and SELECT *.
- NoSQL: Data manipulation is done through object-oriented APIs, for example db.collection_name.find() and db.collection_name.remove() in MongoDB.
Transaction Management
- SQL: SQL supports transaction management. Either the entire query set is submitted successfully or not at all.
- NoSQL: NoSQL supports transaction management in certain circumstances and at certain levels (for example, document level vs. database level).
Development Model
- SQL: Some RDBMSs are open-source, like PostgreSQL and MySQL, and some are closed-source, like SQL Server and Oracle.
- NoSQL: Many NoSQL databases are open-source, although some, such as DynamoDB and Azure Table Storage, are proprietary cloud services.
Consistency
- SQL: SQL databases provide a high level of consistency.
- NoSQL: It depends on the product. Some systems offer strong consistency (for example MongoDB, by default, when reading from the primary), while others offer eventual consistency (for example Cassandra, whose consistency level is tunable).
Replication
- SQL: Replication is a difficult task for SQL databases because these systems were not developed with horizontal scaling in mind.
- NoSQL: Most NoSQL databases support automatic replication.
Conclusion
This does not mean that NoSQL is replacing relational databases. NoSQL was developed to handle huge amounts of data that relational databases cannot handle well. We are entering an era of polyglot persistence, a technique that uses various data storage technologies to handle varying data storage needs. NoSQL is a rising technology in the field of databases. It still requires large improvements, but it is making it easier to handle the huge amounts of data being generated every second.