Before reading this article, I highly recommend reading the following previous parts of the series:
Introduction
Aggregation functions performs operations on groups of documents and returns the computed result. Aggregation functions mainly group the data of multiple documents and performs various operation on the grouped data and returns a single or multiple results.
Types of aggregate functions
MongoDB performs aggregate operations in one of the following three ways.
- Single-purpose aggregate methods and commands.
- Pipeline.
- Map Reduce
First, we will create a “Demo” collection and insert the following data into that collection.
- {
- "_id": ObjectId("55dc6c94b32228b6ef8753c0"),
- "Name": "Pankaj Choudhary",
- "Age": 21,
- "Salary": 25000
- }
- {
- "_id": ObjectId("55dc6ca8b32228b6ef8753c1"),
- "Name": "Sandeep Jangid",
- "Age": 22,
- "Salary": 27000
- }
- {
- "_id": ObjectId("55dc6cb9b32228b6ef8753c2"),
- "Name": "Rahul Prajapat",
- "Age": 23,
- "Salary": 37000
- }
- {
- "_id": ObjectId("55dc6ccab32228b6ef8753c3"),
- "Name": "Sanjeev Baldia",
- "Age": 22,
- "Salary": 28000
- }
- {
- "_id": ObjectId("55dc6cdcb32228b6ef8753c4"),
- "Name": "Narendra Sharma",
- "Age": 25,
- "Salary": 25000
- }
- {
- "_id": ObjectId("55dc6cf5b32228b6ef8753c5"),
- "Name": "Nitin Yadav",
- "Age": 28,
- "Salary": 35000
- }
- {
- "_id": ObjectId("55dc6d09b32228b6ef8753c6"),
- "Name": "Omveer Choudhary",
- "Age": 32,
- "Salary": 37000
- }
Now first we read about single-purpose aggregates.
Single-Purpose aggregate methods and commandsAccording to its name single-purpose aggregate methods are used to do the specific aggregation operations on set of data. Single-purpose aggregate methods are less complex but with limited scope compared to pipeline and map reduce operations. Single-purpose aggregate methods and commands provide straightforward semantics for common data processing options.
The following are some important single-purpose operations.
Count
The Count operation takes a number of documents and depending on the match query returns the count of the documents. In MongoDB the count command and cursor.count() method are used to do the count operations.
Example 1
Output
7In this example Demo.count() is a cursor method that returns the count of all the documents present in the Demo collection.
Example 2
- db.runCommand({count:'Demo',query:{Salary:{$gt:30000}}})
Output{ "n" : 3, "ok" : 1 }In this example the count command takes all the documents in which the value of the Salary field is greater than 30000 and returns the count.
Distinct
The Distinct operation takes a document and depending on the match query returns the unique values for a field. In MongoDB the cursor.distinct() method and the distinct command performs the distinct operations.
Example 1
- db.Demo.distinct("Salary")
Output[ 25000, 27000, 37000, 28000, 35000 ]In this example we use the distinct cursor method that return the distinct value for the “Salary” field.
Example 2
- db.runCommand({distinct:'Demo',key:"Salary",query:{Salary:{$gt:30000}}})
Output
- {
- "values":
- [
- 37000,
- 35000
- ],
- "stats":
- {
- "n": 3,
- "nscanned": 0,
- "nscannedObjects": 7,
- "timems": 598,
- "planSummary": "COLLSCAN"
- },
- "ok": 1
- }
In this example the distinct command returned the distinct values for the “Salary” field where the value of the “Salary” field is greater than 30000.
Group
The Group operation takes a number of documents and depending on the match query creates a group of fields grouped by their value and finally returns an array of documents with the computed result for each group. In MongoDB the group command and the cursor.group() methods do the group operations.
Example 1
- db.Demo.group({key:{Age:1},reduce:function(cur,result){result.Salary+=cur.Salary},initial:{Salary:0}})
Output
- [
- {
- "Age": 21,
- "Salary": 25000
- },
- {
- "Age": 22,
- "Salary": 55000
- },
- {
- "Age": 23,
- "Salary": 37000
- },
- {
- "Age": 25,
- "Salary": 25000
- },
- {
- "Age": 28,
- "Salary": 35000
- },
- {
- "Age": 32,
- "Salary": 37000
- }]
In this example we used the group cursor method. This method groups the documents on the basis of the value of the “Age” field and computes the sum of Salary for each group.
Example 2
- db.runCommand({group:{ns:'Demo',key:{Salary:1},cond:{Salary:{$lt:30000}},$reduce:function(cur,resu
- lt){result.Age+=cur.Age},initial:{Age:10}}})
Output
- {
- "retval":
- [
- {
- "Salary": 25000,
- "Age": 56
- }, {
- "Salary": 27000,
- "Age": 32
- }, {
- "Salary": 28000,
- "Age": 32
- }
- ],
- "count": NumberLong(4),
- "keys": NumberLong(3),
- "ok": 1
- }
In this example we use the group command. This command groups the documents on the basis of the value of the Salary field and computes the sum of the Age field for each group.
Aggregate Pipeline
Pipeline means the possibility to execute an operation on some input and use the output as the input for the next command and so on. An aggregate pipeline is a framework modeled on the concept of data processing pipelines. In aggregate pipelines documents enter a multi-stage pipeline that transforms the documents into an aggregated result. The pipeline operation use the match query to fetch the exact documents and grouping to generate the group of documents. Pipeline operations provide tools for sorting documents by specified fields. A pipeline operation contains many stages, like filters that operate like queries, transformation of documents that modify the form of output. Mainly aggregate commands operate on a single collection and l pass the entire collection into the aggregation pipeline.
Aggregate pipelines are an alternative to map-reduce. It may be an appropriate choice for aggregate operations because map-reduce are unnecessarily complex. But pipelines have some limited behavior on values types and result size.
Syntaxdb.Collection_Name.aggregate({Pipeline expression})ParameterPipeline expression: Pipeline expression specifies the transformation to apply on the input documents. Pipeline expression only operates on the current documents of the pipeline stage.
In an aggregate pipeline, documents pass through many stages and each stage transforms the documents into another form. It is not recommended that each pipeline stage produce a document with respect to each documents. Some stages might reduce some documents and might generate new documents.
A pipeline aggregate contains a number of stages, each stage takes some documents as input and does operations on these documents and generates output. The following are the possible stages in an aggregate pipeline:
Stage |
Description |
$project |
Select specific fields from a collection |
$match |
Specifies the selection criteria, to reduce the amount of documents |
$group |
Used to divide the documents into various groups |
$sort |
Sort the documents |
$skip |
Used to skip a number of documents |
$limit |
Defines the number of documents in an output result |
$unwind |
Unwinds the documents using arrays |
$redact |
Reshapes each document by restricting the content for each document |
$out |
It is the last stage of a pipeline that writes the resulting documents of the aggregate pipeline to a collection. |
Let us see an example to understand the concept of aggregate pipelines.
Example 1
- db.Demo.aggregate([{$group:{_id:"$Age",Salary_Is:{$sum:"$Salary"}}}])
Output
- {
- "_id": 32,
- "Salary_Is": 37000
- }
- {
- "_id": 25,
- "Salary_Is": 25000
- }
- {
- "_id": 23,
- "Salary_Is": 37000
- }
- {
- "_id": 22,
- "Salary_Is": 55000
- }
- {
- "_id": 28,
- "Salary_Is": 35000
- }
- {
- "_id": 21,
- "Salary_Is": 25000
- }
In this example we group the documents by the field “Age” and for each group we calculate the sum of Salary. The following is the equivalent SQL query for this example.
Select Age,sum(Salary) from Demo group by Age.
Example 2
- db.Demo.aggregate
- ([{
- $group: {
- _id: "$Age",
- Sum_Salary: {
- $sum: "$Salary"
- },
- Avg_Salary: {
- $avg: "$Salary"
- },
- Min
- _Salary: {
- $min: "$Salary"
- },
- Max_Salary: {
- $max: "$Salary"
- }
- }
- }])
Output
- {
- "_id": 32,
- "Sum_Salary": 37000,
- "Avg_Salary": 37000,
- "Min_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "_id": 25,
- "Sum_Salary": 25000,
- "Avg_Salary": 25000,
- "Min_Salary": 25000,
- "Max_Salary": 25000
- }
- {
- "_id": 23,
- "Sum_Salary": 37000,
- "Avg_Salary": 37000,
- "Min_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "_id": 22,
- "Sum_Salary": 55000,
- "Avg_Salary": 27500,
- "Min_Salary": 27000,
- "Max_Salary": 28000
- }
- {
- "_id": 28,
- "Sum_Salary": 35000,
- "Avg_Salary": 35000,
- "Min_Salary": 35000,
- "Max_Salary": 35000
- }
- {
- "_id": 21,
- "Sum_Salary": 25000,
- "Avg_Salary": 25000,
- "Min_Salary": 25000,
- "Max_Salary": 25000
- }
In this example we group documents by “Age” field and calculate the sum, average, minimum and maximum salary. The following is the equivalent query in SQL.
Select Age,sum(Salary), avg(Salary), max(Salary) , min(Salary) from Demo group by Age.Example 3
- db.Demo.aggregate
- ([{
- $group:
- {
- _id: "$Age",
- Sum_Salary:
- {
- $sum: "$Salary"
- },
- Avg_Salary:
- {
- $avg: "$Salary"
- },
- Min
- _Salary:
- {
- $min: "$Salary"
- },
- Max_Salary:
- {
- $max: "$Salary"
- }
- }
- },
- {
- $match:
- {
- _id:
- {
- $gt: 21,
- $lt: 27
- }
- }
- }])
Output
- {
- "_id": 25,
- "Sum_Salary": 25000,
- "Avg_Salary": 25000,
- "Min_Salary": 25000,
- "Max_Salary": 25000
- }
- {
- "_id": 23,
- "Sum_Salary": 37000,
- "Avg_Salary": 37000,
- "Min_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "_id": 22,
- "Sum_Salary": 55000,
- "Avg_Salary": 27500,
- "Min_Salary": 27000,
- "Max_Salary": 28000
- }
In this example, documents are passed through two stages. The first stage is “group”. In that stage we group documents by the “Age” field and calculate the sum, average, minimum and maximum salary. The second stage is “match”, that stage filters all the documents depending on the value of the Age field.
The following query also generate the same output
- db.Demo.aggregate
- ([{
- $match:
- {
- Age:
- {
- $gt: 21,
- $lt: 27
- }
- }
- }, {
- $group:
- {
- _id: "$Age",
- Sum_Salary:
- {
- $sum: "$Salary"
- },
- Avg_Salary:
- {
- $avg: "$Salary"
- },
- Min_Salary:
- {
- $min: "$Salary"
- },
- Max_Salary:
- {
- $max: "$Salary"
- }
- }
- }])
The following is the equivalent query in SQL.
Select Age,sum(Salary), avg(Salary), max(Salary) , min(Salary) from Demo Where Age>21 And Age<27 group by Age Example 4
- db.Demo.aggregate
- ([{
- $limit: 3
- },
- {
- $group:
- {
- _id: "$Age",
- Sum_Salary:
- {
- $sum: "$Salary"
- },
- Avg_Salary:
- {
- $avg: "
- $Salary "
- },
- Min_Salary:{$min:"
- $Salary "},Max_Salary:{$max:"
- $Salary "}
- }
- }])
Output
- {
- "_id": 23,
- "Sum_Salary": 37000,
- "Avg_Salary": 37000,
- "Min_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "_id": 22,
- "Sum_Salary": 27000,
- "Avg_Salary": 27000,
- "Min_Salary": 27000,
- "Max_Salary": 27000
- }
- {
- "_id": 21,
- "Sum_Salary": 25000,
- "Avg_Salary": 25000,
- "Min_Salary": 25000,
- "Max_Salary": 25000
- }
In this example documents are passed through two stages. The first stage is “limit” that selects the top 3 documents from the collection and pass these documents to the second stage. The second stage groups the documents by the “Age” field and calculates the sum, average, minimum and maximum salary.
Important Point
The order of the stages in the aggregate pipeline is very important. Like db.Demo.aggregate($limit,$group) and db.Demo.aggregate($group,$limit) the methods don’t provide the same result.
Now we will execute the previous method but interchange the order of the “limit” and “group” stages.
- db.Demo.aggregate
- ([{
- $group:
- {
- _id: "$Age",
- Sum_Salary:
- {
- $sum: "$Salary"
- },
- Avg_Salary:
- {
- $avg: "$Salary"
- },
- M
- in_Salary:
- {
- $min: "$Salary"
- },
- Max_Salary:
- {
- $max: "$Salary"
- }
- }
- }, {
- $limit: 3
- }])
Output
- {
- "_id": 32,
- "Sum_Salary": 37000,
- "Avg_Salary": 37000,
- "Min_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "_id": 25,
- "Sum_Salary": 25000,
- "Avg_Salary": 25000,
- "Min_Salary": 25000,
- "Max_Salary": 25000
- }
- {
- "_id": 23,
- "Sum_Salary": 37000,
- "Avg_Salary": 37000,
- "Min_Salary": 37000,
- "Max_Salary": 37000
- }
We can see that the result of this method and previous method is not the same, so the selection of the order of the stages is very important. This selection may change the expected result.
Example 5
- db.Demo.aggregate
- ([{
- $group:
- {
- _id: "$Age",
- Sum_Salary: {
- $sum: "$Salary"
- },
- Avg_Salary: {
- $avg: "$Salary"
- },
- M
- in_Salary: {
- $min: "$Salary"
- },
- Max_Salary: {
- $max: "$Salary"
- }
- }
- }, {
- $project: {
- Sum_Salary: 1,
- Max_Salary: 1,
- _id: 0
- }
- }, {
- $limit: 4
- }])
Output
- {
- "Sum_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "Sum_Salary": 25000,
- "Max_Salary": 25000
- }
- {
- "Sum_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "Sum_Salary": 55000,
- "Max_Salary": 28000
- }
In this example documents passed through the following three stages. The first stage is “group” as described in the previous examples. The second stage is “project”. In that stage we only select the Sum_Salary and Max_Salary fields to display. The third stage is “limit”, in that stage we select the top 4 documents from the result of the previous stage (project).
Example 6
- db.Demo.aggregate([{
- $group: {
- _id: "$Age",
- Sum_Salary: {
- $sum: "$Salary"
- },
- Avg_Salary: {
- $avg: "$Salary"
- },
- M
- in_Salary: {
- $min: "$Salary"
- },
- Max_Salary: {
- $max: "$Salary"
- }
- }
- }, {
- $project: {
- Sum_Salary: 1,
- Max_Salary: 1,
- _id: 0
- }
- }, {
- $limit: 4
- }, {
- $skip: 2
- }])
Output
- {
- "Sum_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "Sum_Salary": 55000,
- "Max_Salary": 28000
- }
This example is the same as the previous example but we add an extra stage, “skip”. So the result of the “limit” stage is passed to the “skip” stage and this stage removes the top 2 documents from the collection and prints the remaining documents.
Example 7
- db.Demo.aggregate([{
- $group: {
- _id: "$Age",
- Sum_Salary: {
- $sum: "$Salary"
- },
- Avg_Salary: {
- $avg: "$Salary"
- },
- M
- in_Salary: {
- $min: "$Salary"
- },
- Max_Salary: {
- $max: "$Salary"
- }
- }
- }, {
- $project: {
- Sum_Salary: 1,
- Max_Salary: 1,
- _id: 0
- }
- }, {
- $limit: 4
- }, {
- $sort: {
- Sum_Salary: 1
- }
- }])
Output
- {
- "Sum_Salary": 25000,
- "Max_Salary": 25000
- }
- {
- "Sum_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "Sum_Salary": 37000,
- "Max_Salary": 37000
- }
- {
- "Sum_Salary": 55000,
- "Max_Salary": 28000
- }
This example is similar to Example 6 but in this example we use an extra stage, “sort". That stage retrieves the documents from the output of the “limit” stage and sorts the documents depending on the value of the Sum_Salary field in ascending order.
Example 8
- db.Demo.aggregate([{
- $group: {
- _id: "$Age",
- Sum_Salary: {
- $sum: "$Salary"
- },
- Avg_Salary: {
- $avg: "$Salary"
- },
- M
- in_Salary: {
- $min: "$Salary"
- },
- Max_Salary: {
- $max: "$Salary"
- }
- }
- }, {
- $project: {
- Sum_Salary: 1,
- Max_Salary: 1,
- _id: 0
- }
- }, {
- $redact: {
- $cond: {
- if: {
- $gt: ["$Sum_Salary", 30000]
- },
- then: "$$PRUNE",
- else: "$$DESCEND"
- }
- }
- }])
Output
- {
- "Sum_Salary": 25000,
- "Max_Salary": 25000
- }
- {
- "Sum_Salary": 25000,
- "Max_Salary": 25000
- }
In this example we use a redact stage. This stage restricts the contents of the documents based on information stored in the documents themselves. In the preceding $redact stage we use two system variables. The first variable is ”
$$PRUNE”, this variable excludes all the fields present in the current document or embedded at the document level . The second variable is “
$$DESCEND” that returns all the fields of the current documents level, excluding embedded level documents. So the preceding query excludes all the documents where the value of “Sum_Salary” is greater than 30000.
Map-ReduceMap-Reduce is another way to do aggregation. Map-Reduce is a combination of two operations. The first part of Map-Reduce is “Map”, that processes each document and retrieves one or more objects for each input document. The second part of Map-Reduce is “Reduce” that combines the result of the Map operations. Map-Reduce handles large amounts of data into aggregate results.
How Map-Reduce operation Works
For Map-Reduce operations MongoDB provides the mapReduce database command. In a Map-Reduce operation MongoDB first applies the map operation on all the documents that match the query condition. The result of the map operation generates the key-value pairs. For those keys that have more than one value, MongoDB applies reduce operations that collect and condense the aggregate result. Then MongoDB stores the results in a collection. Map-Reduce also contains an optional function, finalize. We can pass the output of the reduce operation to the finalize function to further condense the results of the aggregate operation.
MongoDB uses a JavaScript function in map-reduce aggregation. Map-Reduce aggregates may return the documents as output or may write the result to a collection. Map-Reduce uses the JavaScript functions that provide high flexibility.
Let us look at some example for Map-Reduce aggregations.
Example 1
To perform the mapReduce function for any collection, we must perform the following three steps. Now we will understand these procedures using the “Demo” collection.
Step 1
Define the map function.
- function()
- {
- emit(this.Age,this.Salary);
- } ;
In the function “this” defines the current collection name. This function maps the Salary to Age for each document and emits the Age and Salary pairs.
Step 2
Define the reduce function.
- function(key,values)
- {
- return Array.sum(values);
- }
Define the reduce function with the two arguments, key and values. The values parameter is an array type that is emitted by the map function and contains salary values and grouped by Age. This function reduces the values array to a sum of its elements.
Step 3
Output Collection.
This operation stores the result in the “My_Coll” collection. If the collection already exists then the contents of the collection will replace the contents of the mapReduce function.
The combined form of these three steps is:
- db.Demo.mapReduce(function()
- {
- emit(this.Age, this.Salary);
- },
- function(key, values)
- {
- return Array.sum(v alues);
- },
- {
- query:
- {
- Age:
- {
- $gt: 20
- }
- },
- out: "My_Coll"
- })
So when we execute the preceding query MongoDB returns the following outputs as acknowledgement.
- {
- "result" : "My_Coll",
- "timeMillis" : 4,
- "counts" :
- {
- "input" : 7,
- "emit" : 7,
- "reduce" : 1,
- "output" : 6
- },
- "ok" : 1
- }
The preceding result defines a number of input documents, number of emit documents, number of reduce documents, number of outputs and time to execute the mapReduce command.
The result of the preceding query will be stored in the “My_Coll” collection so we can use the find() method to retrieve the result from the “My_Coll” collection.
- db.My_Coll.find().pretty()
Output
- {
- "_id": 21,
- "value": 25000
- }
- {
- "_id": 22,
- "value": 55000
- }
- {
- "_id": 23,
- "value": 37000
- }
- {
- "_id": 25,
- "value": 25000
- }
- {
- "_id": 28,
- "value": 35000
- }
- {
- "_id": 32,
- "value": 37000
- }
Example 2
Step 1
- var map_func = function()
- {
- emit(this.Age, this.Salary);
- };
Step 2
- var reduce_func = function(key, values)
- {
- return Array.sum(values);
- };
Step 3
- db.Demo.mapReduce(map_func, reduce_func,
- {
- query:
- {
- Age:
- {
- $gt: 23
- }
- },
- out: "My_Coll"
- })
Output
- {
- "result": "My_Coll",
- "timeMillis": 4,
- "counts":
- {
- "input": 3,
- "emit": 3,
- "reduce": 0,
- "output": 3
- },
- "ok": 1
- }
The following is the data of My_Coll collection:
- db.My_Coll.find().pretty()
Output
- {
- "_id": 25,
- "value": 25000
- }
- {
- "_id": 28,
- "value": 35000
- }
- {
- "_id": 32,
- "value": 37000
- }
In this example we will define the map and reduce functions in two variables and use both variables in the mapReduce function. We also define the query for collections that define the selection criteria.
Example
- db.Demo.mapReduce(
- function()
- {
- emit(this.Age, this.Salary);
- },
- function(key, values)
- {
- return Array.avg(v alues);
- }, {
- query:
- {
- Age:
- {
- $gt: 20
- },
- Salary:
- {
- $gte: 24000
- }
- },
- out: "My_Coll",
- limit: 4,
- sort:
- {
- Salary: 1
- }
- })
Output
- {
- "result": "My_Coll",
- "timeMillis": 5,
- "counts":
- {
- "input": 4,
- "emit": 4,
- "reduce": 1,
- "output": 3
- },
- "ok": 1
- }
- }
The following is the data of My_Coll collection:
- db.My_Coll.find().pretty()
Output
- {
- "_id": 21,
- "value": 25000
- }
- {
- "_id": 22,
- "value": 27500
- }
- {
- "_id": 25,
- "value": 25000
- }
Today we read the aggregation operation in MongoDB. Aggregation is an important part of MongoDB. That computes the records and filters out the results.
Thanks for reading this article.