In this article, I present an experiment that runs the same query against two different designs of a MongoDB collection, with one design requiring a significantly smaller workload than the other. In other words, the query is exactly the same, yet applying some optimization techniques in the database makes a significant difference: the system went from scanning 1,000,000 documents to only 2,561 documents to complete the same task. The article also notes that understanding the business use case is the key.
To replicate this experiment, you only need basic knowledge of MongoDB tools such as MongoDB Shell or MongoDB Compass to connect your machine to a MongoDB deployment; running MongoDB locally is also fine.
Data simulation
First of all, we need some data, so I wrote the code below to simulate a collection of 1,000,000 customers. Running the simulation code creates a collection named "customers" containing 1,000,000 documents with the fields name, age, address, email, phone number, and profession.
// create a function to generate one random customer document
function generateRandomCustomer() {
  const specificNames = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"];
  const professions = ["B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8"];
  const randomName = specificNames[Math.floor(Math.random() * specificNames.length)];
  const randomAge = Math.floor(Math.random() * 50 + 20);
  const randomAddress = "Address" + Math.floor(Math.random() * 1000);
  const randomEmail = randomName.toLowerCase() + "@email.com";
  const randomPhoneNumber = "123-456-789";
  const randomProfession = professions[Math.floor(Math.random() * professions.length)];
  return {
    name: randomName,
    age: randomAge,
    address: randomAddress,
    email: randomEmail,
    phoneNumber: randomPhoneNumber,
    profession: randomProfession
  };
}

// simulate 1,000,000 customers
const countDocuments = 1000000;
const randomCustomers = [];
for (let i = 0; i < countDocuments; i++) {
  randomCustomers.push(generateRandomCustomer());
}

// insert the customers into the database in batches of 10,000
const batchSize = 10000;
for (let i = 0; i < randomCustomers.length; i += batchSize) {
  const batch = randomCustomers.slice(i, i + batchSize);
  db.customers.insertMany(batch);
}
One note about the batch size in the simulation code: to insert a large amount of data efficiently, we divide the whole dataset into smaller batches, and I chose a batch size of 10,000 for this experiment.
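If holding all 1,000,000 generated documents in memory at once ever becomes a concern, a small variation is to generate each batch right before inserting it. The sketch below is just one way to do that, reusing the generateRandomCustomer() function defined above; the variable names totalDocuments and altBatchSize are mine, not part of the original setup.
// alternative sketch: generate and insert one batch at a time,
// so the full 1,000,000-document array never sits in memory at once
const totalDocuments = 1000000;
const altBatchSize = 10000;
for (let i = 0; i < totalDocuments; i += altBatchSize) {
  const batch = [];
  for (let j = 0; j < altBatchSize; j++) {
    batch.push(generateRandomCustomer());
  }
  db.customers.insertMany(batch);
}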
Overview of the data
Use the following command in MongoDB Shell to view an example document.
db.customers.findOne()
{
  _id: ObjectId('6684b910294f269e10fd8498'),
  name: 'A7',
  age: 29,
  address: 'Address438',
  email: 'a7@email.com',
  phoneNumber: '123-456-789',
  profession: 'B3'
}
Check the number of documents
db.customers.countDocuments()
1000000
This confirms that we have a collection of 1,000,000 customers with their information.
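If you want an extra sanity check on the simulated data, an aggregation such as the one below (an optional step, not part of the original setup) groups the documents by name and shows how the eight name values are spread across the collection.
// optional check: count how many documents exist per name value
db.customers.aggregate([
  { $group: { _id: "$name", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
]);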
Experiment
Imagine that I need to find customers named "A1" who are 36 years old. In a real-world use case, this could be a task such as querying all transactions for a given customer name and date.
The query is as follows.
db.customers.find({name:"A1", age:36})
Query output
Atlas atlas-4gwxni-shard-0 [primary] justatest> db.customers.find({name:"A1", age:36})
[
  {
    _id: ObjectId('6684b910294f269e10fd84aa'),
    name: 'A1',
    age: 36,
    address: 'Address534',
    email: 'a1@email.com',
    phoneNumber: '123-456-789',
    profession: 'B5'
  },
  ...
  {
    _id: ObjectId('6684b910294f269e10fd85ff'),
    name: 'A1',
    age: 36,
    address: 'Address802',
    email: 'a1@email.com',
    phoneNumber: '123-456-789',
    profession: 'B3'
  },
In my experiment, the query returns 2,561 results. Note that the simulation is random, so the number of results may differ between trials.
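If you only need the number of matches rather than the documents themselves, countDocuments() accepts the same filter; the count it returns should match the number of results from find() above.
// count the matching documents directly instead of listing them
db.customers.countDocuments({ name: "A1", age: 36 });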
Cost examination
Now comes the interesting part: checking how much work the system has to do to execute the query. To do that, I use the "explain()" command as shown below.
db.customers.find({name:"A1", age:36}).explain("executionStats");
The output is lengthy, so I focus only on the important lines.
totalKeysExamined: 0,
totalDocsExamined: 1000000,
executionStages: {
  stage: 'COLLSCAN',
  filter: {
    '$and': [ { age: { '$eq': 36 } }, { name: { '$eq': 'A1' } } ]
  },
  nReturned: 2561,
  executionTimeMillisEstimate: 455,
The output above shows that the system has to go through all 1,000,000 documents to complete the work (i.e., to return the 2,561 results). The reason is that the query is executed with a "COLLSCAN", which simply means the system must scan the entire collection.
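Because the full explain() output is so long, one convenient trick (just a sketch, nothing specific to this experiment) is to store the result in a variable and print only the fields discussed above.
// keep only the interesting execution statistics from the explain() output
const result = db.customers.find({ name: "A1", age: 36 }).explain("executionStats");
printjson({
  stage: result.executionStats.executionStages.stage,
  totalKeysExamined: result.executionStats.totalKeysExamined,
  totalDocsExamined: result.executionStats.totalDocsExamined,
  nReturned: result.executionStats.nReturned
});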
Another way to view the cost is to click the "Explain" button in MongoDB Compass.
Database optimization
That workload is heavy. Adding indexes to the database is a commonly used technique to reduce it, and we can do that using the "createIndex()" command.
db.customers.createIndex({name: 1, age: 1}, {name: "IDX_NAME_AGE"});
IDX_NAME_AGE
The command creates a compound index on the name and age fields. We can check the existing indexes as follows.
db.customers.getIndexes();
[
{ v: 2, key: { _id: 1 }, name: '_id_' },
{ v: 2, key: { name: 1, age: 1 }, name: 'IDX_NAME_AGE' }
]
Querying after optimization
The next step is to run the "explain()" command again to check the workload required to complete the same query.
db.customers.find({name:"A1", age:36}).explain("executionStats");
Output
winningPlan: {
  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    ...
totalKeysExamined: 2561,
totalDocsExamined: 2561,
executionStages: {
  ...
  nReturned: 2561,
  executionTimeMillisEstimate: 6,
The output shows that the workload is significantly reduced: the system went from scanning the entire collection of 1,000,000 documents to examining only 2,561 documents. The execution time also drops sharply, but execution time varies a lot depending on the physical machine, so I want to draw your attention to the workload instead.
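If you want to compare both plans side by side without dropping the index, MongoDB lets you force a collection scan with hint(); the snippet below is an optional way to reproduce the "before" numbers on the same data.
// force a collection scan to compare against the index scan
db.customers.find({ name: "A1", age: 36 })
  .hint({ $natural: 1 })
  .explain("executionStats");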
Important notes
As presented above, indexes can be very helpful. The mechanism behind an index is that the database builds a smaller, sorted structure over the indexed fields. A simple way to picture it: imagine a line of 1,000,000 people with different names and ages, and the task is to find everyone named "A1" who is 36 years old. Without any ordering, the only way to complete the task is to ask every single person, one by one, which means asking 1,000,000 times. If the people in the line are sorted by name and age, however, finding the right ones becomes much easier.
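The toy JavaScript sketch below is only an illustration of this analogy, not MongoDB internals: a filter over a plain array has to touch every element, while a lookup structure keyed by name and age jumps straight to the matches.
// toy illustration: linear scan vs. keyed lookup
const people = [
  { name: "A1", age: 36 }, { name: "A2", age: 41 },
  { name: "A1", age: 36 }, { name: "A3", age: 29 }
];

// "asking everyone in the line": touches every element
const byScan = people.filter(p => p.name === "A1" && p.age === 36);

// "sorted line": build a lookup keyed by (name, age) once, then answer without scanning
const lookup = new Map();
for (const p of people) {
  const key = p.name + "|" + p.age;
  if (!lookup.has(key)) lookup.set(key, []);
  lookup.get(key).push(p);
}
const byLookup = lookup.get("A1|36") || [];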
Because the index is a separate, sorted structure, it physically takes up space in your storage, so creating redundant indexes is not good practice. This highlights the importance of understanding the business: in a specific use case, certain fields are queried much more often than others. Those frequently queried fields should be prioritized when designing the database, while the rarely used ones can be treated differently.
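To see that storage cost in practice, the commands below are one optional way to inspect index sizes and remove an index that turns out to be redundant; IDX_NAME_AGE is the index created earlier in this article.
// report the size of each index in bytes
db.customers.stats().indexSizes;

// drop an index that the workload does not actually need
db.customers.dropIndex("IDX_NAME_AGE");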
There is still much more to say about indexing, partitioning, and other techniques and strategies for database optimization. I believe that understanding them is significantly beneficial for developers.
Thank you for reading this far!
Have a nice day
Hoang
P.S.: I will come back to this topic with another experiment.