Introduction

One of the important steps in planning an application is deciding how you will represent your data. How will you organize your database's collections and documents? A poorly designed data representation can slow your application down by forcing you to make unnecessary queries to get the data you need.

What is normalization?

When you normalize your data, you divide it across multiple collections and link those collections with references. Each piece of data lives in exactly one document, and other documents point to it by reference. To change that data, you only need to update one document, because it is defined only once. However, ordinary MongoDB queries have no equivalent of SQL joins (the aggregation pipeline's $lookup stage is the closest thing). So if you need data from several collections, you will usually have to perform several queries.

Let's see an example of normalization. We have a users collection. We store each user's preferences in an accountsPref collection. We store each article written by users in an articles collection.

In a normalized example, our collections could look like this:

db.users.findOne({_id: userId})
{
  _id: ObjectId("5977aad83abbae8aef44b47b"),
  name: "John Doe",
  email: "johndoe@gmail.com",
  articles: [
    ObjectId("5977aad83abbae8aef44b47a"),
    ObjectId("5977aad83abbae8aef44b478"),
    ObjectId("5977aad83abbae8aef44b477")
  ],
  accountsPref: ObjectId("5977aad83abbae8aef44b490")
}

db.accountsPref.findOne({_id: id})
{
  _id: ObjectId("5977aad83abbae8aef44b490"),
  userId: ObjectId("5977aad83abbae8aef44b47b"),
  showFriends: true,
  notificationsOn: false,
  style: "light"
}

Each document stores references to related data instead of the data itself. This is a common way to model data in a relational database. But in MongoDB, you probably don't want to store data this way, because it takes a lot of queries to get what you want. To load a user's full information, you need three trips to the database: one for users, one for accountsPref, and one for articles. Meh...
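
To make that cost concrete, here is a minimal sketch in plain JavaScript that counts the round trips needed to assemble one user from normalized data. The collections are in-memory Maps standing in for MongoDB, and findOne, findMany, loadUserProfile, and the short string ids are hypothetical names invented for this illustration, not driver APIs.

```javascript
// In-memory stand-ins for the three collections (hypothetical data, no MongoDB needed).
const db = {
  users: new Map([
    ["u1", { _id: "u1", name: "John Doe", articles: ["a1", "a2"], accountsPref: "p1" }],
  ]),
  accountsPref: new Map([
    ["p1", { _id: "p1", userId: "u1", style: "light" }],
  ]),
  articles: new Map([
    ["a1", { _id: "a1", title: "First post" }],
    ["a2", { _id: "a2", title: "Second post" }],
  ]),
};

let roundTrips = 0;

// Each call models one query sent to the database.
function findOne(collection, id) {
  roundTrips += 1;
  return db[collection].get(id) || null;
}

// Models a single query that fetches several documents by id.
function findMany(collection, ids) {
  roundTrips += 1;
  return ids.map((id) => db[collection].get(id)).filter(Boolean);
}

// Assembles the full user by following each reference in turn.
function loadUserProfile(userId) {
  const user = findOne("users", userId);                     // trip 1
  const prefs = findOne("accountsPref", user.accountsPref);  // trip 2
  const articles = findMany("articles", user.articles);      // trip 3
  return { ...user, accountsPref: prefs, articles };
}

const profile = loadUserProfile("u1");
console.log(roundTrips); // 3
```

The second and third queries cannot even start until the first one returns, because the ids they need live in the user document.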

Denormalization

You want to optimize your reads. You don't want three trips to the database to get your information. We could store each user's account preferences as an embedded document, like so:

{
  _id: ObjectId("5977aad83abbae8aef44b47b"),
  name: "John Doe",
  email: "johndoe@gmail.com",
  articles: [
    ObjectId("5977aad83abbae8aef44b47a"),
    ObjectId("5977aad83abbae8aef44b478"),
    ObjectId("5977aad83abbae8aef44b477")
  ],
  accountsPref: {
    style: "light",
    showFriends: true,
    notificationsOn: false
  }
}

The advantage is that you need one fewer query to get the information. The downside is that the data takes up more space and is harder to keep in sync. For example, say we decide that the light style should be renamed day. We would have to update every single user document where accountsPref.style is light.
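
As a rough illustration of that rename, the sketch below walks an in-memory array of denormalized user documents and rewrites the embedded style. The sample users and the renameStyle helper are assumptions made up for this example. The comment shows what should be the equivalent shell command against a real collection.

```javascript
// Hypothetical denormalized user documents, held in memory for illustration.
const users = [
  { _id: 1, name: "John Doe", accountsPref: { style: "light", showFriends: true } },
  { _id: 2, name: "Jane Roe", accountsPref: { style: "dark", showFriends: false } },
  { _id: 3, name: "Sam Poe", accountsPref: { style: "light", showFriends: true } },
];

// Rewrites the embedded style in every matching document and returns how many changed.
// Against a real collection, the shell equivalent would be along the lines of:
//   db.users.updateMany(
//     { "accountsPref.style": "light" },
//     { $set: { "accountsPref.style": "day" } }
//   )
function renameStyle(docs, from, to) {
  let modified = 0;
  for (const doc of docs) {
    if (doc.accountsPref && doc.accountsPref.style === from) {
      doc.accountsPref.style = to;
      modified += 1;
    }
  }
  return modified;
}

const modifiedCount = renameStyle(users, "light", "day");
console.log(modifiedCount); // 2
```

The work grows with the number of users who picked that style, which is the price of keeping a copy of the value in every document.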

You can also use a hybrid of referencing and embedding. You keep the subdocument, add a reference to the full account preferences document, and embed only the most frequently used fields. If you know that only the style field is frequently used by your app, you could do this:

{
  _id: ObjectId("5977aad83abbae8aef44b47b"),
  name: "John Doe",
  email: "johndoe@gmail.com",
  articles: [
    ObjectId("5977aad83abbae8aef44b47a"),
    ObjectId("5977aad83abbae8aef44b478"),
    ObjectId("5977aad83abbae8aef44b477")
  ],
  accountsPref: {
    _id: ObjectId("5977aad83abbae8aef44b490"),
    style: "light"
  }
}

This can be a nice approach because your requirements may change over time. If you want to include more or less info on the page, you can always add or remove fields from the embedded document.
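
Here is one way the hybrid layout might be used, sketched in plain JavaScript with an in-memory Map in place of the accountsPref collection. The helpers getStyle, getFullPrefs, and setStyle, along with the short string ids, are hypothetical and exist only for this example.

```javascript
// In-memory stand-in for the accountsPref collection (hypothetical data).
const accountsPrefCollection = new Map([
  ["p1", { _id: "p1", userId: "u1", showFriends: true, notificationsOn: false, style: "light" }],
]);

// The user document embeds the reference plus the one frequently used field.
const user = { _id: "u1", name: "John Doe", accountsPref: { _id: "p1", style: "light" } };

let extraQueries = 0;

// Common case: the style is already in the user document, so no extra query.
function getStyle(userDoc) {
  return userDoc.accountsPref.style;
}

// Rare case: follow the reference to load the complete preferences.
function getFullPrefs(userDoc) {
  extraQueries += 1;
  return accountsPrefCollection.get(userDoc.accountsPref._id);
}

// A duplicated field has to be written in both places to stay in sync.
function setStyle(userDoc, style) {
  accountsPrefCollection.get(userDoc.accountsPref._id).style = style;
  userDoc.accountsPref.style = style;
}

console.log(getStyle(user), extraQueries); // light 0
```

The trade-off shows up in setStyle: the read path gets cheaper, but every write to a duplicated field now touches two documents.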

Considerations

It's always hard to know how to design your data. An important consideration is how often your data changes versus how often you read it. Normalization gives you a data representation that is efficient to update. Denormalization makes reads efficient.

Should you store account preferences inside each user's document? Generally, yes. They are only relevant to this user, and they probably won't change much either.

It's probably a good idea to keep a user's address inside the user document too. It will not be updated often, so there is no reason to pay for an extra query on every read just to fetch it.

Here are some guidelines to help you when considering this issue:

Embedding:

You have small subdocuments
Your data does not need to change regularly
You don't need immediate consistency (the data may be slightly out of date)
Your documents grow by a small amount
You would otherwise need a second query to fetch this data
You want faster reads

Referencing:

You have large subdocuments
Your data changes frequently
You need your data to be up-to-date
Your documents grow by a large amount
Your data is often excluded from your results
You want faster writes

MongoDB: Normalization vs Denormalization