Using DataLoader to batch SQL queries

Gajus Kuizinas - Oct 25 '19 - - Dev Community

Facebook DataLoader is a generic utility used to abstract request batching and caching.

I have discovered DataLoader as part of my research to solve N+1 problem arising when using GraphQL. DataLoader is referred to as a solution to reduce the number of round-trips. However, I have read the documentation and it wasn't immediately clear what it does or how it works; at first sight, it appeared to be a simple key-value cache storage with a getter function.

In this post I illustrate how DataLoader achieves request batching.

First of all, let's create an example of the N+1 problem.

N+1 problem

The N+1 problem arises when you have a data collection and each child of that collection owns another collection, e.g. a list of posts and each post has a list of tags.

A naive object-relational mapping (ORM) implementation will first fetch the posts and then will make a query for each post to get the tags. Using an example of MySQL, that would result in 5 queries:

mysql> SELECT * FROM `post`;
+----+------+
| id | name |
+----+------+
|  1 | huhu |
|  2 | lala |
|  3 | keke |
|  4 | koko |
+----+------+
4 rows in set (0.00 sec)

mysql> SELECT `tag`.*, `post_tag`.`post_id` FROM `post_tag` INNER JOIN `tag` ON `tag`.`id` = `post_tag`.`tag_id` WHERE `post_tag`.`post_id` = 1;
+----+------+---------+
| id | name | post_id |
+----+------+---------+
|  1 | toto |       1 |
|  2 | titi |       1 |
+----+------+---------+
2 rows in set (0.00 sec)

mysql> SELECT `tag`.*, `post_tag`.`post_id` FROM `post_tag` INNER JOIN `tag` ON `tag`.`id` = `post_tag`.`tag_id` WHERE `post_tag`.`post_id` = 2;
+----+------+---------+
| id | name | post_id |
+----+------+---------+
|  1 | toto |       2 |
|  1 | toto |       2 |
+----+------+---------+
2 rows in set (0.00 sec)

mysql> SELECT `tag`.*, `post_tag`.`post_id` FROM `post_tag` INNER JOIN `tag` ON `tag`.`id` = `post_tag`.`tag_id` WHERE `post_tag`.`post_id` = 3;
+----+------+---------+
| id | name | post_id |
+----+------+---------+
|  4 | tutu |       3 |
+----+------+---------+
1 row in set (0.00 sec)

mysql> SELECT `tag`.*, `post_tag`.`post_id` FROM `post_tag` INNER JOIN `tag` ON `tag`.`id` = `post_tag`.`tag_id` WHERE `post_tag`.`post_id` = 4;
+----+------+---------+
| id | name | post_id |
+----+------+---------+
|  1 | toto |       4 |
|  2 | titi |       4 |
+----+------+---------+
2 rows in set (0.00 sec)
Enter fullscreen mode Exit fullscreen mode

ORMs are smart

A lot of the ORMs optimize queries to fetch collection data. Here is an example using Bookshelf ORM to fetch the data from the previous example:

import createKnex from 'knex';
import createBookshelf from 'bookshelf';

const knex = createKnex({
  client: 'mysql',
  connection: {
    host: '127.0.0.1',
    user: 'root',
    database: 'blog'
  }
});

const bookshelf = createBookshelf(knex);

const Post = bookshelf.Model.extend({
  tableName: 'post',
  tags: function () {
    return this.belongsToMany(Tag)
  }
});

const Tag = bookshelf.Model.extend({
  tableName: 'tag'
});

bookshelf
  .Collection
  .extend({
    model: Post
  })
  .forge()
  .fetch({
    withRelated: 'tags'
  });
Enter fullscreen mode Exit fullscreen mode

The latter will fetch data using two queries:

select
  `post`.*
from
  `post`;

select
  `tag`.*,
  `post_tag`.`post_id` as `_pivot_post_id`,
  `post_tag`.`tag_id` as `_pivot_tag_id`
from
  `tag`
inner join
  `post_tag`
on
  `post_tag`.`tag_id` = `tag`.`id`
where
  `post_tag`.`post_id` in (?, ?, ?, ?);

Enter fullscreen mode Exit fullscreen mode

The problem arises when you know only the parent node when requesting its dependencies. You will observe this patten in a system where each node is responsible for fetching its own dependencies, e.g. GraphQL.

GraphQL nested queries

One of the key aspects of GraphQL is its ability to nest queries.

To continue with the blog example, lets implement a schema that would enable us to fetch all posts and their tags, i.e. a schema that supports the following query:

{
  posts {
    id,
    name,
    tags {
      id,
      name
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

The schema implementation:

import {
  graphql,
  GraphQLID,
  GraphQLInt,
  GraphQLList,
  GraphQLNonNull,
  GraphQLObjectType,
  GraphQLSchema,
  GraphQLString
} from 'graphql';

// For definition of Post and Tag, refer
// to the previous examples in the article.
import {
  Post,
  Tag
} from './models';

const TagType = new GraphQLObjectType({
  name: 'Tag',
  fields: () => {
    return {
      id: {
        type: new GraphQLNonNull(GraphQLID)
      },
      name: {
        type: new GraphQLNonNull(GraphQLString)
      }
    };
  }
});

const PostType = new GraphQLObjectType({
  name: 'Post',
  fields: () => {
    return {
      id: {
        type: new GraphQLNonNull(GraphQLID)
      },
      name: {
        type: GraphQLString
      },
      tags: {
        type: new GraphQLList(TagType),
        resolve: (post) => {
          return Post
            .forge({
              id: post.id
            })
            .load('tags')
            .call('related', 'tags')
            .call('toJSON');
        }
      }
    };
  }
});

const QueryType = new GraphQLObjectType({
  name: 'Query',
  fields: {
    posts: {
      type: new GraphQLList(PostType),
      resolve: (root) => {
        return Post
          .fetchAll()
          .call('toJSON');
      }
    }
  }
});

const schema = new GraphQLSchema({
  query: QueryType
});
Enter fullscreen mode Exit fullscreen mode

This example assumes that you have at least minimal knowledge of GraphQL. If you are not familiar with GraphQL, then a simple takeaway is:

  • posts node implements resolve method that fetches data for all Post models.
  • post has a property tags. tags resolve method has access to the data about the post.
  • post has a property tags. tags resolve method fetches data for all Tag models using the post id value.

In this example, tags method only knows about a single post at the time of being called, i.e. it does not know whether you are fetching information about a single post or many posts. As a result, executing the earlier query will result in N+1 problem, e.g.

const query = `{
  posts {
    id,
    name,
    tags {
      id,
      name
    }
  }
}`;

graphql(schema, query)

Enter fullscreen mode Exit fullscreen mode

The latter produces the following queries:

select `post`.* from `post`;
select `tag`.*, `post_tag`.`post_id` as `_pivot_post_id`, `post_tag`.`tag_id` as `_pivot_tag_id` from `tag` inner join `post_tag` on `post_tag`.`tag_id` = `tag`.`id` where `post_tag`.`post_id` in (1);
select `tag`.*, `post_tag`.`post_id` as `_pivot_post_id`, `post_tag`.`tag_id` as `_pivot_tag_id` from `tag` inner join `post_tag` on `post_tag`.`tag_id` = `tag`.`id` where `post_tag`.`post_id` in (2);
select `tag`.*, `post_tag`.`post_id` as `_pivot_post_id`, `post_tag`.`tag_id` as `_pivot_tag_id` from `tag` inner join `post_tag` on `post_tag`.`tag_id` = `tag`.`id` where `post_tag`.`post_id` in (3);
select `tag`.*, `post_tag`.`post_id` as `_pivot_post_id`, `post_tag`.`tag_id` as `_pivot_tag_id` from `tag` inner join `post_tag` on `post_tag`.`tag_id` = `tag`.`id` where `post_tag`.`post_id` in (4);

Enter fullscreen mode Exit fullscreen mode

Using DataLoader to batch queries

DataLoader is used to create a data loader. DataLoader is constructed using a batch loading function. A batch loading function accepts an array of keys, and returns a promise which resolves to an array of values.

Use the resulting data loader function to load values. DataLoader will coalesce all individual loads which occur within a single tick of an event loop and then call your batch loading function.

This definition of the DataLoader is taken more or less verbatim from the documentation. It sounds cool, but it didn't make much sense to me.

Lets use DataLoader to fix the N+1 problem in the blog example.

First, I need a function that can load a batch of tags in one query.

const getPostTagsUsingPostId = (postIds) => {
  return Post
    .collection(postIds.map((id) => {
      return {
        id
      };
    }))
    .load('tags')
    .call('toJSON')
    .then((collection) => {
      // Bookshelf 0.10.0 uses Bluebird ^2.9.4.
      // Support for .mapSeries has been added in Bluebird v3.
      return collection.map((post) => {
        return post.tags;
      });
    });
};

Enter fullscreen mode Exit fullscreen mode

getPostTagsUsingPostId will construct a single query to fetch tags for a list of post IDs, e.g.

select
  `tag`.*,
  `post_tag`.`post_id` as `_pivot_post_id`,
  `post_tag`.`tag_id` as `_pivot_tag_id`
from
  `tag`
inner join
  `post_tag`
on
  `post_tag`.`tag_id` = `tag`.`id`
where
  `post_tag`.`post_id` in (?)

Enter fullscreen mode Exit fullscreen mode

Second, I need to create a data loader function:

import DataLoader from 'dataloader';

const TagByPostIdLoader = new DataLoader(getPostTagsUsingPostId);

Enter fullscreen mode Exit fullscreen mode

Finally, I need PostType to use the .load() function of the resulting data loader object to resolve data:

const PostType = new GraphQLObjectType({
  name: 'Post',
  fields: () => {
    return {
      id: {
        type: new GraphQLNonNull(GraphQLID)
      },
      name: {
        type: GraphQLString
      },
      tags: {
        type: new GraphQLList(TagType),
        resolve: (post) => {
          return TagByPostIdLoader.load(post.id);
        }
      }
    };
  }
});

Enter fullscreen mode Exit fullscreen mode

Now, lets rerun the earlier query:

const query = `{
  posts {
    id,
    name,
    tags {
      id,
      name
    }
  }
}`;

graphql(schema, query)
Enter fullscreen mode Exit fullscreen mode

This time, we have fetched the data using just two queries:

select `post`.* from `post`
select `tag`.*, `post_tag`.`post_id` as `_pivot_post_id`, `post_tag`.`tag_id` as `_pivot_tag_id` from `tag` inner join `post_tag` on `post_tag`.`tag_id` = `tag`.`id` where `post_tag`.`post_id` in (1, 2, 3, 4)
Enter fullscreen mode Exit fullscreen mode

How does it work?

Hopefully, now the earlier description makes more sense:

DataLoader is used to create a data loader. DataLoader is constructed using a batch loading function. A batch loading function accepts an array of keys, and returns a promise which resolves to an array of values.

Use the resulting data loader function to load values. DataLoader will coalesce all individual loads which occur within a single tick of an event loop and then call your batch loading function.

If you are still struggling, I have made an interactive example:

// tonic ^6.0.0
const DataLoader = require('dataloader');

const getPostTagsUsingPostId = (ids) => {
  console.log(ids);

  return Promise.resolve(ids);
};

const TagByPostIdLoader = new DataLoader(getPostTagsUsingPostId);

TagByPostIdLoader.load(1);
TagByPostIdLoader.load(2);
TagByPostIdLoader.load(3);

// Force next-tick
setTimeout(() => {
  TagByPostIdLoader.load(4);
  TagByPostIdLoader.load(5);
  TagByPostIdLoader.load(6);
}, 100);

// Force next-tick
setTimeout(() => {
  TagByPostIdLoader.load(7);
  TagByPostIdLoader.load(8);
  TagByPostIdLoader.load(9);
}, 200);

setTimeout(() => {
  TagByPostIdLoader.load(10);
  TagByPostIdLoader.load(11);
  TagByPostIdLoader.load(12);
}, 200);
Enter fullscreen mode Exit fullscreen mode

To sum up

DataLoader allows you to decouple unrelated parts of your application without sacrificing the performance of batch data-loading. While the loader presents an API that loads individual values, all concurrent requests will be coalesced and presented to your batch loading function. This allows your application to safely distribute data fetching requirements throughout your application and maintain minimal outgoing data requests.[^https://github.com/facebook/dataloader/blame/68a2a2e9a347ff2acc35244ae29995ab625b2075/README.md#L88]

I told you – it is magic.

Further reading

I thoroughly recommend reading the source code of the DataLoader (less than 300 lines of code). Ignoring the cache logic, the underlying implementation is a simple queue that is using process.nextTick to resolve an array of promises. Yet, it is a genius application.

Finally, know that each DataLoader instance represents a unique cache. After being loaded once, the resulting value is cached, eliminating the redundant requests. You can leverage this to create a cache persisting throughout the life-time of the application, or create a new instance per each request. Continue to read about DataLoader caching to learn about cache invalidation.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .