Fetching Millions of Rows with Streams in Node.js

Patrick God - Feb 13 '20 - Dev Community

Have you ever had to face the challenge of fetching several million rows of data with lots of columns from a database and display them on the web?

Well, I had to do this recently. Although I was not convinced that this would make any sense at all, I still had to do it.

Anyways, here's the solution.

But first, the technology stack: Node.js, Sequelize & MariaDB. The client doesn't really matter, because the data was at times almost 4 GB in size, so Chrome would eventually crash anyway.

Sequelize was the first big problem.

The solution for results this big is streaming. Receiving all that data with one big call made Node crash. So streaming is the answer - and at that point, Sequelize did not support it.

Well, I was able to manually add streaming to Sequelize, but in the end the service call still took ages.

Here's a snippet of a usual Sequelize call:



// Check the connection, then run one big query and send everything back at once.
await sequelize.authenticate();
const result = await sequelize.query(sql, { type: sequelize.QueryTypes.SELECT });
res.send(result);



That's it. Of course, some parts are missing, like the whole database configuration and the actual definition of the get() call (e.g. where does res come from?). But I think you get the idea.
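Just for context, here's a rough sketch of how that snippet could live inside an Express route. The route path, the Sequelize setup and the SQL are placeholders - your actual configuration will look different.


const express = require('express');
const Sequelize = require('sequelize');

const app = express();
// Placeholder connection settings - use your own host, credentials and database.
const sequelize = new Sequelize('DATABASE', 'USER', 'PASSWORD', {
    host: 'HOST',
    dialect: 'mariadb'
});

// res comes from the Express route handler.
app.get('/rows', async (req, res) => {
    try {
        await sequelize.authenticate();
        // One big SELECT - exactly the call that blows up for huge results.
        const result = await sequelize.query('SELECT * FROM big_table', { type: sequelize.QueryTypes.SELECT });
        res.send(result);
    } catch (err) {
        console.log(err);
        res.sendStatus(500);
    }
});

app.listen(3000);
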

When you run this code, the result is simple: Node crashes. You could give Node more memory with --max-old-space-size=8000, for instance, but that's not a real solution.
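Just to show what that looks like when you start the service (the file name is only a placeholder):


node --max-old-space-size=8000 server.js
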

As already mentioned, you could kind of force Sequelize to stream the data. Now, what would that look like?



const { Readable } = require('stream');

let i = 1;
const s = new Readable({
    async read(size) {
        // Fetch the next chunk of one million rows and push it downstream.
        const result = await sequelize.query(
            sql + ` LIMIT 1000000 OFFSET ${(i - 1) * 1000000}`, { type: sequelize.QueryTypes.SELECT });
        this.push(JSON.stringify(result));
        i++;
        if (i === 5) {
            // That was the last chunk - push null to end the stream.
            this.push(null);
        }
    }
});
s.pipe(res);



In this example, I knew the number of rows I would get back from the database, hence the line with if (i === 5). It was just a test. You have to push null to end the stream. You could, of course, get the count of the whole result first and modify the code accordingly.
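If you don't want to hard-code the end of the stream like that, you can fetch the row count first and derive the number of chunks from it. Here's a rough sketch of that variant - the chunk size, the COUNT query and the table name are made up for illustration, and it sits in the same async handler as the snippets above:


const { Readable } = require('stream');

const CHUNK_SIZE = 1000000;

// Get the total row count first and derive the number of chunks from it.
const [{ total }] = await sequelize.query(
    'SELECT COUNT(*) AS total FROM big_table',
    { type: sequelize.QueryTypes.SELECT });
const chunks = Math.ceil(Number(total) / CHUNK_SIZE);

let i = 0;
const s = new Readable({
    async read(size) {
        if (i === chunks) {
            // All chunks sent - end the stream.
            this.push(null);
            return;
        }
        const result = await sequelize.query(
            sql + ` LIMIT ${CHUNK_SIZE} OFFSET ${i * CHUNK_SIZE}`,
            { type: sequelize.QueryTypes.SELECT });
        this.push(JSON.stringify(result));
        i++;
    }
});
s.pipe(res);
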

The whole idea behind this is to make smaller database calls and return the chunks with the help of the stream. This works, Node does not crash, but it still takes ages - almost 10 minutes for 3.5 GB.

[Screenshot: Streaming with Sequelize]

What's the alternative?

The MariaDB Node.js connector.

Here's how a usual query looks:



const mariadb = require('mariadb');
const pool = mariadb.createPool({ host: "HOST", user: "USER", password: "PASSWORD", port: 3308, database: "DATABASE", connectionLimit: 5 });

// Grab a connection from the pool and run the query in one go.
let conn = await pool.getConnection();
const result = await conn.query(sql);
// Give the connection back to the pool once the result is in.
conn.release();
res.send(result);



It's much faster. But let me jump right to the streaming code:



const stream = require('stream');

let conn = await pool.getConnection();
// queryStream emits the rows one by one instead of buffering the whole result.
const queryStream = conn.queryStream(sql);
const ps = new stream.PassThrough();
// The rows arrive as objects, so stringify them before they go out to the client.
const transformStream = new stream.Transform({
    objectMode: true,
    transform: function transformer(chunk, encoding, callback) {
        callback(null, JSON.stringify(chunk));
    }
});
stream.pipeline(
    queryStream,
    transformStream,
    ps,
    (err) => {
        if (err) {
            console.log(err);
            return res.sendStatus(400);
        }
    });
ps.pipe(res);



This may look a bit cryptic, but what happens here is that you create a pipeline to push everything through. First comes the queryStream, which is the result of the database query. Then the transformStream sends the stringified chunks along (only strings and buffers are allowed past this point, hence the stringifying). And finally the PassThrough, plus a callback for the error case.

With ps.pipe(res) you stream the result to the client.
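One caveat: stringifying every chunk on its own means the client receives a bunch of JSON objects glued together, not one valid JSON document. If you need a proper JSON array on the other end, the transform can take care of the brackets and commas itself. Here's a small sketch of that idea - not part of my original code, and it pipes straight into res instead of going through the PassThrough:


const stream = require('stream');

// Sketch: emit one valid JSON array instead of concatenated objects.
const toJsonArray = new stream.Transform({
    objectMode: true,
    transform(row, encoding, callback) {
        // Open the array on the first row, otherwise separate rows with commas.
        const prefix = this.started ? ',' : '[';
        this.started = true;
        callback(null, prefix + JSON.stringify(row));
    },
    flush(callback) {
        // Close the array - or emit an empty one if no rows came through.
        callback(null, this.started ? ']' : '[]');
    }
});

res.setHeader('Content-Type', 'application/json');
stream.pipeline(conn.queryStream(sql), toJsonArray, res, (err) => {
    if (err) {
        console.log(err);
    }
});
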

And here's the result:
[Screenshot: Streaming with the MariaDB Node.js connector]

Under 4 minutes for the same data, and you'll barely notice Node using any RAM at all.

So, if you're challenged with a similar task, think about streaming the data.

Or you convince your client that this kind of requirement is unrealistic for the web.

P.S. Pagination was not an option. We needed all of the data at once.

Image created by brgfx on freepik.com.

