Running a safe Postgres migration

Kevin Garcia · Mar 3 '21 · Dev Community

This post was written by Peter Johnston, a software engineer at Retool, and was originally published on the Retool blog. But we wanted to know: what are we missing?

--

A few weeks ago, I unintentionally took Retool down for 2 minutes when an out-of-control database migration tried to rewrite a mission-critical table.

As my teammate and I cleaned up the mess—caused by incorrectly using transactions in the migration—I reflected on my last few years working with Postgres (officially PostgreSQL) and the different lessons I’ve learned when it comes to writing a safe database migration.

Database migrations aren’t easy; a migration that runs totally fine on your laptop may cause problems in production due to increased data and traffic volumes. And every time your app goes down, so does customer trust and revenue.

Over the last 3 years, our team at Retool has navigated a few pitfalls and fixed major issues related to migrations in Postgres. This post will highlight key lessons learned and what you can do to avoid making the same mistakes.

Our tech stack at Retool

Our team of about 25 Retool engineers is focused on building the fastest way to build internal applications. Every company builds employee-facing tools to support important work, and our platform provides a drag-and-drop interface to quickly build, integrate, and use internal tools.

We’ve got a pretty standard JavaScript-centric tech stack, with a TypeScript + React + webpack frontend, a Node + Express backend, and a Postgres database server. We use Sequelize as an ORM to communicate with Postgres.

It’s also important to note that we package Retool up in two ways: a cloud SaaS application and an application our enterprise customers can run inside their own networks, like Sentry or GitHub Enterprise. On-premise customers run Postgres themselves, and Retool engineers don’t have access to these databases. For these customers, we have to ship database migrations that can safely run automatically in any number of environments.

A common migration pitfall for Sequelize + Postgres users

Wrapping your schema modifications in a transaction is almost always the right call in Postgres, because it means your database will be left in a known-good state even if the migration fails.

For other Sequelize + Postgres users out there, be aware that the default behavior for Sequelize does not wrap migrations in a database transaction. As the migration author, you are responsible for using a transaction in every migration you write.

This stems from Sequelize being an ORM that supports database systems that don’t offer transactions for schema modifications, but it is one of the most common issues I’ve seen in migrations written at Retool.

Using a transaction is easy. You just have to remember to do it!

module.exports = {
    up: function (queryInterface, Sequelize) {
        const db = queryInterface.sequelize;
        // Run both column additions inside one transaction, so a failure
        // partway through rolls back both and leaves the schema unchanged.
        return db.transaction(async (transaction) => {
            await queryInterface.addColumn('users', 'resetPasswordToken', {
                type: Sequelize.TEXT,
                allowNull: true,
                defaultValue: null,
            }, { transaction });
            await queryInterface.addColumn('users', 'resetPasswordExpires', {
                type: Sequelize.DATE,
                allowNull: true,
                defaultValue: null,
            }, { transaction });
        });
    },
    ...
};

How to structure common migration archetypes

Inspired by the strong_migrations Ruby gem, a tool for catching risky migrations in Rails apps, this section discusses a few migration types that frequently cause downtime, along with strategies for mitigating it.

Deleting a column

Directly deleting a column is never wise. The key reason is that you cannot deploy app changes and database schema changes atomically.

If you have more than one app server, your deploys happen over time: during the deploy, you have both the old version and new version simultaneously. Therefore, your database schema must be able to support both the old and new versions. Even if you have only one app server, you still must either deploy the code change or the schema change first, leading to the same time skew problem.

In the period between the application of your delete migration and the deploy of the new app code ignoring the column, the old app version cannot query the affected table, as the old version’s Sequelize is emitting queries referencing the dropped column name. Any web requests querying the table will likely 500. This could cause a major outage if the table in question is a popular one like users, likely queried in every request.

Instead, you should consider doing it in two phases (the second is sketched after this list):

  • First, remove all references in the app to the column as well as from the Sequelize models file. Deploy these changes. At this point, your app won’t be issuing any queries referencing the to-be-dropped column.
  • Next, write your migration dropping the column. This migration is now safe, because you’ve ensured the app won’t query the dropped column.
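
As a minimal sketch, the second-phase migration might look like the following, with a hypothetical legacyToken column on users (the column name is purely illustrative):

module.exports = {
    up: function (queryInterface, Sequelize) {
        const db = queryInterface.sequelize;
        return db.transaction(async (transaction) => {
            // Safe only because the app and the Sequelize model no longer
            // reference this column: see the first phase above.
            await queryInterface.removeColumn('users', 'legacyToken', { transaction });
        });
    },
    down: function (queryInterface, Sequelize) {
        return queryInterface.addColumn('users', 'legacyToken', {
            type: Sequelize.TEXT,
            allowNull: true,
        });
    },
};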

Renaming a column

Just like deleting a column, renaming a column in one shot is unwise for the same reason: code changes and schema changes cannot happen together atomically, so your database schema must support both old and new versions of the app.

If your tables are serving production traffic and you want to safely rename a column, you should do it gradually (the first step is sketched after this list):

  • First, introduce a new column with your desired name in a database migration, and update any writes in your app to the old column to also write to the new one. At this point, any new or updated rows in your table have the correct value in the new column, but you need to ensure all existing rows do, too.
  • Next, write a data backfill (see below for tips on making that safe) to copy the values from the old column to the new one. At this point, the new column can be used as the source of truth, since you’ve ensured that both new and existing rows have up-to-date values under the new name.
  • Next, start retiring the old column: update the app code to ignore it, so that all reads use the new column and writes stop touching the old one, and drop the old column from the Sequelize model file.
  • Finally, write a second migration to remove the old column.
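
As a rough sketch, the first step’s migration might look like the following, assuming we are renaming a hypothetical name column on users to fullName (both names are illustrative; the dual writes happen in your app code, not in the migration):

module.exports = {
    up: function (queryInterface, Sequelize) {
        const db = queryInterface.sequelize;
        return db.transaction(async (transaction) => {
            // Add the new column alongside the old one. The app dual-writes
            // to both columns until the backfill in step two has copied over
            // every existing row.
            await queryInterface.addColumn('users', 'fullName', {
                type: Sequelize.TEXT,
                allowNull: true,
                defaultValue: null,
            }, { transaction });
        });
    },
    down: function (queryInterface, Sequelize) {
        return queryInterface.removeColumn('users', 'fullName');
    },
};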

Whew, that is a lot of steps. Renaming a column safely is so hard that I recommend paying careful attention to your names when creating columns, so you don’t have to go through this unless it’s absolutely necessary.

Data backfills

Lots of apps, including Retool, use the database migration system for data backfills in addition to schema modifications. At least for now, we use our migration system for backfills to ensure our on-premise customers benefit from them as well.

Backfills are risky because of lock contention. Done improperly, you may end up causing a critically important table to be unavailable for the duration of the backfill; for large tables of millions of rows, the backfill may take quite a long time to complete, leading to a total app outage.

Understanding locks can help you understand how backfills can lead to outages. Here is a brief overview of how locks work in Postgres:

  • Every Postgres statement, even the humble SELECT, acquires table-level locks. This is how Postgres coordinates concurrent access to data. In addition, certain statements, like UPDATEs or SELECT FOR UPDATEs, acquire row-level locks.
  • Postgres locks are tracked at the “session” level. You can think of a session as being equivalent to a “connection”: a session is the combination of the client connection process (running on your app server) and the Postgres server process serving the connection.
  • Sessions hold on to all locks until the transaction finishes, either successfully via COMMIT or unsuccessfully via ROLLBACK.
  • Row-level locks acquired during a write block concurrent writes. That is, once one connection has locked a row, attempts in other connections to write to that row will block until the first connection releases its lock. See the full Postgres docs for the details.

In light of the facts above, it’s easy to write a backfill operating over a large table that leads to an outage. If you use one transaction for the whole backfill, your backfill connection will acquire row-level locks for every row it updates, and then hold on to these row-level locks for the remainder of the backfill. Postgres will then block any other attempts to write to those rows until the backfill completes. This is very bad news if the table you’re backfilling is large and serves a lot of writes, like a sessions table.

The trick to resolving this is to batch your backfills. Instead of using one transaction for the entire backfill, you should break your backfills into batches of, say, 100 rows, and wrap each batch in a transaction instead. This ensures you only hold on to those row-level locks for a short time.

The downside, however, is that you lose the benefits of transactionality: your backfill may fail partway through, leaving only part of your data migrated. The solution is to ensure your backfill is reentrant, meaning that multiple invocations produce the same result as a single invocation. Fortunately, many common backfill types, like copying a value from one column to another on the same row, are naturally reentrant.
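
Here is a minimal sketch of a batched, reentrant backfill, reusing the hypothetical name-to-fullName rename from above (the batch size and column names are illustrative, and it assumes the Postgres dialect, where Sequelize returns the affected row count in the query metadata):

module.exports = {
    up: async (queryInterface, Sequelize) => {
        const db = queryInterface.sequelize;
        // Each UPDATE runs as its own implicit transaction and touches at
        // most 100 rows, so row-level locks are held only briefly. The WHERE
        // clause makes the backfill reentrant: rows that already have a
        // value are skipped, so re-running after a partial failure is safe.
        let rowsUpdated = 1;
        while (rowsUpdated > 0) {
            const [, metadata] = await db.query(`
                UPDATE users SET "fullName" = "name"
                WHERE id IN (
                    SELECT id FROM users
                    WHERE "fullName" IS NULL AND "name" IS NOT NULL
                    LIMIT 100
                )
            `);
            rowsUpdated = metadata.rowCount;
        }
    },
    down: async () => {
        // Backfills are not automatically reversible; intentionally a no-op.
    },
};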

Adding a column with a default

Cool kids on newer versions of Postgres: skip this section. In Postgres ≥ 11, adding a column with a default value is safe!

However, for anyone running older versions of Postgres, beware of adding a column with a default value. This causes Postgres to add your default value to all existing rows during the ALTER TABLE statement, resulting in a full table rewrite. Because ALTER TABLE acquires an ACCESS EXCLUSIVE table level lock, which blocks all other attempts to read or write the table, you’re effectively taking the table down while Postgres rewrites the entire table. This may be acceptable for small tables that can be quickly rewritten, but can easily lead to a major outage for large tables.

The solution is actually quite easy (the first two steps are sketched after this list):

  • First, in your migration, add your column without a default value.
  • Then, in the same migration, alter the column you just added to have a default value.
  • Then, in a separate data backfill, update all existing rows to have your default value, in batches (because you read the section above).
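
Concretely, the first two steps might look like this in one migration (the isActive column is hypothetical; the SET DEFAULT is issued as a raw query to make the metadata-only nature of the change explicit):

module.exports = {
    up: function (queryInterface, Sequelize) {
        const db = queryInterface.sequelize;
        return db.transaction(async (transaction) => {
            // Step one: add the column without a default. This is a fast,
            // metadata-only change even on Postgres versions before 11.
            await queryInterface.addColumn('users', 'isActive', {
                type: Sequelize.BOOLEAN,
                allowNull: true,
            }, { transaction });
            // Step two: set the default. This affects only rows inserted or
            // updated from now on; existing rows read NULL until the separate
            // batched backfill (step three) fills them in.
            await db.query(
                'ALTER TABLE users ALTER COLUMN "isActive" SET DEFAULT true;',
                { transaction }
            );
        });
    },
    down: function (queryInterface, Sequelize) {
        return queryInterface.removeColumn('users', 'isActive');
    },
};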

This might seem bizarre. Why does a single statement adding the column and the default lead to dangerous behavior, while splitting it into two ALTER TABLEs works just fine?

Essentially, altering an existing column to have a default value is a very different operation than adding a column with a default value. Altering an existing column to have a default is a small metadata operation: the new default value applies only to newly inserted rows or freshly updated rows. Old rows will continue to read out NULL until you explicitly update them, as we did in the above backfill. The key insight is to do the long, slow backfill separately from the ALTER TABLE operation acquiring the expansive ACCESS EXCLUSIVE lock.

Later versions of Postgres, starting at version 11, handle this for you. In modern Postgres, adding a column with a default value is also a small metadata operation: Postgres fills in any NULLs in your column with a default on readout instead of rewriting the entire table. Nice! It’s a great reason to upgrade if you’ve been putting off that chore.

Adding an index

In Postgres, adding an index acquires a SHARE table-level lock, which blocks concurrent writes to that table. For big tables, this is a big deal, since index builds can take a long time. For a production app serving real traffic, this is likely a dealbreaker: rarely can you afford to stop all writes for an extended period of time.

Postgres provides an out-of-the-box solution to this: concurrent index builds. In exchange for taking longer, Postgres builds the index in the background, and when it’s done, transparently starts using the index to speed up queries. Nice!

module.exports = {
    // No transaction here: Postgres cannot run CREATE INDEX CONCURRENTLY
    // inside a transaction block (and this migration is only one statement anyway).
    up: async (queryInterface, Sequelize) => {
        await queryInterface.addIndex('users', ['email'], {
            name: 'users_email_idx',
            concurrently: true,
            unique: true,
        });
    },
    down: async (queryInterface, Sequelize) => {
        await queryInterface.removeIndex('users', 'users_email_idx');
    },
};

However, concurrent index builds aren’t perfect: they can be hazardous for index types, like unique indices, that may fail to build. This can result in an invalid index that isn’t used for querying even though the migration applied. Fixing this requires resolving the data issue leading to the index build failure, then manually dropping and re-adding the index.
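
If you do hit an invalid index, the manual cleanup might look something like this one-off script (a sketch, not a migration; it reuses the users_email_idx example above and assumes you’ve already fixed the duplicate data that broke the build):

// One-off cleanup, run with direct access to the affected database.
// Both statements use CONCURRENTLY, which cannot run inside a transaction;
// sequelize.query issues each statement standalone by default.
async function rebuildInvalidIndex(sequelize) {
    // Drop the invalid index left behind by the failed concurrent build.
    await sequelize.query('DROP INDEX CONCURRENTLY IF EXISTS users_email_idx;');
    // Rebuild it; this succeeds now that the offending rows are resolved.
    await sequelize.query(
        'CREATE UNIQUE INDEX CONCURRENTLY users_email_idx ON users (email);'
    );
}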

For folks with on-premise customers, like Retool, this might mean occasionally dropping into Zoom debugging sessions to resolve these sorts of failed index builds, as you won’t have direct access to your clients’ Postgres databases.

When things go wrong: statement timeout

Mistakes happen and things go sideways. You can’t just try to avoid mistakes; you also need measures in place to mitigate the fallout of the ones that slip through.

At Retool, we have one Postgres setting configured that is nothing short of a life-saver: statement_timeout.

From the Postgres docs: “Abort any statement that takes more than the specified number of milliseconds, starting from the time the command arrives at the server from the client.”

We currently have it set to two minutes. That is, Postgres will abort any database statement that takes more than two minutes. This was invaluable in our outage the other week, when a runaway UPDATE statement blocked concurrent writes on a very large, mission-critical table. Because we had the statement_timeout in place, after two minutes, Postgres automatically terminated the naughty UPDATE, and Retool rapidly recovered.

statement_timeout transformed what could have been a long, catastrophic outage into a 2-minute blip.

Normally, you should configure statement_timeout on the client side and not in postgresql.conf, which applies to all database sessions. However, to ensure that all common database write processes at Retool have this important parameter set, we actually configure statement_timeout at the role level: alter role retool set statement_timeout to '2min';
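
For the client-side approach, node-postgres (which Sequelize uses for its Postgres dialect) accepts a per-connection statement_timeout, so a sketch might look like this (the connection string is illustrative):

const { Sequelize } = require('sequelize');

const sequelize = new Sequelize(process.env.DATABASE_URL, {
    dialect: 'postgres',
    dialectOptions: {
        // Passed through to node-postgres: abort any statement on this
        // connection that runs longer than two minutes (value in ms).
        statement_timeout: 120000,
    },
});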

Learn as you go

Getting database migrations right is tricky. There are a lot of subtle corner cases demanding extra diligence that only appear when your database is large or serving a lot of user traffic.

Migration safety is a skill that all startups have to learn as they grow and become more essential in their customers’ lives. All of the dangerous examples above only affect production databases and don’t show up at all in simpler environments, like your local laptop.

The guide above should serve you well against the common pitfalls we’ve encountered in our work with Postgres. I hope it helps you avoid some of the pain we’ve run into, so you can focus on creating more value for your customers and company.
