Every once in a while I read Lucian Ghinda's Short Ruby newsletter and I see a "take" there which begs for some elaboration and at least slight disagreement. So I decided to try to write these short (but too long for a tweet) replies here.
Today's topic comes from issue #59, where Matt Swanson touches on the topic of data migrations.
(here is a link to the tweet)
So what's wrong here? I think that, first of all, we need to separate what Matt conveniently put together, because "don't reference Rails models in migrations" and "don't run data migrations in schema migrations" are two separate rules coming from different backgrounds.
Don't reference models
The rule of not referencing models relies on the fact that the code of the models changes, and it might change a lot. The assumption here is that a migration referencing a model will run successfully on Monday but fail on Thursday, because the model file has changed and the assumptions made in the migration code are no longer correct.
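To make that failure mode concrete, here is a plain-Ruby sketch with no Rails involved. `OrderOnMonday`, `OrderOnThursday`, `legacy_number`, and `backfill` are all hypothetical stand-ins: the two structs represent the same model at two points in time, and `backfill` plays the role of the migration body.

```ruby
# Hypothetical stand-ins for one model at two points in time.
# On Monday the model still exposes `legacy_number`.
OrderOnMonday = Struct.new(:id, :legacy_number, :number)

# By Thursday someone has removed `legacy_number` from the model.
OrderOnThursday = Struct.new(:id, :number)

# Plays the role of the migration body: it was written on Monday and
# assumes every order responds to `legacy_number`.
def backfill(orders)
  orders.each { |order| order.number = "ORD-#{order.legacy_number}" }
end

# Against Monday's model the migration code works.
monday_orders = [OrderOnMonday.new(1, 1001, nil)]
backfill(monday_orders)
puts monday_orders.first.number

# Same migration code, newer model: the assumption no longer holds.
thursday_orders = [OrderOnThursday.new(1, nil)]
begin
  backfill(thursday_orders)
rescue NoMethodError => e
  puts "migration failed: #{e.class}"
end
```

Nothing in the migration file changed between the two runs; only the model did.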
I generally agree that the points in the tweet can greatly reduce the risk of that happening. At the same time, these points are not realistic. Just how would you "run migrations in dev often" during your two-week vacation? A fortnight is definitely enough time in a larger project for a model and a migration to go out of sync, resulting in a migration that can no longer run.
However, I don't want to dwell on it, because it's up to your team whether you want to take this risk. The second part is much more interesting.
Don't run data migrations alongside schema migrations
Let's say it out loud: migrations were conceived as a tool to modify the database schema consistently across environments. If you don't believe me, just check the name of the table that keeps track of the migrations already run: schema_migrations.
However, over time, people started to use them to run data migrations too. This has a few obvious upsides:
- You already have a tool for that
- Everyone in your organization is expected to run migrations in dev environment regularly, so they will have their data in sync automatically. You don't need to announce on Slack that "everyone please run rake data:backfill_order_numbers".
- You already have a step of running it in your deployment pipeline and you don't have to add anything new
In my experience, these advantages start to fade when the project (and especially the database) grows. Here are some problems I have encountered with this approach:
- Data migrations can take a lot of time. If you put a schema change and a data migration in the same file, you might end up with a table locked exclusively for a prolonged amount of time, effectively bringing your app down.
- Data in the production database is often far more exotic than in staging/test environments. The migration might pass on staging but fail on production because of some weird data in the database. Now your pipeline is blocked and you either have to revert or submit a hotfix.
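- It's not uncommon to have parameterized data migrations, where you first run it for only some tenants or just in some countries. When you check that everything works fine, you proceed with another batch. Schema migrations run once per environment and are not designed to take parameters, so this kind of staged rollout is awkward to express with them.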
- You have little control over when the data migration is run. In my current company, for example, you just put your PR in a merge queue and you don't really know when it will be merged (and deployed). Sometimes it's hours later. Since a data migration might put significant stress on the database, it's better to have strict control over when it runs.
- Similarly, you might want to merge and deploy the code during working hours, but run the data migration in off-hours (or even during the weekend). But you have coupled one to the other, so you cannot really do that.
- Last but definitely not least: if you write your data migration as a rake task, you can actually write tests for it. Do you write tests for your migrations?
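As a rough sketch of what several of these points look like in practice, the data migration can live in a plain Ruby object that a rake task calls. Everything here is hypothetical and chosen for illustration: the `BackfillOrderNumbers` class, the `tenant` and `batch_size` parameters, the `ORD-` number format, and the in-memory array of hashes standing in for database rows. In a real Rails app the rows would come from a batched query instead.

```ruby
# Hypothetical data migration, kept outside of db/migrate.
# Orders are plain hashes here so the sketch runs without a database.
class BackfillOrderNumbers
  def initialize(orders, batch_size: 2)
    @orders = orders
    @batch_size = batch_size
  end

  # Backfills only the given tenant, so the rollout can go tenant by tenant.
  # Returns the number of orders that were updated.
  def call(tenant:)
    pending = @orders.select { |o| o[:tenant] == tenant && o[:number].nil? }

    # Work in small slices; a real task could pause or commit between them.
    pending.each_slice(@batch_size) do |batch|
      batch.each { |order| order[:number] = format("ORD-%06d", order[:id]) }
    end

    pending.size
  end
end

orders = [
  { id: 1, tenant: "acme", number: nil },
  { id: 2, tenant: "acme", number: "ORD-000002" },
  { id: 3, tenant: "globex", number: nil }
]

# Run for a single tenant first; other tenants stay untouched.
updated = BackfillOrderNumbers.new(orders).call(tenant: "acme")
puts "updated #{updated} order(s) for acme"
```

A rake task such as the rake data:backfill_order_numbers mentioned earlier would then only need to read the tenant from its arguments and call the object at a time you choose. A test can exercise the same object directly, including with the kind of weird data you expect in production.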
To me personally, these points far outweigh the comfort of using schema migrations for data migrations. Sure, by separating them you lose the comfort of using the tool that's already there, but by keeping them together you put the stability of the app, the well-being of the whole team and your peace of mind at risk.
So I guess my "counter-take" is:
At least consider running data migrations separately from your schema migrations, knowing what's at stake.