Cover Photo Credit: Pixels
Introduction
One of the most interesting challenges I’ve encountered is when adding new fields to an existing database table and needing to back-fill values for those fields. Often, back-filling involves performing some sort of aggregation or calculation—sometimes involving other tables or columns—and then updating the target table with those calculated values.
Let’s explore a concrete example using some sample data.
Problem Example
Imagine we decide to add two new fields to our events table:
-
number_of_ticket_types
: This will hold the count of different ticket types associated with each event. -
number_of_tickets_available
: This will represent the total number of tickets still available for the event, calculated as the sum ofticket.quantity
minusticket.quantity_sold
for each event.
These fields can be useful if, for instance, you are retrieving a list of events
and need to display ticket information without requiring a user to click into the details of each event.
We’ll start by adding a migration that adds these new columns to our events
table:
Fields | Type | Constraints |
---|---|---|
title | text | |
owner_id | text | FK -> Users |
id | VARCHAR(30) | PK |
number_of_ticket_types | INT | default:0 |
number_of_tickets_available | INT | default:0 |
However, when these new fields are added, their values will be set to 0 for all existing events, even though we already have tickets in the tickets table. Now, we need to backfill those fields with the correct data.
A First Attempt at Backfilling
One way to backfill these new fields is to retrieve data from the tickets table, perform calculations, and then update the events table accordingly.
Here’s an example of how to backfill the number_of_ticket_types field:
async function backFillTicketTypes() {
const ticketTypeResult = await TicketsModel.findAll({
attributes: [
"event_id",
[sequelize.fn("count", sequelize.col("id")), "ticket_type_count"],
],
group: ["Tickets.event_id"],
});
/*
Result would look like this:
ticketTypeResult = [
{event_id: 'eventId1', ticket_type_count: 10},
{event_id: 'eventId2', ticket_type_count: 4},
{event_id: 'eventId3', ticket_type_count: 3},
]
*/
for (const eventTicketTypeObject of ticketTypeResult) {
const { event_id, ticket_type_count } = eventTicketTypeObject;
await EventsModel.update(
{
ticketTypeCount: ticket_type_count,
},
{
where: {
id: event_id,
},
}
);
}
}
The second field, number_of_tickets_available
, can be backfilled similarly by calculating the total available tickets:
async function backFillNumberOfTicketsAvailable() {
const ticketsAvailableResult = await TicketsModel.findAll({
attributes: [
"event_id",
[sequelize.fn("sum", sequelize.col("quantity")), "total_tickets"],
[
sequelize.fn("sum", sequelize.col("quantity_sold")),
"total_tickets_sold",
],
],
group: ["Tickets.event_id"],
});
/*
Result would look like this:
ticketsAvailableResult = [
{event_id: 'eventId1', total_tickets: 10, total_tickets_sold: 2},
{event_id: 'eventId2', total_tickets: 23, total_tickets_sold: 12},
{event_id: 'eventId3', total_tickets: 14, total_tickets_sold: 8},
]
*/
for (const numberOfTicketAvailable of ticketsAvailableResult) {
const { event_id, total_tickets, total_tickets_sold } =
numberOfTicketAvailable;
await EventsModel.update(
{
number_of_ticket_types: total_tickets - total_tickets_sold,
},
{
where: {
id: event_id,
},
}
);
}
// Run update query...
}
What's Wrong with This Approach?
While this method works, it’s quite verbose and involves multiple database queries:
- A query to retrieve ticket_type_count for each event.
- A query to retrieve total_tickets and total_tickets_sold for each event.
- An
UPDATE
query for each event that has been created in the database. This is a really expensive task if many items have been created in the database previously
In the above example the total number of dB calls is dependent on the total number of events that exists in the application.
Which would greatly slow down our server
Note: If we use the techniques from the previous article(Senerio 1), we would be able to reduce the number of
UPDATE
query calls to 1 per function. Which would reduce the total queries to four
A Better Solution
A more efficient way to handle this problem is to use a single SQL query to perform both updates in one go. We can achieve this using a WITH clause (common table expression, or CTE) and a single UPDATE query.
To learn more about CTEs, check out [this article(https://dev.to/chidioguejiofor/whats-with-cte-53ao)]
Here’s how we can do it:
WITH update_cte AS (
SELECT
event_id,
COUNT(id) AS ticket_type_count,
SUM(quantity) - SUM(quantity_sold) AS number_of_tickets_available
FROM
tickets
GROUP BY
event_id
)
UPDATE
events
SET
number_of_ticket_types = update_cte.ticket_type_count,
number_of_tickets_available = update_cte.number_of_tickets_available
FROM
update_cte
WHERE
update_cte.event_id = events.id;
Explanation
- The
WITH update_cte
clause calculates both theticket_type_count
andnumber_of_tickets_available
in one go usingCOUNT
andSUM
functions, grouped byevent_id
. - In the
UPDATE
statement, we join the events table with the CTE (update_cte
) based on event_id. - The fields in events (
number_of_ticket_types
andnumber_of_tickets_available
) are updated with the values from the CTE.
Why Is This Better?
This approach performs both updates in a single query, eliminating the need for multiple database queries and significantly improving performance. Instead of managing two separate functions and multiple steps, we only need to maintain one query that efficiently backfills the data.
Additionally, the same concept can be applied to future backfills if similar aggregate fields are added to the events table.
Conclusion
Backfilling data after adding new fields to an existing table is a common task in application development. While the initial approach of making multiple queries works, it can be optimized significantly using SQL’s powerful aggregation and CTE features. By consolidating updates into a single query, we reduce database load, improve performance, and make the code more maintainable.