🏺Database Architecture - History Over State

Randall - Apr 2 '23 - - Dev Community

What should a database store? There are two broad ways of thinking about it:

  1. A database should store state
  2. A database should store history

Engineers tend to gravitate towards treating their database as a store of state. It is the more obvious approach, and it tends to be a bit simpler.

But it turns out that treating your database as a store of history instead can make your applications much more flexible and resilient. Let us take a look...

What's the difference?

Let us pretend we have a simple web game and we want to display a leaderboard to our users. Each match has one or more players, and each player earns zero or more points. We want to show a ranking of players by how many points they have earned in total:

Username Points
John 150
Ife 125
Amir 98
Jane 50

If you treat the database as a state store

The engineer who treats their database as a state store simply makes a table to store the state of the leaderboard:

CREATE TABLE leaderboard (
  username VARCHAR UNIQUE NOT NULL PRIMARY KEY,
  points INTEGER NOT NULL
);

CREATE INDEX points_idx ON leaderboard (points DESC);
Enter fullscreen mode Exit fullscreen mode

They start updating this new table whenever a player earns points, and they run a simple query on it to get the top 20 players:

SELECT *
FROM leaderboard
ORDER BY points DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode

If you treat the database as a history store

The engineer who treats their database as a history store approaches this much differently. They do not store a leaderboard state. Rather, they store a match history. They create two tables like this:

CREATE TABLE "match" (
  match_id INTEGER UNIQUE NOT NULL PRIMARY KEY,
  start_time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  end_time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  map_name VARCHAR NOT NULL,
  game_mode VARCHAR NOT NULL
);

CREATE TABLE match_player (
  match_id INTEGER NOT NULL,
  username VARCHAR NOT NULL,
  points INTEGER NOT NULL,
  FOREIGN KEY (match_id) REFERENCES "match" (match_id)
);
Enter fullscreen mode Exit fullscreen mode

When a match happens, the relevant information is inserted into the match and match_player tables. Now, to calculate the leaderboard, the engineer queries the database like so to aggregate all of the match history and dynamically generate a leaderboard:

SELECT username, SUM(points)
FROM match_player
GROUP BY username
ORDER BY SUM(points) DESC
LIMIT 20;
Enter fullscreen mode Exit fullscreen mode

Rather than thinking "I need to store a leaderboard in the database", this engineer thinks "I need to store as much history and useful information in the database as I can. Oh and as an aside, that data needs to be at least sufficient to generate a leaderboard".

Perhaps you noticed that their leaderboard query does not even use the match table (it only uses match_player). Despite not needing game_mode and the other data there immediately, this engineer decided that it is an important part of the application's history, and therefore should be stored.

To make querying a little easier, this engineer might decide to create a leaderboard view, which is basically a saved query that can be queried like a table:

CREATE VIEW leaderboard AS (
  SELECT username, SUM(points)
  FROM match_player
  GROUP BY username
  ORDER BY SUM(points) DESC
  LIMIT 20
);

SELECT * FROM leaderboard;
Enter fullscreen mode Exit fullscreen mode

In come the bugs and new features

The bug - One day you discover that a bug in your game is being exploited, and some users have been able to score an unfair number of points. You look back at your git history and determine that the bug was introduced on April 4. You look back in your logs and find some match IDs where the bug was certainly exploited, and you also determine that if anyone was able to score more than 50 points in a single match, they must have been exploiting.

What do you do? If you only have a stateful leaderboard table, the situation is more difficult. Somehow you have to try to determine which users have exploited the bug. Then what? Delete them from the leaderboard entirely? Deduct some of their points? How many?

For the engineer who stored match history, the situation is easier. They can delete the bugged matches, and any that happened after April 4 where a player scored more than 50 points. Problem solved.

The new feature - One day you decide you want to create monthly leaderboards. If you went with the state-store architecture, you need to create at least one entirely new table, and you will not be able to show leaderboards for months that have already passed. If you went with the history-store architecture, you already have all of the data you need retroactively, and you just need to write a new query.

We can never know what the future will hold, but as we can see, the engineer who treats their database as a history store is much more equipped to face whatever requirements come their way!

Performance

One might point out that constantly aggregating over an entire table of match history to compute a leaderboard is not efficient. While this approach scales further than you might think, at some point it really does become too much. What should we do then?

Fortunately, we do not need to ditch the history-centric architecture. There are at least a couple of big guns we can bring out to greatly improve performance:

Materialized Views

Some databases, such as PostgreSQL, have first-class support for a concept referred to as materialized views.

A materialized view is a table whose contents are computed based on a query. If you are familiar with regular (non-materialized) views, the main difference is that a materialized view is stored on disk. The contents of a materialized view are only re-computed when requested, via running the REFRESH MATERIALIZED VIEW statement.

In the leaderboard example, the engineer who went with the history-store architecture can create a materialized view exactly like they created the regular view, just by adding the MATERIALIZED keyword:

CREATE MATERIALIZED VIEW leaderboard AS (
  SELECT username, SUM(points)
  FROM match_player
  GROUP BY username
  ORDER BY SUM(points) DESC
  LIMIT 20
);
Enter fullscreen mode Exit fullscreen mode

And they can query this leaderboard materialized view exactly like they would query a table. Even if the match_player table has a huge number of records, fetching the top 20 scorers will be fast, because the rows in the materialized view are pre-computed and stored on disk.

To recompute the leaderboard every hour, you could use pg_cron to set up a cron job in the database:

CREATE EXTENSION IF NOT EXISTS pg_cron;

SELECT cron.schedule(
  'leaderboard-refresh',
  '0 * * * *',
  'REFRESH MATERIALIZED VIEW CONCURRENTLY leaderboard'
);
Enter fullscreen mode Exit fullscreen mode

(Your PostgreSQL installation might already have pg_cron available, but if not, you would need to install it)

Then your leaderboard will automatically update every hour.

An obvious shortcoming here is that the leaderboard data can be up to one hour stale. Another is that, even if we only refresh the view once per hour, there is still some point at which that might become unsustainable in terms of performance (it would likely take many billions of matches before we get there, but this might be more relevant if you want to update the leaderboard say, every five seconds).

Despite these shortcomings, a materialized view is often a great solution. But if you need completely up-to-date data, or you have an un-utterably huge amount of data to aggregate over, the next technique might serve you better...

Incremental View Maintenance (IVM)

Thinking about it logically, when a player wins a match, do we really need to re-compute the entire leaderboard? Couldn't we just find that user's leaderboard record, and update their points?

Yes indeed we can! This approach is sometimes referred to as Incremental View Maintenance. It has been proposed as a first-class feature in PostgreSQL, but is not yet shipped. However we can implement it ourselves.

The naive way to do it is to create a leaderboard table, then whenever you insert a new record into the match_player table, you also have code to do an update on the leaderboard table to update the participants' points.

This can be an okay approach, but it requires vigilance to make sure your code always updates the leaderboard table when it updates the match_player table. A better way is to use triggers. You can write a trigger function which runs automatically whenever you update the match_player table. Whether you add new match records, delete records, or update records, your trigger function can update the leaderboard table appropriately. Best of all, the trigger function lives in the database itself, and the database takes care of running it whenever the match_player table changes. A trigger function for this scenario could look like this:

CREATE OR REPLACE FUNCTION update_leaderboard_function()
RETURNS TRIGGER AS $$
BEGIN
  IF (TG_OP = 'INSERT')
  THEN
    -- If a new match_player record is being inserted, we
    -- do an upsert on the leaderboard table to increase
    -- the points of the leaderboard record corresponding
    -- to that username in the new match record.
    -- IMPORTANT: The leaderboard table must have a unique
    -- index on the username column
    INSERT INTO leaderboard (username, points) 
    VALUES (NEW.username, NEW.points)
    ON CONFLICT (username)
    DO UPDATE SET points = leaderboard.points + NEW.points;
  ELSEIF (TG_OP = 'DELETE')
  THEN
    -- If a match_player record is being deleted, we update
    -- the leaderboard table and decrease the leaderboard
    -- points of the user whose match record was deleted.
    UPDATE leaderboard
    SET points = leaderboard.points - OLD.points
    WHERE leaderboard.username = OLD.username;
  ELSEIF (TG_OP = 'UPDATE')
  THEN
    -- If a match_player record is being updated, we calculate
    -- the difference in points between the new record
    -- and the old record and add that to the points of
    -- the user whose match record was updated (the difference
    -- might be negative, in which case the user's position
    -- on the leaderboard would go down)
    UPDATE leaderboard
    SET points = leaderboard.points + (NEW.points - OLD.points)
    WHERE leaderboard.username = OLD.username;
  END IF;

  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER update_leaderboard_trigger
AFTER INSERT OR UPDATE OR DELETE ON match_player
FOR EACH ROW
EXECUTE FUNCTION update_leaderboard_function();
Enter fullscreen mode Exit fullscreen mode

That might look big and scary if you haven't worked with triggers before, but what it does is pretty simple:

  1. If you INSERT into the match_player table, it adds points to the leaderboard table.
  2. If you DELETE from the match_player table, it deducts points from the leaderboard table.
  3. If you UPDATE the match_player table, it calculates the change in points (since the function has access to both the NEW and OLD version of the row) and applies the difference to the leaderboard table.

So this trigger automatically keeps your leaderboard up to date in real time! (as long as you do not TRUNCATE the match_player table, at least).

By using materialized views and/or IVM, you get more or less the best of both worlds. You get performance that's nearly as good as the simpler state-store architecture, while having the flexibility of the history-store architecture. These approaches do kind of let some statefulness creep back in, but it can be worth it for the performance gains, and the history still remains the ultimate source of truth.

Conclusion

Should a database store state, or should it store history and compute state from history? It turns out that the history-store architecture holds a lot of really compelling advantages. Personally, it is how I always design database schemas now, although I certainly do make compromises in some scenarios.

What do you think? Are you familiar with both of these architectures? Is there a third, fourth, or fifth that we should be talking about?

I took a few shortcuts to cut down on a line of code here and there, such as not always specifying primary keys, and using username as a unique ID. These should not be considered good practices.

. . . . . . . . . . . . . . . . . . . . . . .