Making your relational database a little more graphy

Arik - Jun 21 '19 - - Dev Community

When it comes to storing data, relational database had always been my go-to solution. They're reliable, they're mature, their Structured Query Language (or SQL) is quite powerful and if you design and treat them well, they generally provide great performance.

In relational databases, everything is a table. So for example if you want to store -- say a list of customers -- then you will probably create a customers table where these will be stored.

For each table in your database you would also define the pieces of information you want to store on each item in the table. These pieces of information are technically known as "fields".

Here's an example of a very simple customers table:

CREATE TABLE customers (
  id int,
  last_name varchar(255),
  first_name varchar(255),
  email_address varchar(255)
);
Enter fullscreen mode Exit fullscreen mode

If you also want to store -- say, the customer orders -- you can create an orders table where each item (or row) in the table would be an order with a link (or relation) to the customer who made the order.

CREATE TABLE orders (
  id int,
  customer_id int,
  order_date date,
  order_status varchar(10),
  order_total decimal
);
Enter fullscreen mode Exit fullscreen mode

That second field -- customer_id -- is the id of the customer in the customers table and is what creates the relation between items in the orders table to items in the customers table.

This is a great way to model your data. And in fact many many software projects are built in just this way.

More recently a new type of database started to become more mainstream: Graph Database. Graph databases take a slightly different approach to thinking/modeling your data.

Rather than thinking of your data as a series of tables, it encourages you to think about your data as a network of objects. So a typical example would be your Facebook/Linkedin network. You are connected to someone, that person is connected in turn to someone else and so on.

In graph databases there are really only two types of things:

  1. Vertex/Node - the fundamental unit which make up the elements of your graph. So a "Friend" in the Facebook example would be a node/vertex.

  2. Edge - acts as a link between nodes/vertices.

Here's an example of what this might look like:

So the yellow circles are nodes/vertices and the lines are the edges.

On the surface this might seem pretty similar to the relational database idea of relations. That's true so long as your network of objects isn't too "wide". Once you have many "degrees of relations" -- i.e. an object links to an object which links to another one and so on -- SQL queries can become somewhat unwieldy to deal with.

This was all interesting to me when I first learned about it, but because I didn't have any practical application for it at the time, I parked it at the back of my head.

And then one day I was faced with an interesting problem I hadn't had to deal with in the past: building a Content Management System. Let's leave aside for a second the build vs. not build a CMS discussion. This is not the point of this article. Let's just say that we had no choice but to build one.

This thing with CMSes though, is that unlike traditional customers/orders type databases, they typically have a much richer set of objects. So in our case, being in the television industry, we had to store films and episodes and series and videos and things like that. But we also wanted to drive the entire experience on the screen straight from our CMS.

That means that we wanted to give our editors the power to define how the homepage would look like for various types of users at a certain point in time. Or we wanted them to schedule certain collection of movies that would appear on the homepage (think Netflix type of experience). Or we wanted certain collection of movies to be driven automatically based on the user interaction with the website.

Looking over this data model it was clear that we had a fairly complex network of objects in our hands. It was at this point that I recalled that graph database thing I parked at the back of my head. This seems like a perfect fit for it.

I was sold on the idea of using a "graphy" approach but I wasn't sold about the idea of using anything but my venerable relational databases. None of the "real" graph databases seemed to have the level of maturity and robustness I learned to love in the many years of using a relational database.

So I wondered if there was a way for me to eat my cake and have it too. Or use a relational database but as a graph database. There were a couple of things I had to solve in order to make this happen:

  1. How do I store the data?
  2. How do I query the data without writing complicated and therefore brittle SQL queries.
  3. What about performance?

The first problem ended up being relatively easy to solve:

create table node (
  id          varchar(64) primary key,
  node_type   varchar(64) not null,
  created_at  timestamp not null default current_timestamp,
  modified_at timestamp,
  properties  jsonb not null,
  deleted     boolean not null default false
);

create table edge (
  id           varchar(64) primary key,
  edge_type    varchar(64) not null,
  created_at   timestamp not null default current_timestamp,
  modified_at  timestamp,
  from_node_id varchar(64) not null references node (id),
  to_node_id   varchar(64) not null references node (id),
  properties   jsonb not null,
  deleted      boolean not null default false
);
Enter fullscreen mode Exit fullscreen mode

Using just two tables I was able to get any conceivable object stored in the node table and be able to arbitrarily connect it to any other object using the edge table. Also, using PostgreSQL's jsonb table I did away with the need to alter my schema ever again to add custom properties/fields.

To solve the second problem, I wrote a pretty thin wrapper around my SQL library to allow me to query my graph of objects in a "graphy" way:

 // list all actors for the home alone movie
 Traversal<Node> t = g.nodes().type("Movie").eq("title","Home Alone").to("Actor");
Enter fullscreen mode Exit fullscreen mode

For the third problem, I had to use a little bit of caching magic on the application side. But since the data model is really simple (2 tables) it ended up being pretty straightforward to accomplish.

The solution I just described here ended up being used to power every aspect of our website and mobile apps: https://watch.smithsonianchannel.com

. . . . . . . . . . . . . . . . . . .