Making Common Table Expression SQL More Railsy

Michael Chaney - Jun 30 - - Dev Community

In our last episode, I talked about using the common table expression syntax to make a query run much faster and allow me to insert and query the new records at the same time.

Starting in Rails 7.1, it's now possible to add common table expressions to ActiveRecord relations. I can rewrite the query to use some of the good parts of ActiveRecord and hopefully make the code a little more readable.

As a reminder, I'm working with these three tables:

class RawRoyaltyRecord < ApplicationRecord
  belongs_to :royalty_input_batch_partial
  belongs_to :track
  has_many :raw_royalty_records_sales
end

class RoyaltyInputBatchPartial < ApplicationRecord
  belongs_to :pro
  has_many :raw_royalty_records
end

class RawRoyaltyRecordsSale < ApplicationRecord
  belongs_to :raw_royalty_record
  belongs_to :sale
end
Enter fullscreen mode Exit fullscreen mode

Ultimately, this is the big SQL query that inserts the new records and returns them:

WITH eligible_records AS ( -- This gets a list of existing track_id, customer, and sale_ids
    SELECT DISTINCT rrr.track_id, LOWER(rrr.customer) AS lower_customer, rrrs.sale_id
    FROM raw_royalty_records rrr
    INNER JOIN royalty_input_batch_partials ribp ON ribp.id = rrr.royalty_input_batch_partial_id
    INNER JOIN raw_royalty_records_sales rrrs ON rrrs.raw_royalty_record_id = rrr.id
    WHERE ribp.pro_id = 960
      AND rrr.track_id IS NOT NULL
),
inserted_records AS (
    INSERT INTO raw_royalty_records_sales (raw_royalty_record_id, sale_id, created_at, updated_at)
    SELECT DISTINCT rr.id, er.sale_id, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
    FROM raw_royalty_records rr
      INNER JOIN royalty_input_batch_partials ribp ON rr.royalty_input_batch_partial_id = ribp.id
      LEFT OUTER JOIN raw_royalty_records_sales rrs ON rrs.raw_royalty_record_id = rr.id
      INNER JOIN eligible_records er ON er.track_id = rr.track_id and er.lower_customer = LOWER(rr.customer)
    WHERE ribp.pro_id = 960
      AND rrs.id IS NULL
      AND rr.track_id IS NOT NULL
    RETURNING *
)
SELECT ir.raw_royalty_record_id, ir.sale_id, rrr.track_id, rrr.customer
  FROM inserted_records ir INNER JOIN raw_royalty_records rrr
         ON rrr.id=ir.raw_royalty_record_id
Enter fullscreen mode Exit fullscreen mode

The only piece of information that I need to pass in there is the pro_id, which is "960" in the example. It would be easy to use to use select_all in my Ruby code to accomplish this, but using ActiveRecord will be fun.

There is, however, one interesting limitation with ActiveRecord. I have to somehow base this entire query on a table if I want to be able to use the various methods to build the query. It's not a problem here because our actual query pulls from raw_royalty_records. I just need to add a join with the CTE inserted_records from above.

One other issue is that the CTEs need to be somehow built. The first one is pretty straightforward. But the second one is a weird insert and all that, I'm going to be creating some hand-coded SQL there or something.

Putting together a query like this requires starting from the end. Here's the final query:

SELECT ir.raw_royalty_record_id, ir.sale_id, rrr.track_id, rrr.customer
  FROM inserted_records ir INNER JOIN raw_royalty_records rrr
         ON rrr.id=ir.raw_royalty_record_id
Enter fullscreen mode Exit fullscreen mode

Let's change that around so that we're pulling from raw_royalty_records:

SELECT rrr.id, ir.sale_id, rrr.track_id, rrr.customer
FROM raw_royalty_records rrr INNER JOIN inserted_records it
  ON rrr.id=ir.raw_royalty_record_id
Enter fullscreen mode Exit fullscreen mode

Now, we can use ActiveRecord to build this:

query = RawRoyaltyRecord.joins("INNER JOIN inserted_records on inserted_records.raw_royalty_record_id = raw_royalty_records.id")
Enter fullscreen mode Exit fullscreen mode

Of course, that's invalid as-is because "inserted_records" isn't a table. But if I look at the generated SQL using the to_sql method, we're on the right track:

SELECT "raw_royalty_records".* FROM "raw_royalty_records" INNER JOIN inserted_records on inserted_records.raw_royalty_record_id = raw_royalty_records.id
Enter fullscreen mode Exit fullscreen mode

So, we need to add our CTEs to this query, and we do that using the with method. We can add them both at the same time, but I'm going to do one at a time because they're different and we need to look at those differences.

Here's the first CTE:

WITH eligible_records AS (
    SELECT DISTINCT rrr.track_id, LOWER(rrr.customer) AS lower_customer, rrrs.sale_id
    FROM raw_royalty_records rrr
    INNER JOIN royalty_input_batch_partials ribp ON ribp.id = rrr.royalty_input_batch_partial_id
    INNER JOIN raw_royalty_records_sales rrrs ON rrrs.raw_royalty_record_id = rrr.id
    WHERE ribp.pro_id = 960
      AND rrr.track_id IS NOT NULL
)
Enter fullscreen mode Exit fullscreen mode

Turns out, we can use standard ActiveRecord to put this together.

   eligible_records_query =
     RawRoyaltyRecord
       .joins(:royalty_input_batch_partial)
       .where("royalty_input_batch_partials.pro_id": 960)
       .joins(:raw_royalty_records_sales)
       .where("raw_royalty_records.track_id is not null")
       .select(:track_id, "LOWER(raw_royalty_records.customer) AS lower_customer", "raw_royalty_records_sales.sale_id")
       .distinct
Enter fullscreen mode Exit fullscreen mode

With that, we can add it as a CTE to our query:

  query = query.with(eligible_records: eligible_records_query)
Enter fullscreen mode Exit fullscreen mode

It's actually really cool to be able to see this work in parts, and common table expressions built this way allow us to easily break the query up and test the individual parts.

The query above stands on its own. But how can we test it as a CTE?

Our original giant query had two CTEs as you may recall, with the second one being the insert. But we can pull it apart and use the select part of it to see this query working as a CTE.

But, first, let's make it really simple. Here's the simplest way you can test a CTE:

WITH eligible_records AS ( -- This gets a list of existing track_id, customer, and sale_ids
    SELECT DISTINCT rrr.track_id, LOWER(rrr.customer) AS lower_customer, rrrs.sale_id
    FROM raw_royalty_records rrr
    INNER JOIN royalty_input_batch_partials ribp ON ribp.id = rrr.royalty_input_batch_partial_id
    INNER JOIN raw_royalty_records_sales rrrs ON rrrs.raw_royalty_record_id = rrr.id
    WHERE ribp.pro_id = 960
      AND rrr.track_id IS NOT NULL
)
SELECT * FROM eligible_records;
Enter fullscreen mode Exit fullscreen mode

How do we rubyize this? Not easily, because eligible_records doesn't exist outside of this query. But let's revisit the original giant query, specifically the second CTE:

inserted_records AS (
    INSERT INTO raw_royalty_records_sales (raw_royalty_record_id, sale_id, created_at, updated_at)
    SELECT DISTINCT rr.id, er.sale_id, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
    FROM raw_royalty_records rr
      INNER JOIN royalty_input_batch_partials ribp ON rr.royalty_input_batch_partial_id = ribp.id
      LEFT OUTER JOIN raw_royalty_records_sales rrs ON rrs.raw_royalty_record_id = rr.id
      INNER JOIN eligible_records er ON er.track_id = rr.track_id and er.lower_customer = LOWER(rr.customer)
    WHERE ribp.pro_id = 960
      AND rrs.id IS NULL
      AND rr.track_id IS NOT NULL
    RETURNING *
)
Enter fullscreen mode Exit fullscreen mode

Let's chop out the select statement and use it as the primary query to find some raw_royalty_records:

WITH eligible_records AS ( -- This gets a list of existing track_id, customer, and sale_ids
    SELECT DISTINCT rrr.track_id, LOWER(rrr.customer) AS lower_customer, rrrs.sale_id
    FROM raw_royalty_records rrr
    INNER JOIN royalty_input_batch_partials ribp ON ribp.id = rrr.royalty_input_batch_partial_id
    INNER JOIN raw_royalty_records_sales rrrs ON rrrs.raw_royalty_record_id = rrr.id
    WHERE ribp.pro_id = 960
      AND rrr.track_id IS NOT NULL
)
SELECT DISTINCT rrr.id, er.sale_id, rrr.customer
    FROM raw_royalty_records rrr
      INNER JOIN royalty_input_batch_partials ribp ON rrr.royalty_input_batch_partial_id = ribp.id
      LEFT OUTER JOIN raw_royalty_records_sales rrs ON rrs.raw_royalty_record_id = rrr.id
      INNER JOIN eligible_records er ON er.track_id = rrr.track_id and er.lower_customer = LOWER(rrr.customer)
    WHERE ribp.pro_id = 960
      AND rrs.id IS NULL
      AND rrr.track_id IS NOT NULL
Enter fullscreen mode Exit fullscreen mode

That's a simplified version of the original giant query which will give us a list of raw_royalty_record ids along with sales ids. While we're at it, we'll grab the customer as well so we can see what's matching up.

Now, with the query in this shape, it's time to Rubyize it. We already have the CTE Rubyized, let's work on the main query.

With ActiveRecord, we have to start from a model. For this, we're pulling in ids from two different tables, but raw_royalty_records is the primary. I'll base this on RawRoyaltyRecord:

# See above for "eligible_records_query"
query = RawRoyaltyRecord
.joins(:royalty_input_batch_partial)
.where("royalty_input_batch_partials.pro_id": 960)
.joins("INNER JOIN eligible_records
  ON eligible_records.track_id = raw_royalty_records.track_id
    AND eligible_records.lower_customer = LOWER(raw_royalty_records.customer)")
.left_joins(:raw_royalty_records_sales)
.where("raw_royalty_records_sales.id is null")
.where("raw_royalty_records.track_id is not null")
.with(eligible_records: eligible_records_query)
.select("raw_royalty_records.id", "eligible_records.sale_id", "raw_royalty_records.customer")
.distinct
Enter fullscreen mode Exit fullscreen mode

At this point, there's an argument that can be made that the Ruby code is more complicated than the SQL. And, in a way, it is. But it also is safe since we're bringing in a possibly dangerous parameter (the pro_id of "960) and still takes care of a lot of the grunt work of joining tables and all that. I think that it's easier to read. But those are opinions, nothing more.

Back to the original issue - How do I automatically add these records in a CTE and then get them back out? Is it worth the trouble?

Let's figure out how to do it first. The issue that we have is that the main query above needs to be changed:

  1. remove "customer" as it won't be needed
  2. add created_at and updated_at
  3. remove the now-extraneous with as I'll keep them both at top level

At that point, it'll be ready to insert into raw_royalty_records_sales. The problem is that it's difficult to turn the select into an insert using standard ActiveRecord.

Again, here's the part of the query that performs the insert and select as a common table expression:

    INSERT INTO raw_royalty_records_sales (raw_royalty_record_id, sale_id, created_at, updated_at)
    SELECT DISTINCT rr.id, er.sale_id, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
    FROM raw_royalty_records rr
      INNER JOIN royalty_input_batch_partials ribp ON rr.royalty_input_batch_partial_id = ribp.id
      LEFT OUTER JOIN raw_royalty_records_sales rrs ON rrs.raw_royalty_record_id = rr.id
      INNER JOIN eligible_records er ON er.track_id = rr.track_id and er.lower_customer = LOWER(rr.customer)
    WHERE ribp.pro_id = 960
      AND rrs.id IS NULL
      AND rr.track_id IS NOT NULL
    RETURNING *
Enter fullscreen mode Exit fullscreen mode

This is basically the same select as above, only it's in the CTE. We can make this work.

Let's start with the final query. Ultimately, I'm going to pull in this information, which I'm just using to log the transaction:

SELECT rrr.id, ir.sale_id, rrr.track_id, rrr.customer
FROM raw_royalty_records rrr INNER JOIN inserted_records it
  ON rrr.id=ir.raw_royalty_record_id
Enter fullscreen mode Exit fullscreen mode

This is a denormalized list of sales, tracks, and customers from the raw_royalty_records and friends tables (it's possible for one raw_royalty_record to be tied to multiple sales because the reporting doesn't give enough information sometimes).

select_for_insert_query = RawRoyaltyRecord
.joins(:royalty_input_batch_partial)
.where("royalty_input_batch_partials.pro_id": 960)
.joins("INNER JOIN eligible_records
  ON eligible_records.track_id = raw_royalty_records.track_id
    AND eligible_records.lower_customer = LOWER(raw_royalty_records.customer)")
.left_joins(:raw_royalty_records_sales)
.where("raw_royalty_records_sales.id is null")
.where("raw_royalty_records.track_id is not null")
.select("raw_royalty_records.id AS raw_royalty_record_id", "eligible_records.sale_id AS sale_id", "CURRENT_TIMESTAMP AS created_at", "CURRENT_TIMESTAMP AS updated_at")
.distinct
Enter fullscreen mode Exit fullscreen mode

This is maybe a little ugly, but I can take the SQL from that, wrap it with INSERT INTO...RETURNING *, and I've got a CTE.

One note here - the CTE cannot just be a string. In this case, we'll just have to wrap the string in Arel.sql:

insert_query = Arel.sql(
  "INSERT INTO raw_royalty_records_sales (raw_royalty_record_id, sale_id, created_at, updated_at)\n" +
  select_for_insert_query.to_sql +
  "\nRETURNING *"
)
Enter fullscreen mode Exit fullscreen mode

With that, I can add both CTEs to a basic query. I'll call that one "inserted_records".

query = RawRoyaltyRecord.joins("
  INNER JOIN inserted_records
    ON inserted_records.raw_royalty_record_id = raw_royalty_records.id
  ")
      .select("raw_royalty_records.id",
              "inserted_records.sale_id",
              "raw_royalty_records.track_id",
              "raw_royalty_records.customer")
      .distinct
      .with(
         eligible_records: eligible_records_query,
         inserted_records: insert_query
      )
Enter fullscreen mode Exit fullscreen mode

And with that, this monstrosity works.

WITH "eligible_records" AS (SELECT DISTINCT "raw_royalty_records"."track_id", LOWER(raw_royalty_records.customer) AS lower_customer, "raw_royalty_records_sales"."sale_id" FROM "raw_royalty_records" INNER JOIN "royalty_input_batch_partials" ON "royalty_input_batch_partials"."id" = "raw_royalty_records"."royalty_input_batch_partial_id" INNER JOIN "raw_royalty_records_sales" ON "raw_royalty_records_sales"."raw_royalty_record_id" = "raw_royalty_records"."id" WHERE "royalty_input_batch_partials"."pro_id" = 960 AND (raw_royalty_records.track_id is not null)), "inserted_records" AS (INSERT INTO raw_royalty_records_sales (raw_royalty_record_id, sale_id, created_at, updated_at)
WITH "eligible_records" AS (SELECT DISTINCT "raw_royalty_records"."track_id", LOWER(raw_royalty_records.customer) AS lower_customer, "raw_royalty_records_sales"."sale_id" FROM "raw_royalty_records" INNER JOIN "royalty_input_batch_partials" ON "royalty_input_batch_partials"."id" = "raw_royalty_records"."royalty_input_batch_partial_id" INNER JOIN "raw_royalty_records_sales" ON "raw_royalty_records_sales"."raw_royalty_record_id" = "raw_royalty_records"."id" WHERE "royalty_input_batch_partials"."pro_id" = 960 AND (raw_royalty_records.track_id is not null)) SELECT DISTINCT raw_royalty_records.id AS raw_royalty_record_id, eligible_records.sale_id AS sale_id, CURRENT_TIMESTAMP AS created_at, CURRENT_TIMESTAMP AS updated_at FROM "raw_royalty_records" INNER JOIN "royalty_input_batch_partials" ON "royalty_input_batch_partials"."id" = "raw_royalty_records"."royalty_input_batch_partial_id" LEFT OUTER JOIN "raw_royalty_records_sales" ON "raw_royalty_records_sales"."raw_royalty_record_id" = "raw_royalty_records"."id" INNER JOIN eligible_records
  ON eligible_records.track_id = raw_royalty_records.track_id
    AND eligible_records.lower_customer = LOWER(raw_royalty_records.customer) WHERE "royalty_input_batch_partials"."pro_id" = 960 AND (raw_royalty_records_sales.id is null) AND (raw_royalty_records.track_id is not null)
RETURNING *) SELECT DISTINCT "raw_royalty_records"."id", "inserted_records"."sale_id", "raw_royalty_records"."track_id", "raw_royalty_records"."customer" FROM "raw_royalty_records" INNER JOIN inserted_records
    ON inserted_records.raw_royalty_record_id = raw_royalty_records.id
Enter fullscreen mode Exit fullscreen mode

But is it worth it? I can short-cut this and return the raw_royalty_records_sales, or I can do it the way I've done it above. All I want the records for is to log them, and my preference is to log them in the format:

track id (and maybe title)
customer
list of new raw_royalty_record ids
list of sale ids (and maybe the customer info attached to the sales for confirmation)

Given the denormalized nature of the data, I'm still going to have to do some magic in Ruby to get this log format. I could simply use pluck to pull the somewhat raw data out and handle it myself.

The cool thing about using the railsy way of building these queries is that I can swap raw_royalty_records out for raw_royalty_records_sales easily:

query = RawRoyaltyRecordsSale
      .from("inserted_records")
      .joins("INNER JOIN raw_royalty_records ON raw_royalty_records.id=inserted_records.raw_royalty_record_id")
      .select("inserted_records.id",
              "inserted_records.raw_royalty_record_id",
              "inserted_records.sale_id",
              "raw_royalty_records.track_id",
              "raw_royalty_records.customer")
      .distinct
      .with(
         eligible_records: eligible_records_query,
         inserted_records: insert_query
      )
Enter fullscreen mode Exit fullscreen mode

One issue with this is that - regardless of how I do it - find_each is not going to work here. So I have to be realistic about the amount of data that I'm loading in. With the PRO that I'm testing with I have around 4000 created records. Not too many, definitely more than I want to risk an N+1 with. Plus, it's difficult to say how this may progress in the future as the database grows.

Anyway, that's how to use a common table expression to insert new records while simultaneously instantiating objects based on them.

. . .