TTL in YugabyteDB with batched deletes

Franck Pachot - Oct 24 '23 - - Dev Community

Many databases receive data that may be required for a specific retention and can be deleted later. Some NoSQL databases can set a Time-To-Live (TTL) for data, automatically deleting old rows. YugabyteDB provides TTL functionality for YCQL, the Cassandra-like API, and removes rows that have passed their TTL during SST File compaction. However, in the case of YSQL, the PostgreSQL-compatible API, this cannot be implemented similarly because SQL is more complex, with secondary indexes and foreign keys. Deleting them independently would break consistency. One solution is to use SQL declarative partitioning, where one partition is created for each month, for example. Dropping a table partition is fast and stays consistent as it drops the corresponding local index partitions. However, declarative partitioning has limitations, such as having only local indexes.

To keep it simple without premature optimization, the first solution that comes to mind is simply running a batch job that deletes the old rows. This blog post shows the solution.

I'll use pgbench where the pgbench_history table has a mtime column. To be more realistic, I'll add a primary key to the table, a generated UUID.

\! pgbench -iIdtpG

alter table pgbench_history add column id uuid ;
alter table pgbench_history alter column id set default gen_random_uuid();
update pgbench_history set id=default where id is null;
alter table pgbench_history add primary key(id);

Enter fullscreen mode Exit fullscreen mode

I run pgbench simple update (-N) with 50 connections, showing the progress every 15 seconds:

\! pgbench -c 50 -T86400 -nNP15 --max-tries 10 
Enter fullscreen mode Exit fullscreen mode

I'll purge rows from pgbench_history after one minute. To run small batches, I generate 512 queries on ranges of yb_hash_code(id). Because the primary key is id HASH, thise will provide fast access to a range of rows. I've underscored the parts you may configure for your use case in the query below. The generated queries are run with \gexec:


prepare ttl_purge(int) as
with ttl as (
 select (now() - interval '1 minute' )::timestamptz as retention
                 ------------------
) delete from pgbench_history
              ---------------
 where yb_hash_code(id) between $1  and $2
 and mtime < ( select retention from ttl)
     -----
;

select format(
 'execute ttl_purge(%s,%s)'
 , n , coalesce(lead(n)over(order by n),65536)-1
) from (
 select n*(65536/count(*)over()) n
 from generate_series( 0, 512-1 ) n
                          ---
) ranges

\gexec

Enter fullscreen mode Exit fullscreen mode

You can run this in a loop to delete the rows that passed their Time-To-Live.

I always look at the execution plan to check the time complexity:

yugabyte=> explain (analyze, costs off, summary off, dist)
           execute ttl_purge(65024,65535);

                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Delete on pgbench_history (actual time=7.429..7.429 rows=0 loops=1)
   CTE ttl
     ->  Result (actual time=0.002..0.002 rows=1 loops=1)
           Storage Table Read Requests: 1
           Storage Table Read Execution Time: 6.628 ms
   InitPlan 2 (returns $1)
     ->  CTE Scan on ttl (actual time=0.006..0.006 rows=1 loops=1)
   ->  Index Scan using pgbench_history_pkey on pgbench_history (actual time=6.739..7.049 rows=158 loops=1)
         Index Cond: ((yb_hash_code(id) >= 65024) AND (yb_hash_code(id) <= 65535))
         Filter: (mtime < $1)
         Rows Removed by Filter: 119
         Storage Table Write Requests: 158
(12 rows)
Enter fullscreen mode Exit fullscreen mode

If you have a range index on mtime, it can be used:

create index pgbench_history_mtime 
 on pgbench_history ( mtime desc, (yb_hash_code(id)) );

Enter fullscreen mode Exit fullscreen mode

Always check the execution plan to see if it is more efficient.

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Delete on pgbench_history (actual time=91.834..91.834 rows=0 loops=1)
   CTE ttl
     ->  Result (actual time=0.148..0.149 rows=1 loops=1)
   InitPlan 2 (returns $1)
     ->  CTE Scan on ttl (actual time=0.151..0.151 rows=1 loops=1)
   ->  Index Scan using pgbench_history_mtime on pgbench_history (actual time=91.250..91.314 rows=52 loops=1)
         Index Cond: ((mtime < $1) AND (yb_hash_code(id) >= $1) AND (yb_hash_code(id) <= $2))
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 0.799 ms
         Storage Index Read Requests: 1
         Storage Index Read Execution Time: 86.566 ms
         Storage Table Write Requests: 52
         Storage Index Write Requests: 52
(13 rows)

Enter fullscreen mode Exit fullscreen mode

There's no single recommendation about indexing and batch size because it depends on your workload, and the volume to purge. If the index is not used for another use case, maintaining it during inserts and deletes may not be a good idea.

With a range index on mtime you may also run a delete with a limit:

explain (analyze, costs off, summary off, dist)
/*+ Set( yb_bnl_batch_size 1024) */
with ttl as (
 select (now() - interval '1 minute' )::timestamptz as retention
                 ------------------
), delete_me as (
 select id from pgbench_history
                ---------------
 where mtime < ( select retention from ttl)
 limit 10000
) delete from pgbench_history
 where id in (select id from delete_me)
;

Enter fullscreen mode Exit fullscreen mode

The plan is long but is actually quite efficient:

                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on pgbench_history (actual time=911.500..911.500 rows=0 loops=1)
   CTE ttl
     ->  Result (actual time=0.334..0.335 rows=1 loops=1)
   CTE delete_me
     ->  Limit (actual time=29.006..41.260 rows=10000 loops=1)
           InitPlan 2 (returns $1)
             ->  CTE Scan on ttl (actual time=0.338..0.338 rows=1 loops=1)
           ->  Index Only Scan using pgbench_history_mtime on pgbench_history pgbench_history_1 (actual time=29.005..40.024 rows=10000 loops=1)
                 Index Cond: (mtime < $1)
                 Rows Removed by Index Recheck: 8638
                 Heap Fetches: 0
                 Storage Index Read Requests: 19
                 Storage Index Read Execution Time: 20.193 ms
   ->  YB Batched Nested Loop Join (actual time=68.657..792.157 rows=10000 loops=1)
         Join Filter: (pgbench_history.id = delete_me.id)
         ->  HashAggregate (actual time=57.119..59.902 rows=10000 loops=1)
               Group Key: delete_me.id
               ->  CTE Scan on delete_me (actual time=29.308..49.311 rows=10000 loops=1)
         ->  Index Scan using pgbench_history_pkey on pgbench_history (actual time=68.299..69.875 rows=1000 loops=10)
               Index Cond: (id = ANY (ARRAY[delete_me.id, $5, $6, ..., $1027]))
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 10.947 ms
               Storage Table Write Requests: 1000
               Storage Index Write Requests: 1000
               Storage Flush Requests: 1
               Storage Flush Execution Time: 53.377 ms
(26 rows)

Enter fullscreen mode Exit fullscreen mode

This reads a list of 10000 rows to delete, in 40 milliseconds, then, using Batched Nested Loop Join, deletes them in 10 loops.

Now, you have a few ideas and, after testing which one suits your workload, and validating the scalability of the execution plan, you can schedule it. If the resource usage is acceptable, it can be a Keep It Stupid Simple solution for Time-To-live, without the complexity and limitations of partitioning.

A final note: do not fear deletes too much.
If you come from other databases, like Oracle, you may think that a delete is the most expensive operation. Oracle has to copy the whole row to the undo segments, and update all indexes, including the primary key, to flag the deleted entries, that will be deleted after the transaction commits, by other DML touching the same blocks. PostgreSQL is more efficient as a deletes simply marks xmax but the space taken by the row and all index entries needs VACUUM to be reclaimed. A delete in YugabyteDB simply adds a tombstone to the key, and background compaction will remove all in the background. I've analyzed that in a previous blog post:

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .