You may want to capture a workload in a database and replay it. Oracle has RAT (Real Application Testing), PostgreSQL has pgreplay (https://github.com/laurenz/pgreplay) maintained by Laurenz Albe.
YugabyteDB, the PostgreSQL-compatible open-source distributed SQL database, uses the postgres query layer and can use pgreplay as a target or source. This is the beauty of adding distributed capabilities to the PostgreSQL code base. Here is the quick test I've run to verify all works as expected.
I will run this example on a lab with only one node. I define the log destination as csvlog
, and set the log parameters as recommended by the documentation:
docker run -v /var/tmp/yb:/var/tmp \
-d --rm --name yb yugabytedb/yugabyte:2.15.2.0-b87 \
bash -c '
cat > tserver.flagfile <<CAT
--ysql_pg_conf_csv=\
log_destination=csvlog,\
log_statement=all,\
log_min_messages=error,\
log_min_error_statement=log,\
log_connections=on,\
log_disconnections=on
CAT
yugabyted start --tserver_flags=flagfile=tserver.flagfile
while true ; do tail -F /root/var/logs/tserver/postgres* ; done
'
I export /var/tmp
as an external volume. This is where I will copy the log for the workload I want to replay. I'll rotate the log with pg_rotate_logfile()
and get the file name with pg_current_logfile()
. The YugabyteDB query layer (YSQL) is PostgreSQL, you can use the same parameters and functions.
I initialize pgbench
tables (pgbench
is ysql_bench
in the YugabyteDB distribution, adding a few additional features, but you can also use the PostgresSQL pgbench
):
docker exec -i yb bash -c '
/home/yugabyte/postgres/bin/ysql_bench -i
'
Capture
I run a simple pgbench
, taking care to rotate the logfile before, and copy it into /var/tmp
at the end:
docker exec -i yb bash -c "
ysqlsh -c \"select pg_rotate_logfile()\"
/home/yugabyte/postgres/bin/ysql_bench -n
ysqlsh -tc \"
select format('cp %L /var/tmp/workload.csv',pg_current_logfile())
\" | sh -x
"
The csvlog
covering this pgbench
workload is /var/tmp/workload.csv
, ready to be replayed.
Usually, you replay it on a clone of the database (see PITR snapshot: an easy flashback / backtrack for application releases) or pg_dump
(ysql_dump
is the YugabyteDB version) from the initial state.
Here, I will simply reinit with pgbench -i
. I also count and sum the rows to verify the replay (pgbench
uses random amounts):
docker exec -i yb bash -c '
ysqlsh -c "select count(*), sum(abalance) from ysql_bench_accounts;"
/home/yugabyte/postgres/bin/ysql_bench -i
ysqlsh -c "select count(*), sum(abalance) from ysql_bench_accounts;"
'
Replay
I build a docker image with pgreplay
:
git clone https://github.com/laurenz/pgreplay.git
cd pgreplay
docker build -t laurenz/pgreplay -f Dockerfile .
It can be interesting to look at the csvlog
. I'm using VisiData for that. You can install it with pip3 install visidata
or use a docker image with it - mounting the /var/tmp/yb
volume to it:
docker run --rm -it -w /logs -v /var/tmp/yb:/logs \
jauderho/visidata:latest \
workload.csv
There's no header but the format is described in https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG
I run pgreplay
in a container from the image I've built above, accessing the volume to read the csvlog
with -v
and the database to run the statements with --link
:
docker run --rm -ti -w /logs --link yb:yb \
-v /var/tmp/yb:/logs \
laurenz/pgreplay pgreplay -c -h yb -p 5433 \
/logs/workload.csv
I check that I have the same values to confirm the replay:
docker exec -i yb bash -c '
ysqlsh -c "select count(*), sum(abalance) from ysql_bench_accounts;"
'
This blog was just to verify that it works with YugabyteDB. Check pgreplay documentation for more, all works the same in YugabyteDB. If you want to capture a workload from connections on multiple database nodes, each one will have their logfile. You can merge them. The Session ID (the 6th field in the csvlog
built from start time and backend pid will probably not collide with another one, but you can make it unique by concatenating a node number if you want). The replay connects to one node, but though a HA proxy the connections can be distributed to multiple ones. All depends on what you want to capture and wh you want to replay. Capturing from PostgreSQL and replaying to YugabyteDB is also a good way to check that all works the same without performance regressions.