Update: the issue described below has been fixed and the DuckDB postgres_scanner can now push down the filters. Here is how I start DuckDB to connect to YugabyteDB database yb_demo_northwind on my.host:
PGDATABASE=yb_demo_northwind duckdb -cmd"
set pg_debug_show_queries to true;
set pg_experimental_filter_pushdown to true;
set pg_use_ctid_scan to false;
attach 'user=yugabyte host=my.host port=5433' AS db (TYPE POSTGRES, READ_ONLY);
"
With pg_debug_show_queries set, it shows the queries sent to YugabyteDB, with a WHERE clause that YugabyteDB will push down to the storage:
That's all. What follows below is the old blog post.
DuckDB is an open-source in-process SQL OLAP database, with the possibility to query PostgreSQL Tables. YugabyteDB is an open-source distributed SQL database optimized for OLTP and is PostgreSQL-compatible. You can see the temptation to marry them and be able to run some OLAP queries on top of the scalable OLTP.
If you have tried to run DuckDB on top of YugabyteDB, you may have encountered the following error:
ERROR: System column "ctid" is not supported yet
The reason is that DuckDB uses the PostgreSQL ctid to parallelize the scan of the source table. This is described in the DuckDb Postgres Scanner description
YugabyteDB is PostgreSQL-compatible, but with a different, distributed, storage that shards tables and indexes, and stores tuples in LSM-Tree where there's no equivalent of the PostgreSQL heap table's ctid.
Reading in parallel could use ranges on the primary key, or yb_hash_code(). For the moment, my simple workaround is not trying to parallelize the reads at this level, which is really easy because in YugabyteDB pgclass.relpages is zero and then DuckDB should not try start multiple threads.
The postgresscanner extension allows DuckDB to directly read data from a running Postgres instance. The data can be queried directly from the underlying Postgres tables, or read into DuckDB tables.
Usage
To make a Postgres database accessible to DuckDB, use the POSTGRES_ATTACH command:
CALL POSTGRES_ATTACH('');
POSTGRES_ATTACH takes a single required string parameter, which is the libpq connection string. For example you can pass 'dbname=postgresscanner' to select a different database name. In the simplest case, the parameter is just ''. There are three additional named parameters:
source_schema the name of a non-standard schema name in Postgres to get tables from. Default is public.
sink_schema the schema name in DuckDB to create views. Default is main.
overwrite whether we should overwrite existing views in the target schema, default is false.
filter_pushdown whether filter predicates that DuckDB derives from the query should…
I changed a single thing to be able to run it: I test the range of ctid to read (task_min, task_max) and just replace the ctid BETWEEN '(%d,0)'::tid AND '(%d,0)'::tid with 1=1 when ( task_min == 0 && task_max == POSTGRES_TID_MAX ).
Gitpod
I've also added a gitpod.yml to build it automatically in Gitpod:
tasks:-init:sudo apt-get install -y cmake curl bison && make
Start YugabyteDB
To test it, I start a YugabyteDB container with the Northwind demo tables:
dockerrun-d--name yb -p 5433:5433 -p 15433:15433 \yugabytedb/yugabytebash-c'
yugabyted start
until yugabyted demo connect <<'SQL'
\x
select *
from pg_stat_activity
where pid!=pg_backend_pid() and datid is not null;
\watch 5
SQL
do sleep 1 ; done 2>/dev/null
'
The port 5433 is the PostgreSQL endpoint that I will use to connect to the yugabyte database with yugabyte user.
The port 15433 is the GUI where you can see the running queries.
This reads the base tables and executes the SQL statement on them:
I can check the queries that have been run on YugabyteDB
The most important here is the COPY in FORMAT BINARY. You can see how far the YugabyteDB compatibility goes: the binary format works. Of course, the idea of the PostgreSQL scanner is to query all tables in parallel by chunks, and this is where CTID is used. With my workaround, full tables are read with one COPY. I think that if there's a need for optimization, this will be done in YugabyteDB parallelism.
It still works with vanilla PostgreSQL
To test that this patch still works with normal PostgreSQL, I test it with the RNA Central database: