By default, YugabyteDB COPY does intermediate commits every 20000 rows:
yugabyte=# show yb_default_copy_from_rows_per_transaction;
yb_default_copy_from_rows_per_transaction
-------------------------------------------
20000
(1 row)
yugabyte=# show yb_disable_transactional_writes;
yb_disable_transactional_writes
---------------------------------
on
(1 row)
Let's take an example with the following table:
yugabyte=# create table loaded_data ( id bigserial, data text );
CREATE TABLE
I set statement_timeout
to 5 seconds to simulate a failure before the end, and load some rows:
yugabyte=# set statement_timeout=5000;
SET
yugabyte=# copy loaded_data(data) from program 'base64 -w 100 /dev/urandom | head -c $(( 1024 * 1024 * 1024 ))';
ERROR: 57014: canceling statement due to statement timeout
yugabyte=# select count(*) from loaded_data;
count
--------
120000
(1 row)
The statement has failed but, because of intermediate commits, some rows are loaded, a multiple of 20000, have been visible during the load, and have to be cleaned up manually:
yugabyte=# delete from loaded_data;
DELETE 120000
To be transactional, you can disable intermediate commits:
yugabyte=# set yb_default_copy_from_rows_per_transaction=0;
SET
yugabyte=# copy loaded_data(data) from program 'base64 -w 100 /dev/urandom | head -c $(( 1024 * 1024 * 1024 ))';
ERROR: 57014: canceling statement due to statement timeout
yugabyte=# select count(*) from loaded_data;
count
-------
0
(1 row)
However, this is less efficient, allocating more memory and large inserts into IntentsDB
Another solution is to keep intermediate commits but add a column that will set the visibility at the end when the table is queried though a view. It will use a user-define batch number:
yugabyte=# set app.batchid=1;
SET
yugabyte=# alter table loaded_data add batch bigint default current_setting('app.batchid')::bigint;
ALTER TABLE
A table will store the visible batches, and a view will filter on it:
yugabyte=# create table loaded_batches(batch bigint primary key default current_setting('app.batchid')::bigint );
CREATE TABLE
yugabyte=# create view loaded_view as select * from loaded_data where batch in (select batch from loaded_batches);
CREATE VIEW
I set the batch ID:
yugabyte=# select set_config('app.batchid',extract (epoch from now())::bigint::text,false);
set_config
------------
1710884655
(1 row)
When I COPY, the batch ID will be set to this value. I can use intermediate commits, or even non-transactional inserts which will bypasse the provisional records and make each row visible as it is inserted:
yugabyte=# set yb_disable_transactional_writes=on;
SET
yugabyte=# copy loaded_data(data) from program 'base64 -w 100 /dev/urandom | head -c $(( 1024 * 1024 * 1024 ))';
ERROR: 57014: canceling statement due to statement timeout
The rows are visible from the table but not from the view:
yugabyte=# select count(*) from loaded_data;
count
-------
27648
(1 row)
yugabyte=# set yb_bnl_batch_size=1024;
SET
yugabyte=# select count(*) from loaded_view;
count
-------
0
(1 row)
The execution plan is efficient with Batched Nested Loops.
In case of failure as above, the rows can be cleaned-up later:
yugabyte=# delete from loaded_data where batch not in (select batch from loaded_batches);
DELETE 27648
When an insert is sucessful, the batch ID is inserted at the end to make it visible:
yugabyte=# select set_config('app.batchid',extract (epoch from now())::bigint::text,false);
set_config
------------
1710885534
(1 row)
yugabyte=# set statement_timeout=0;
SET
yugabyte=# copy loaded_data(data) from program 'base64 -w 100 /dev/urandom | head -c $(( 1024 * 1024 * 1024 ))';
COPY 10631108
yugabyte=# insert into loaded_batches select;
INSERT 0 1
All new rows have been made atomically visible from the view:
yugabyte=# select count(*) from loaded_data;
count
----------
10631108
(1 row)
yugabyte=# select count(*) from loaded_view;
count
----------
10631108
(1 row)
This is a trade-off: faster load with non-transactional inserts, but the overhead of filtering on queries.