Please note that MangoDB is now called FerretDB and has moved to:
There is no doubt that one great thing about MongoDB is the API. Many developers loves it. I'm a big fan of SQL, but we need to listen to all users, and they have use-cases for it. Another thing we expect from NoSQL is the scalability.
In this example we have both, open-source, ACID and resilient, with
- MangoDB proxy between MongoDB and PostgreSQL protocols
- YugabyteDB with its PostgreSQL compatible API on top of the fully consistent distributed storage
This MangoDB project is new, and when looking for it you will see Google still trying to tell you that you made a typo. So this post is a first quick test to validate how it works. Things will probably change with contributions to https://github.com/MangoDB-io/MangoDB
MangoDB example
I'll take de demo application from:
git clone https://github.com/MangoDB-io/example
cd example
and install it on my laptop because I became recently a big fan of Docker on Windows.
The docker-compose.yml
starts a PostgreSQL database and I'll replace that with a YugabyteDB one. This is easy.
postgres service
First, I remove the postgres
service. And add the yb-master
and yb-tserver
ones from YugabyteDB docker-compose.yml
I didn't change any parameters, and I keep the defaults:
- host name is yb-tserver (in a distributed database you can connect to any server)
- port is 5433 (this is our default, rather than the 54322 default for PostgreSQL)
- user is yugabyte (and password is the same)
- database is yugabyte (you can create a dedicated one of course)
setup service
I change this in the setup service command, which starts just to create the schema for the application (which is a "todo" list in this example):
psql -h yb-tserver -p 5433 -U yugabyte -d yugabyte -c 'CREATE SCHEMA IF NOT EXISTS todo'
and the docker-compose dependecy is set to yb-tserver
instead of postgres
mangodb service
The application also needs the connection string. We use the PostgreSQL driver as it is the same protocol, and change the host, port and database name only:
depends_on:
- 'postgres'
entrypoint: ["sh", "-c", "psql -h postgres -U postgres -d mangodb -c 'CREATE SCHEMA IF NOT EXISTS todo'"]
Start the application
Here is my final docker-compose.yml
:
version: "3"
volumes:
yb-master-data-1:
yb-tserver-data-1:
services:
client:
build: ./app/client
hostname: 'todo_client'
container_name: 'todo_client'
stdin_open: true
api:
build: ./app/api
hostname: 'todo_api'
container_name: 'todo_api'
nginx:
image: nginx
hostname: 'nginx'
container_name: 'nginx'
ports:
- 8888:8888
volumes:
- ./nginx.conf:/etc/nginx/conf.d/default.conf
yb-master:
image: yugabytedb/yugabyte:latest
container_name: yb-master-n1
volumes:
- yb-master-data-1:/mnt/master
command: [ "/home/yugabyte/bin/yb-master",
"--fs_data_dirs=/mnt/master",
"--master_addresses=yb-master-n1:7100",
"--rpc_bind_addresses=yb-master-n1:7100",
"--replication_factor=1"]
ports:
- "7000:7000"
environment:
SERVICE_7000_NAME: yb-master
yb-tserver:
image: yugabytedb/yugabyte:latest
container_name: yb-tserver-n1
volumes:
- yb-tserver-data-1:/mnt/tserver
command: [ "/home/yugabyte/bin/yb-tserver",
"--fs_data_dirs=/mnt/tserver",
"--start_pgsql_proxy",
"--rpc_bind_addresses=yb-tserver-n1:9100",
"--tserver_master_addrs=yb-master-n1:7100"]
ports:
- "9042:9042"
- "5433:5433"
- "9000:9000"
environment:
SERVICE_5433_NAME: ysql
SERVICE_9042_NAME: ycql
SERVICE_6379_NAME: yedis
SERVICE_9000_NAME: yb-tserver
depends_on:
- yb-master
mangodb:
image: ghcr.io/mangodb-io/mangodb:latest
hostname: 'mangodb'
container_name: 'mangodb'
command:
[
'-listen-addr=:27017',
'-postgresql-url=postgres://yugabyte@yb-tserver:5433/yugabyte',
]
ports:
- 27017:27017
setup:
image: postgres:14.0
hostname: 'setup'
container_name: 'setup'
restart: 'on-failure'
depends_on:
- 'yb-tserver'
entrypoint: ["sh", "-c", "psql -h yb-tserver -p 5433 -U yugabyte -d yugabyte -c 'CREATE SCHEMA IF NOT EXISTS todo'"]
I pull the images, create the containers and run the services:
docker-compose up
Here is the start from command line:
Visible in Docker Desktop:
The YugabyteDB console is available on: http://localhost:7000
The example application:
The application is accessible on http://localhost:8888/ and we can add items to the To-Do list:
This calls the db.collection.insertOne() MongoDB function:
Check the database
This MongoDB call is translated to SQL by the MangoDB proxy. The collection is a table:
$ psql postgres://yugabyte:yugabyte@localhost:5433/yugabyte
psql (12.7, server 11.2-YB-2.9.0.0-b0)
Type "help" for help.
yugabyte=# \dn
List of schemas
Name | Owner
--------+----------
public | postgres
todo | yugabyte
(2 rows)
yugabyte=# set schema 'todo';
SET
yugabyte=# \d
List of relations
Schema | Name | Type | Owner
--------+-------+-------+----------
todo | tasks | table | yugabyte
(1 row)
yugabyte=# \d+ todo.tasks
Table "todo.tasks"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------+-----------+----------+---------+----------+--------------+-------------
_jsonb | jsonb | | | | extended | |
yugabyte=# select * from todo.tasks;
_jsonb
-----------------------------------------------------------------------------------------------------------------------------------------------
{"$k": ["description", "completed", "_id"], "_id": {"$o": "6182627a17462641b80439d4"}, "completed": false, "description": "Play 😎"}
{"$k": ["description", "completed", "_id"], "_id": {"$o": "6182627017462641b80439d3"}, "completed": false, "description": "Start MangoDB"}
{"$k": ["description", "completed", "_id"], "_id": {"$o": "6182626817462641b80439d2"}, "completed": false, "description": "Start YugabyteDB"}
(3 rows)
yugabyte=#
The storage is very simple: one table with one JSONB column.
SQL Statements
I'll track the statements used with the pg_stat_statements extension which is enabled by default in YugabyteDB. Just resetting in my lab:
yugabyte=# select pg_stat_statements_reset();
pg_stat_statements_reset
--------------------------
(1 row)
In the application I refresh, mark "Play" as completed, Insert a new task, delete it, and refresh multiple times.
yugabyte=# select calls,query from pg_stat_statements;
calls | query
-------+----------------------------------------------------------------------------------------------------------------------
1 | select query from pg_stat_statements
1 | INSERT INTO "todo"."tasks" (_jsonb) VALUES ($1)
1 | DELETE FROM "todo"."tasks" WHERE _jsonb->$1 = $2
1 | SELECT _jsonb FROM "todo"."tasks" WHERE _jsonb->$1 = $2
1 | UPDATE "todo"."tasks" SET _jsonb = $1 WHERE _jsonb->'_id' = $2
7 | SELECT _jsonb FROM "todo"."tasks"
1 | select pg_stat_statements_reset()
11 | SELECT COUNT(*) > 0 FROM information_schema.columns WHERE column_name = $1 AND table_schema = $2 AND table_name = $3
11 | SELECT COUNT(*) > 0 FROM information_schema.tables WHERE table_schema = $1 AND table_name = $2
1 | select * from pg_stat_statements
(10 rows)
There are many things to optimize here. Reading the information_schema many time is not the most efficient. We need an index on the ID (which is in the JSON document). And updates are re-writing the whole document. I'll think about this and probably contribute to this open-source project. Probably the ID should be in another column, that we can properly index and shard, rather than scanning the whole table or adding an additional index.
JSONB indexing
As YugabyteDB plugs the distributed storage to a full PostgreSQL query layer, we can even index this Here is the table with just one JSONB (it was created by with CREATE TABLE "todo"."tasks" (_jsonb jsonb);
):
yugabyte=# select * from todo.tasks;
_jsonb
----------------------------------------------------------------------------------------------------------------------------------
{"$k": ["description", "completed", "_id"], "_id": {"$o": "618282aea9a2a141efa3c401"}, "completed": false, "description": "Franck"}
(1 row)
If I select one key, it has to scan the whole table:
yugabyte=# explain analyze select * from todo.tasks where _jsonb->>'_id' = '{"$o": "618282aea9a2a141efa3c401"}';
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on tasks (cost=0.00..105.00 rows=1000 width=32) (actual time=0.815..0.817 rows=1 loops=1)
Filter: ((_jsonb ->> '_id'::text) = '{"$o": "618282aea9a2a141efa3c401"}'::text)
Planning Time: 0.048 ms
Execution Time: 0.874 ms
(4 rows)
But I can create a unique index on it:
yugabyte=# create unique index task_pk ON todo.tasks
((_jsonb ->> '_id'::text) hash);
CREATE INDEX
Do not forget the double parenthesis (this is the PostgreSQL syntax):
- one for the list of columns to index,
- and one because it is not directly a column but a value derived from the JSON document.
The HASH modifier is optional here because hash sharding is the default on the first column. And this is what we want on this generated identifier. But if you have range scans, you could change it to ASC or DESC.
Now I have a fast access to the document:
yugabyte=# explain analyze select * from todo.tasks where _jsonb->>'_id' = '{"$o": "618282aea9a2a141efa3c401"}';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using task_pk on tasks (cost=0.00..4.12 rows=1 width=32) (actual time=13.617..13.622 rows=1 loops=1)
Index Cond: ((_jsonb ->> '_id'::text) = '{"$o": "618282aea9a2a141efa3c401"}'::text)
Planning Time: 11.593 ms
Execution Time: 13.706 ms
(4 rows)
This means that a query with the key will go to the right tablet (the tables and indexes are automatically sharded in YugabyteDB) and to the right row. We are ready to scale out and keep the low latency.
Index Only Indexes
The previous execution plan may require two RPC on a scale-out database: one to the index and one to the table. Because, for better agility, all indexes are global in YugabyteDB. And, of course, with no compromise on strong consistency. But an Index Only Scan would be better. It is easy to acheive (I explained in How a Distributed SQL Database Boosts Secondary Index Queries with Index Only Scan):
yugabyte=# create unique index task_pk ON todo.tasks
((_jsonb ->> '_id'::text) hash) include (_jsonb);
CREATE INDEX
And here is the fastest access you can have to a document on a SQL distributed database, still with the full agility of a JSON document:
yugabyte=# explain analyze select * from todo.tasks where _jsonb->>'_id' = '{"$o": "618282aea9a2a141efa3c401"}';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Index Only Scan using task_pk on tasks (cost=0.00..4.12 rows=1 width=32) (actual time=2.559..2.561 rows=1 loops=1)
Index Cond: (((_jsonb ->> '_id'::text)) = '{"$o": "618282aea9a2a141efa3c401"}'::text)
Heap Fetches: 0
Planning Time: 5.229 ms
Execution Time: 2.607 ms
(5 rows)
Now you have all in the index and don't need the table at all. In PostgreSQL you have no choice as you need to maintain the heap table. But YugabyteDB stores tables in LSM trees where rows are organized for fast access on the primary key. When storing documents into a SQL table, it is better to have the identifier in its own column, an integer or uuid, to really have a (key uuid, value jsonb)
schema. I'll suggest that to the MangoDB project, as well as some other optimizations for PostgreSQL or YugabyteDB. But the essence is there: a simple MongoDB API to distributed SQL database.