To prepare a demo for VoxxedDays Ticino I have created a Kubernetes cluster on Amazon EKS with YugabyteDB, Open Source Distributed SQL, installed as in this post.
YugabyteDB has a Cassandra-compatible API (called YCQL) and a PostgreSQL-compatible one (called YSQL). What if you have an application written for MongoDB that you want to move to a SQL database? You probably want to rewrite it to benefit from SQL features but, rather than re-writing all at the same time, you can add a MongoDB-compatible API on top of PostgreSQL thanks to FerretDB.
Let's add that to my Kubernetes cluster. This is a test. FerretDB is still in early stage. But maybe this combination of a MongoDB proxy and a Distributed SQL database can be an open source alternative to MongoDB or DocumentDB in the future.
Run FerretDB connected to YugabyteDB
To test this possibility I'll start a single pod using the ferretdb/ferretdb
image whish runs ferretdb
. I set its endpoint with --listen-addr
and the PostgreSQL connection to YugabyteDB with --postgresql-url
:
$ kubectl run --image ferretdb/ferretdb ferretdb -- \
--listen-addr=0.0.0.0:27017 \
--postgresql-url="postgres://yugabyte:yugabyte@yb-tservers.yb-demo-eu-west-1a.svc.cluster.local:5433/yugabyte"
pod/ferretdb created
It connects to the ClusterIP service yb-tservers
that distributes the connections to all YugabyteDB Servers in one AZ eu-west-1a
.
I'm testing this from a mongosh
shell:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ferretdb 1/1 Running 0 11m 192.168.50.78 ip-192-168-48-228.eu-west-1.compute.internal <none> <none>
$ kubectl run -it --rm --restart=Never --image mongo mongosh -- \
mongosh mongodb://192.168.50.78:27017/mongodb
I can create a collection
mongodb> db.createCollection('demo');
{ ok: 1 }
The corresponding table is created on YugabyteDB:
It is automatically distributed (sharded and replicated):
I got the IP address from the pod, here. I'll create a service.
Run a Load Balancer for the FerretDB pods
Going though a proxy adds some latency. Then you probably need want to have more pods. I'll create a ReplicaSet for it later but for the moment let's create a LoadBalancer to valudate this idea:
$ kubectl expose pod ferretdb --port=27017 --target-port=27017 \
--name=ferretdb-service --type=LoadBalancer
service/ferretdb-service exposed
I can test the connection through this service:
$ kubectl run -it --rm --restart=Never --image mongo mongosh -- \
mongosh mongodb://ferretdb-service.default.svc.cluster.local:27017/mongodb --quiet
If you don't see a command prompt, try pressing enter.
mongodb> show collections;
demo
mongodb>
I can now use the FerretDB endpoint from the internet, getting the EXTERNAL-IP address:
$ kubectl get services --field-selector "metadata.name=ferretdb-service" -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'
abebf50dee88a462ab90d371dfdaa44f-1345053103.eu-west-1.elb.amazonaws.com
I can connect with mongosh
from my laptop:
$ mongosh mongodb://abebf50dee88a462ab90d371dfdaa44f-1345053103.eu-west-1.elb.amazonaws.com:27017/mongodb --quiet
mongodb> db.demo.insertOne({ "message": "Hello World" });
{
acknowledged: true,
insertedId: ObjectId("63c85f7aa3db61ccd4324a04")
}
mongodb> db.demo.findOne();
{ _id: ObjectId("63c85f7aa3db61ccd4324a04"), message: 'Hello World' }
mongodb>
Deployement as ReplicaSet
Now that I have tested by manually creating a pod and service, I'll automate that. I remove those created above:
$ kubectl delete service ferretdb-service
service "ferretdb-service" deleted
$ kubectl delete pod ferretdb
pod "ferretdb" deleted
Here is the YAML declaration for a ReplicaSet of FerretDB:
cat > ferretdb-rs.yaml <<'YAML'
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: ferretdb
namespace: yb-demo-eu-west-1a
labels:
run: ferretdb
spec:
replicas: 3
selector:
matchLabels:
run: ferretdb
template:
metadata:
labels:
run: ferretdb
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- eu-west-1a
containers:
- name: ferretdb
image: ferretdb/ferretdb
args:
- --listen-addr=0.0.0.0:27017
- --postgresql-url=postgres://yugabyte:yugabyte@yb-tservers.yb-demo-eu-west-1a.svc.cluster.local:5433/yugabyte
YAML
kubectl apply -f ferretdb-rs.yaml
And the LoadBalancer:
cat > ferretdb-lb.yaml <<'YAML'
apiVersion: v1
kind: Service
metadata:
name: ferretdb-service
namespace: yb-demo-eu-west-1a
labels:
run: ferretdb
spec:
type: LoadBalancer
selector:
run: ferretdb
ports:
- protocol: TCP
port: 27017
targetPort: 27017
YAML
kubectl apply -f ferretdb-lb.yaml
I have created those in one AZ (eu-west-1a
), connecting to the YugabyteDB Tablet Servers in the same AZ.
Connect with Mongo or Postgres clients
I get the mongodb endpoint address:
$ kubectl get services \
-o jsonpath='mongodb://{.items[0].status.loadBalancer.ingress[0].hostname}:{.items[0].spec.ports[0].port}/mongodb' \
--field-selector "metadata.name=ferretdb-service" \
-n yb-demo-eu-west-1a
mongodb://a4ee051c7cf844538b1bc90fbd3ef56a-565835635.eu-west-1.elb.amazonaws.com:27017/mongodb
Using this, I load some sample data:
$ curl -s https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.json |
jq -c ".[]" |
while read json
do
echo "db.covid.insertOne( $json )"
done | mongosh mongodb://a4ee051c7cf844538b1bc90fbd3ef56a-565835635.eu-west-1.elb.amazonaws.com:27017/mongodb
Now getting the PostgreSQL db address:
$ kubectl get services \
-o jsonpath='postgres://{.items[0].status.loadBalancer.ingress[0].hostname}:{.items[0].spec.ports[?(@.name=="tcp-ysql-port")].port}/yugabyte' \
--field-selector "metadata.name=yb-tserver-service" \
-n yb-demo-eu-west-1a
postgres://a4ee051c7cf844538b1bc90fbd3ef56a-565835635.eu-west-1.elb.amazonaws.com:5433/yugabyte
I can connect with psql
:
]$ psql postgres://a62cf0d1ce2c94b6a8fb7d169da3b75b-2104684727.eu-west-1.elb.amazonaws.com:5433
Pager usage is off.
psql (13.7, server 11.2-YB-2.17.0.0-b0)
Type "help" for help.
yugabyte=# select "_jsonb"->>'_id' as collection, "_jsonb"->>'table' as tablename
from mongodb."_ferretdb_database_metadata";
collection | tablename
------------+----------------
demo | demo_aefd3536
covid | covid_e0d7cdae
(2 rows)
yugabyte=# select count(*) from mongodb.covid_e0d7cdae;
count
-------
237
(1 row)
yugabyte=# select "_jsonb"->>'location' as location
,("_jsonb"->>'new_cases_per_million')::numeric as new_cases_per_million
from mongodb.covid_e0d7cdae
where ("_jsonb"->>'last_updated_date')::date > now()- interval '5 days'
order by 2 desc nulls last limit 10;
location | new_cases_per_million
------------+-----------------------
Brunei | 5449.864
Cyprus | 3742.158
San Marino | 2374.592
Italy | 1423.841
Uruguay | 1096.764
Japan | 1022.971
Tonga | 963.815
Taiwan | 835.377
Mauritius | 783.391
Ireland | 781.986
(10 rows)
There is still a lot to do to get the right performance. FerretDB creates a table with a single JSONB column with no primary key and adds an index on the object ID attribute of the JSON document. Then YugabyteDB creates an internal ID to distribute on it. It would be better to have it a the sharding key but that must be out of the JSON document.
There is still work in progress in FerretDB, like Create primary key index for _id automatically #1384
. If a Primary Key is added with with the ID, it will become the sharding key in YugabyteDB. Another optimization will be to avoid reading information_schema.columns
which is slow on YugabyteDB (the catalog must be shared by all nodes). This will be optimized on YugabyteDB (#7745). If it is still a scalability issue, there's also the possibility to fork the PostgreSQL handler to optimize it for YugabyteDB. All this is open source 🤩 YugabyteDB and FerretDB are Apache License 2.0