Distributing PostgreSQL on Amazon Elastic Kubernetes

Franck Pachot - Jul 21 - - Dev Community

After attending the Global AWS Heroes Summit in Seattle, I'm traveling through the US to present and demonstrate YugabyteDB at the following locations:

I would be glad to meet you there, and I can share the slides with you afterward. I will also do a live demonstration of YugabyteDB's elasticity and resilience using a great platform for a cloud-native database: Amazon Elastic Kubernetes Service (Amazon EKS). Here is what I will most likely present during the demonstration.

I use an AWS user with the minimum IAM policies required by eksctl.

I'll create an Amazon EKS cluster using the eksctl command-line tool. I'll configure the cluster with specific node types, sizes, and regions, and then update my .kube/config to manage the cluster using kubectl.

eksctl create cluster --name yb-demo1 --tags eks=yb-demo1 \
 --version 1.27 \
 --node-type c5.2xlarge --nodes 4 --nodes-min 0 --nodes-max 4 \
 --region eu-west-1 --zones eu-west-1a,eu-west-1b,eu-west-1c \
 --nodegroup-name standard-workers --managed --dumpLogs

aws eks update-kubeconfig --region eu-west-1 --name yb-demo1
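
For a repeatable demo, the same cluster could also be described declaratively. The following is a minimal sketch, assuming eksctl's ClusterConfig schema (eksctl.io/v1alpha5); the file name yb-demo1.cluster.yaml is an arbitrary choice for this example, and the values simply mirror the command-line flags above.

```shell
# Sketch: write the cluster definition to a file (hypothetical file name).
cat > yb-demo1.cluster.yaml <<'YAML'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: yb-demo1
  region: eu-west-1
  version: "1.27"
  tags:
    eks: yb-demo1
availabilityZones: [eu-west-1a, eu-west-1b, eu-west-1c]
managedNodeGroups:
  - name: standard-workers
    instanceType: c5.2xlarge
    desiredCapacity: 4
    minSize: 0
    maxSize: 4
YAML

# To create the cluster from the file (not run here):
# eksctl create cluster -f yb-demo1.cluster.yaml
```

Keeping the definition in a file makes it easier to recreate the same lab for each city.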


I'll set up a single-region highly available YugabyteDB cluster on Amazon EKS. I'll automate the creation of storage classes and StatefulSets for each availability zone within a specified region, eu-west-1, ensuring the proper configuration of multi-AZ deployments. This process involves generating necessary Kubernetes manifests, applying them, and using Helm to deploy the YugabyteDB database.

cd
region=eu-west-1
masters=$(for az in {a..c} ; do echo "yb-master-0.yb-masters.yb-demo-${region}${az}.svc.cluster.local:7100" ; done | paste -sd,)
placement=$(for az in {a..c} ; do echo "aws.${region}.${region}${az}" ; done | paste -sd,)
echo
echo "$masters"
echo "$region"
echo "$placement"

# create per-AZ StorageClass and StatefulSet
{
for az in {a..c} ; do

{ cat <<CAT
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard-${region}${az}
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: ${region}${az}
CAT
} > ${region}${az}.storageclass.yaml
echo "kubectl apply -f ${region}${az}.storageclass.yaml"

{ cat <<CAT
isMultiAz: True
AZ: ${region}${az}
masterAddresses: "$masters"
storage:
  ephemeral: true
  master:
    storageClass: "standard-${region}${az}"
  tserver:
    storageClass: "standard-${region}${az}"
replicas:
  master: 1
  tserver: 2
  totalMasters: 3
resource:
  master:
   requests:
    cpu: 0.5
    memory: 1Gi
  tserver:
   requests:
    cpu: 1
    memory: 2Gi
gflags:
  master:
    placement_cloud: "aws"
    placement_region: "${region}"
    placement_zone: "${region}${az}"
    ysql_num_shards_per_tserver: 2
    tserver_unresponsive_timeout_ms: 10000
    hide_dead_node_threshold_mins: 15
  tserver:
    placement_cloud: "aws"
    placement_region: "${region}"
    placement_zone: "${region}${az}"
    ysql_num_shards_per_tserver: 2
    ysql_enable_auth: true
    yb_enable_read_committed_isolation: true
    ysql_suppress_unsafe_alter_notice: true
    ysql_beta_features: true
    allowed_preview_flags_csv: ysql_yb_ash_enable_infra,ysql_yb_enable_ash
    ysql_yb_ash_enable_infra: true
    ysql_yb_enable_ash: true
    transaction_rpc_timeout_ms: 15000
    transaction_max_missed_heartbeat_periods: 40
CAT
} > ${region}${az}.overrides.yaml
echo "kubectl create namespace yb-demo-${region}${az}"

echo "helm install yb-demo yugabytedb/yugabyte --namespace yb-demo-${region}${az} -f ${region}${az}.overrides.yaml --wait \$* ; sleep 5"

done

echo "kubectl exec -it -n yb-demo-${region}a yb-master-0 -- yb-admin --master_addresses $masters modify_placement_info $placement 3"

} | tee yb-demo-install.sh


This has generated the install script. I'll start by adding the YugabyteDB Helm repository and updating it, then proceed to deploy YugabyteDB using the script.

# install yugabyte Helm chart
helm repo add yugabytedb https://charts.yugabyte.com
helm repo update
helm search repo yugabytedb/yugabyte | cut -c-75

export PATH="$HOME/cloud/aws-cli:$PATH"

. yb-demo-install.sh
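
The generator ends each helm install line with an escaped \$*, so the literal $* is written to yb-demo-install.sh and expanded only when the script runs. The sketch below reproduces that mechanism on a throwaway file, with a leading echo so that nothing is installed; the --timeout 15m argument is only an illustration of an extra Helm option.

```shell
# Generate a one-line script the same way: \$* is escaped inside the
# double quotes, so the file contains a literal $*.
script=$(mktemp)
echo "echo helm install yb-demo yugabytedb/yugabyte --wait \$*" > "$script"

# Show the generated line.
cat "$script"

# Run it with extra arguments: $* expands to them at run time.
dry_run=$(sh "$script" --timeout 15m)
echo "$dry_run"
# prints: helm install yb-demo yugabytedb/yugabyte --wait --timeout 15m

rm -f "$script"
```

With bash, sourcing the script with arguments (for example `. yb-demo-install.sh --timeout 15m`) sets the positional parameters in the same way.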


I'll share the internal web console with the attendees; for the demo, it is exposed to the public web. This lets them view the deployment and the tables, and perform read/write operations on their own. Here's how I generate a QR code for the URL of this UI service:

# Build the public URL of the yb-master-ui service in eu-west-1a
url=$(kubectl get svc -n yb-demo-eu-west-1a --field-selector "metadata.name=yb-master-ui" -o jsonpath='http://{.items[0].status.loadBalancer.ingress[0].hostname}:{.items[0].spec.ports[?(@.name=="http-ui")].port}')
# Print the QR code in the terminal, followed by the URL itself
echo "yb-master-ui"
qrencode -t ANSIUTF8 -s 1 -m 0 "$url"
echo "$url"

PgBench

I'll use kubectl to start a PostgreSQL pod in eu-west-1a, set the necessary environment variables to connect to the load balancer in this availability zone, and execute the pgbench tool to create the tables with primary and foreign keys. Then I'll load the tables at scale 150, using YugabyteDB non-transactional writes for a faster bulk load.

kubectl -n yb-demo-eu-west-1a run -it eu-west-1a --image=postgres:17beta2 --rm=true bash

export PGPASSWORD=yugabyte PGUSER=yugabyte PGDATABASE=yugabyte PGPORT=5433 PGHOST=yb-tservers

pgbench -iIdtpf

PGOPTIONS="-c yb_disable_transactional_writes=on -c yb_default_copy_from_rows_per_transaction=0" pgbench -iIg --scale 150
psql -c "select tablename, pg_size_pretty(pg_table_size(tablename::regclass)) from pg_tables where schemaname='public'"
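
The two pgbench initialization commands above differ only in their -I (init steps) string. As a reading aid, here is a small sketch that decodes such a string; describe_init_steps is a hypothetical helper written for this post, and the letter meanings follow the pgbench documentation.

```shell
# Decode a pgbench -I (--init-steps) string, one letter at a time.
describe_init_steps() {
  steps=$1
  while [ -n "$steps" ]; do
    rest=${steps#?}            # everything after the first letter
    step=${steps%"$rest"}      # the first letter
    case $step in
      d) echo "d: drop existing pgbench tables" ;;
      t) echo "t: create tables" ;;
      g) echo "g: generate data on the client side (COPY)" ;;
      G) echo "G: generate data on the server side" ;;
      v) echo "v: vacuum" ;;
      p) echo "p: create primary keys" ;;
      f) echo "f: create foreign keys" ;;
      *) echo "$step: unknown step" ;;
    esac
    steps=$rest
  done
}

describe_init_steps dtpf   # first command: schema only
describe_init_steps g      # second command: data load
```

Splitting the steps this way lets the session settings passed through PGOPTIONS apply to the data load only.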

I'll run the Simple Update benchmark from 20 clients.


pgbench --client 20 -nN --progress 5 --max-tries 10 -T 3600


Elasticity

To demonstrate how we can scale without downtime, I'll double the number of tablet servers in each availability zone by simply scaling the StatefulSets:

kubectl scale -n yb-demo-eu-west-1a statefulset yb-tserver --replicas=4
kubectl scale -n yb-demo-eu-west-1b statefulset yb-tserver --replicas=4
kubectl scale -n yb-demo-eu-west-1c statefulset yb-tserver --replicas=4

The PgBench application will continue to run, and the throughput will increase. However, PgBench keeps the static set of connections it opened at startup, so it will not fully benefit from the new nodes: only the reads and writes will be rebalanced. Restarting PgBench will also balance the connections across the new nodes, and the throughput will double. Of course, with a connection pool, you wouldn't need to restart it.
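
The three kubectl scale commands above follow the same per-AZ pattern as the install script, so they can be generated with a loop. This is a minimal sketch; scale_tservers is a hypothetical helper name, and it only prints the commands so they can be reviewed before running.

```shell
# Print (not run) the kubectl commands that scale yb-tserver in every AZ.
region=eu-west-1
scale_tservers() {
  replicas=$1
  for az in a b c; do
    echo "kubectl scale -n yb-demo-${region}${az} statefulset yb-tserver --replicas=${replicas}"
  done
}

scale_tservers 4            # review the commands first
# scale_tservers 4 | sh     # then pipe them to a shell to apply
```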

Resilience

To demonstrate resilience to failure, one node can be removed by scaling down one StatefulSet. I will even scale down an AZ to zero to simulate a zone failure:

kubectl scale -n yb-demo-eu-west-1c statefulset yb-tserver --replicas=3
kubectl scale -n yb-demo-eu-west-1c statefulset yb-tserver --replicas=0
kubectl scale -n yb-demo-eu-west-1c statefulset yb-tserver --replicas=4

The PgBench application should continue without any errors. The only consequence of the failure is a few seconds of decreased throughput due to various TCP/IP timeouts for connections to the failed nodes.

Cleanup

You can experiment as much as you want with a lab like this. Don't forget to terminate the EKS cluster:

aws cloudformation delete-stack --stack-name eksctl-yb-demo1
aws cloudformation delete-stack --stack-name eksctl-yb-demo1-cluster
aws cloudformation delete-stack --stack-name eksctl-yb-demo1-nodegroup-standard-workers
eksctl delete cluster --region=eu-west-1 --name=yb-demo1

At those locations, I won't be presenting alone. Colleagues from YugabyteDB, AWS partners, and Snowflake and Striim speakers will also be presenting. It will be an excellent opportunity to network, learn, and share.

Multi-Region Database Roadshow

Join us at a city near you for an evening of tech talks, show-and-tell sessions, and demos accompanied by food, drinks, prizes, and giveaways.

info.yugabyte.com

Explanations of some parameters

I default to two tablets per node because I intend to double the number of nodes. This is set by ysql_num_shards_per_tserver=2.
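
To make the reasoning concrete, here is a rough calculation. It is a sketch under stated assumptions: the tables are created while 6 tablet servers are up (2 per AZ, as in the overrides file), the replication factor is 3, tablets are spread evenly, and no tablet splitting happens during the demo.

```shell
# Assumed inputs (see the assumptions in the paragraph above).
shards_per_tserver=2        # ysql_num_shards_per_tserver
tservers_at_creation=6      # 2 tablet servers x 3 AZs
replication_factor=3

tablets_per_table=$(( shards_per_tserver * tservers_at_creation ))
tablet_replicas=$(( tablets_per_table * replication_factor ))

echo "tablets per table: $tablets_per_table"
echo "tablet replicas per table: $tablet_replicas"
for tservers in 6 12; do
  echo "with $tservers tservers: $(( tablets_per_table / tservers )) leader(s) and $(( tablet_replicas / tservers )) replicas per tserver"
done
```

Under these assumptions, each table has 12 tablets, so after doubling to 12 tablet servers every server can still hold one tablet leader per table.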

PgBench creates the table (t) and adds the primary key (p) with an ALTER TABLE. This raises a notice in YugabyteDB (I'm running version 2024.1) because it is not an online operation. I suppress the notice for clarity. The PgBench initialization still shows the following when creating the table:

WARNING:  storage parameter fillfactor is unsupported, ignoring

This message is misleading: fillfactor is not a missing feature, it simply doesn't apply to the LSM trees that YugabyteDB uses instead of PostgreSQL heap tables.

In the version I'm running, ANALYZE is still in beta, so I set ysql_beta_features=true to avoid the notice.

I enabled Active Session History, which is still in preview, with the following:

--allowed_preview_flags_csv=ysql_yb_ash_enable_infra,ysql_yb_enable_ash
--ysql_yb_ash_enable_infra=true
--ysql_yb_enable_ash=true

I enable the Read Committed mode, which is not enabled by default in this version and is the isolation level expected by PgBench: yb_enable_read_committed_isolation=true. This mode also reduces the probability of errors (on node failure or clock skew) because it allows transparent retries of statements within a transaction.

By default, the transaction timeout is 5 seconds, defined as transaction_rpc_timeout_ms=5000 for the remote call and transaction_max_missed_heartbeat_periods=10 (heartbeats are every transaction_heartbeat_usec=500000) for the transaction coordinator.

I bumped this to 15 seconds to avoid the following error, which long transactions can hit while waiting for the 15-second TCP timeout:

pgbench: error: client 6 script 0 aborted in command 8 query 0: 
ERROR:  UpdateTransaction RPC (request call id 2745679) to 192.168.87.205:9100 timed out after 4.972s

(If you were at the demo in Palo Alto or Austin, I hadn't set those parameters and got this error. I've added them for the Boston demo.)
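
The timeout arithmetic can be checked with a few lines of shell. This is only a sketch of the conversion described above: the coordinator timeout is taken as the number of missed heartbeat periods multiplied by the heartbeat interval, and the function names are hypothetical.

```shell
# Convert the transaction timeout flags to seconds.
heartbeat_usec=500000          # transaction_heartbeat_usec (default quoted above)

coordinator_timeout_s() {      # $1 = transaction_max_missed_heartbeat_periods
  echo $(( $1 * heartbeat_usec / 1000000 ))
}
rpc_timeout_s() {              # $1 = transaction_rpc_timeout_ms
  echo $(( $1 / 1000 ))
}

echo "defaults: rpc $(rpc_timeout_s 5000)s, coordinator $(coordinator_timeout_s 10)s"
echo "demo:     rpc $(rpc_timeout_s 15000)s, coordinator $(coordinator_timeout_s 40)s"
```

With the values from the overrides file, the remote call times out after 15 seconds, and the 40 missed heartbeat periods work out to 20 seconds on the coordinator side.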

It can still happen that, during a crash or on an unstable network, a write operation completes in the tablet server's DocDB layer but the response to the YSQL layer is lost. The YSQL layer then transparently retries it, with a safeguard to avoid duplicate operations. In this case, the following error can be raised:

pgbench: error: client 10 script 0 aborted in command 7 query 0: 
ERROR:  Duplicate request 126674 from client 08b09407-d78e-4c10-982f-be446d9a4e87 (min running 126674)

By default, the YB-Master marks an unreachable tablet server as "DEAD" after one minute. To visualize this quickly during the demo, I reduce the delay with the tserver_unresponsive_timeout_ms flag (set to 10000, that is 10 seconds, in the overrides file above). Additionally, I set hide_dead_node_threshold_mins=15 so that the tablet server disappears from the list after 15 minutes instead of the default 24 hours. This helps to illustrate that a node coming back after 15 minutes is considered a new node, because it has passed the 15-minute threshold defined by follower_unavailable_considered_failed_sec.
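
The three thresholds use three different units, which is easy to misread. The sketch below converts them side by side; the first two values are the ones set in the master gflags of the overrides file, and 900 seconds is assumed here as the default of follower_unavailable_considered_failed_sec.

```shell
# Values from the overrides file, plus one assumed default.
tserver_unresponsive_timeout_ms=10000            # master gflag in the overrides file
hide_dead_node_threshold_mins=15                 # master gflag in the overrides file
follower_unavailable_considered_failed_sec=900   # assumed default (15 minutes)

echo "marked DEAD after: $(( tserver_unresponsive_timeout_ms / 1000 )) s"
echo "hidden from the list after: $hide_dead_node_threshold_mins min"
echo "considered failed after: $(( follower_unavailable_considered_failed_sec / 60 )) min"
```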
