CockroachDB ships with a very convenient built-in web monitoring interface, the DB Console. In the DB Console you can visualize all important health and ops metrics by using the pre-configured dashboards. Currently, the DB Console sports 12 dashboards, covering anything from Hardware and Storage metrics to SQL and Distribution. For many customers, this is a great monitoring solution.
Larger enterprises, however, usually have a separate team responsible for monitoring pretty much every component of an application, including databases, so that they can assess the application's health holistically from one centralized solution. In these cases, the same metrics that power the DB Console dashboards can be forwarded to the enterprise monitoring solution.
Recently, I worked on such an integration with Splunk. The Splunk dashboard files that emulate the DB Console are now available in our repo for everyone's benefit.
In this blog post, I demonstrate how I used the OpenTelemetry Collector to send metrics from CockroachDB to a Splunk instance, in a way you can reproduce on your own laptop using Docker.
Setup
The architecture is simple: CockroachDB --> OTEL Collector --> Splunk.
CockroachDB generates detailed time series metrics for each node in the cluster. The collector will pull these metrics from each node endpoint and push them to the Splunk HEC endpoint.
Once the metrics are in Splunk, it's just a matter of making sense of them by building the right charts and grouping them into dashboards.
CockroachDB Cluster
As is customary, we use a load balancer to interact with the CockroachDB cluster.
Create the haproxy.cfg file and save it in the current directory.
# file: haproxy.cfg
global
    maxconn 4096

defaults
    mode tcp
    timeout connect 10s
    timeout client 10m
    timeout server 10m
    option clitcpka

listen psql
    bind :26257
    mode tcp
    balance roundrobin
    option httpchk GET /health?ready=1
    server cockroach1 cockroach1:26257 check port 8080
    server cockroach2 cockroach2:26257 check port 8080
    server cockroach3 cockroach3:26257 check port 8080
    server cockroach4 cockroach4:26257 check port 8080

listen http
    bind :8080
    mode tcp
    balance roundrobin
    option httpchk GET /health?ready=1
    server cockroach1 cockroach1:8080 check port 8080
    server cockroach2 cockroach2:8080 check port 8080
    server cockroach3 cockroach3:8080 check port 8080
    server cockroach4 cockroach4:8080 check port 8080
Create the docker network and containers
# create the network bridge
docker network create --driver=bridge --subnet=172.28.0.0/16 --ip-range=172.28.0.0/24 --gateway=172.28.0.1 demo-net
# CockroachDB cluster
docker run -d --name=cockroach1 --hostname=cockroach1 --net demo-net cockroachdb/cockroach:latest start --insecure --join=cockroach1,cockroach2,cockroach3
docker run -d --name=cockroach2 --hostname=cockroach2 --net demo-net cockroachdb/cockroach:latest start --insecure --join=cockroach1,cockroach2,cockroach3
docker run -d --name=cockroach3 --hostname=cockroach3 --net demo-net cockroachdb/cockroach:latest start --insecure --join=cockroach1,cockroach2,cockroach3
docker run -d --name=cockroach4 --hostname=cockroach4 --net demo-net cockroachdb/cockroach:latest start --insecure --join=cockroach1,cockroach2,cockroach3
# initialize the cluster
docker exec -it cockroach1 ./cockroach init --insecure
# HAProxy load balancer
docker run -d --name haproxy --net demo-net -p 26257:26257 -p 8080:8080 -v `pwd`/haproxy.cfg:/etc/haproxy.cfg:ro haproxy:latest -f /etc/haproxy.cfg
At this point you should be able to open the CockroachDB Console at http://localhost:8080.
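The same HTTP port also serves the Prometheus-format metrics endpoint, `/_status/vars`, that the Collector scrapes later in this post. As a quick sanity check you can list the metric names a node exposes. The sketch below is only an illustration: `list_metric_names` is a hypothetical helper name, and it assumes the endpoint returns the standard Prometheus text format.

```shell
# list_metric_names: hypothetical helper (not part of CockroachDB).
# Reads Prometheus text-format metrics on stdin and prints each
# distinct metric name once, sorted.
list_metric_names() {
  # Drop comment lines (# HELP / # TYPE) and blank lines, then cut each
  # remaining line at the first '{' (labels) or space (value).
  grep -v -e '^#' -e '^$' | sed -e 's/[{ ].*//' | sort -u
}
```

For example, `curl -s http://localhost:8080/_status/vars | list_metric_names | head` should print the first few metric names served by whichever node HAProxy picks.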
Start a workload against the cluster to generate some metrics
# init
docker run --rm --name workload --net demo-net cockroachdb/cockroach:latest workload init tpcc 'postgres://root@haproxy:26257?sslmode=disable' --warehouses 10
# run the workload - you might want to use a separate terminal
docker run --rm --name workload --net demo-net cockroachdb/cockroach:latest workload run tpcc 'postgres://root@haproxy:26257?sslmode=disable' --warehouses 10 --tolerate-errors
With 4 nodes, you can simulate a node failure (just stop the container) and view the range activity (replication, lease-transfers, etc).
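A node failure of this kind can be scripted. The following is a minimal sketch, assuming the container names used above; `bounce_node`, its default downtime, and the `DOCKER` override are all hypothetical conveniences, not anything CockroachDB provides.

```shell
# bounce_node: hypothetical helper that stops a node container,
# keeps it down for a while, then starts it again.
# Set DOCKER=echo for a dry run that only prints the commands.
bounce_node() {
  node="$1"
  downtime="${2:-120}"   # seconds to keep the node down (assumed default)
  "${DOCKER:-docker}" stop "$node" || return 1
  sleep "$downtime"
  "${DOCKER:-docker}" start "$node"
}
```

Running `bounce_node cockroach4 300` while the workload is active gives the range-related charts something to show.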
You can optionally set up CDC to a Kafka container, configure Row-Level TTL, add more nodes, etc.
Splunk
Start a Splunk container.
docker run -d --name splunk --net demo-net -p 8088:8088 -p 8000:8000 -e "SPLUNK_START_ARGS=--accept-license" -e "SPLUNK_PASSWORD=cockroach" splunk/splunk
Log in to Splunk at http://localhost:8000 as user admin with password cockroach.
- Create a data input and token for HEC.
- In Splunk, click Settings > Data Inputs.
- Under Local Inputs, click HTTP Event Collector:
- Click Global Settings.
- For All Tokens, click Enabled if this button is not already selected.
- Uncheck the SSL checkbox.
- Click Save.
- Configure an HEC token for sending data by clicking New Token.
- On the Select Source page, for Name, enter a token name, for example "Metrics token".
- Leave the other options blank or unselected.
- Click Next.
- On the Input Settings page, for Source type, click New.
- In Source Type, set value to "otel".
- For Source Type Category, select Metrics.
- Next to Default Index, click Create a new index.
In the New Index dialog box:
- Set Index Name to "metrics_idx".
- For Index Data Type, click Metrics.
- Click Save.
- Select the newly created "metrics_idx".
- Click Review, and then click Submit.
- Copy the Token Value that is displayed. This HEC token is required for sending data.
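Before wiring up the Collector, it can be worth confirming that the token works by posting a single test metric to HEC by hand. The sketch below assumes Splunk's JSON metric event format, with `metric_name:<name>` keys under `fields`, and the plain-HTTP endpoint on port 8088 published above. The helper names and the `hec_smoke_test` metric name are hypothetical.

```shell
# hec_metric_payload: hypothetical helper that prints a one-metric HEC event.
# Assumes the JSON metric format with "metric_name:<name>" keys in "fields".
hec_metric_payload() {
  name="$1"
  value="$2"
  printf '{"event":"metric","fields":{"metric_name:%s":%s}}\n' "$name" "$value"
}

# send_test_metric: posts the payload to HEC over plain HTTP
# (SSL was unchecked in Global Settings). $1 is the HEC token.
send_test_metric() {
  token="$1"
  hec_metric_payload "${2:-hec_smoke_test}" "${3:-1}" |
    curl -s "http://localhost:8088/services/collector" \
      -H "Authorization: Splunk ${token}" \
      -d @-
}
```

Calling `send_test_metric "<your token>"` should return a short JSON status from Splunk. Because the token's default index is "metrics_idx", the test metric should land there.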
OpenTelemetry Collector
There are two Collector distributions: core and contrib. I used contrib because it includes the splunk_hec exporter.
Pre-compiled Collector binaries are available for download in the releases repo. Docker images are also available.
Create the config.yaml file and save it in the current directory.
Make sure to replace the TOKEN placeholder with the HEC token you created in the previous step.
# file: config.yaml
---
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'cockroachdb'
          metrics_path: '/_status/vars'
          scrape_interval: 10s
          scheme: 'http'
          tls_config:
            insecure_skip_verify: true
          static_configs:
            - targets:
                - cockroach1:8080
                - cockroach2:8080
                - cockroach3:8080
                - cockroach4:8080
              labels:
                cluster_id: 'cockroachdb'
exporters:
  splunk_hec:
    source: otel
    sourcetype: otel
    index: metrics_idx
    max_connections: 20
    disable_compression: false
    timeout: 10s
    tls:
      insecure_skip_verify: true
    token: TOKEN
    endpoint: "http://splunk:8088/services/collector"
service:
  pipelines:
    metrics:
      receivers:
        - prometheus
      exporters:
        - splunk_hec
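If you would rather not edit the file by hand, the token placeholder can be substituted from the shell. This is only a sketch: `set_hec_token` is a hypothetical helper, and it assumes the token is a plain UUID-style string with no `/` or `&` characters that would confuse `sed`.

```shell
# set_hec_token: hypothetical helper that rewrites the "token:" line
# of the given config file in place, keeping its indentation.
# Assumes the token contains no '/' or '&' characters.
set_hec_token() {
  file="$1"
  token="$2"
  # Write to a temp file and move it back, which avoids the
  # GNU/BSD differences of "sed -i".
  sed "s/^\( *token: \).*/\1${token}/" "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}
```

Usage: `set_hec_token config.yaml "<your token>"`.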
Start the Collector
docker run -d --name otel --net demo-net -v `pwd`/config.yaml:/etc/config.yaml ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib --config=/etc/config.yaml
CockroachDB metrics should now be pulled by the Prometheus receiver and forwarded to Splunk via the Splunk HEC exporter.
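If nothing shows up in Splunk, the Collector's own log is the first place to look. The sketch below assumes the Collector's default console log format, where the level (`info`, `warn`, `error`) appears as its own whitespace-separated field. `collector_problems` is a hypothetical helper name.

```shell
# collector_problems: hypothetical helper; reads Collector log lines on
# stdin and keeps only those whose level field is "error" or "warn".
collector_problems() {
  grep -E '(^|[[:space:]])(error|warn)([[:space:]]|$)'
}
```

Usage: `docker logs otel 2>&1 | collector_problems`. The `2>&1` is there because the Collector writes its log to stderr. No output suggests that scraping and exporting are running without complaints.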
Demo
After a few minutes, you can run the queries below in Splunk as a quick test that data is being received correctly.
In Splunk, click on Apps, then Search & Reporting.
Enter the commands below:
# check what metrics we're receiving
| mcatalog values(metric_name) WHERE index="metrics_idx"
# preview the data in its raw format
| mpreview index="metrics_idx"
# execute the query to show the SQL Statements
| mstats
rate_sum(sql_select_count) as select,
rate_sum(sql_insert_count) as insert,
rate_sum(sql_update_count) as update,
rate_sum(sql_delete_count) as delete
where index="metrics_idx" span=10s
If Splunk shows data, the pipeline is working correctly and you can load the dashboards.
- Click on Dashboards, then on Create New Dashboard.
- In the pop-up window:
- In Dashboard Title, set "CockroachDB Overview".
- Select Classic Dashboard.
- Click Create.
- The Dashboard is now in edit mode. Click on Source.
- Replace the current content with the XML in the overview.xml file in the repo.
- Click Save.
- Repeat for every Dashboard file in the repo directory.
Here are a few screenshots:
Of course, you can create your very own dashboards, too:
The same Collector can be used for integration with many other sources and targets. It also features processors, so you can pre-process the data (say, filtering) before pushing it to your target solution.