Hello 👋, this post is about auto-instrumenting a Spring Boot app running on EKS with the OpenTelemetry (OTel) Java agent, forwarding the telemetry signals (metrics, logs, traces) to backends such as Prometheus, Loki and Tempo via the OpenTelemetry Collector, and visualizing them in Grafana. The setup is as in the diagram below:

Note that the backends are added to Grafana as datasources, which means Grafana queries them over their HTTP APIs to fetch the data.

Alright, let's get started!

Prerequisites

Ensure you have installed the aws cli, eksctl, git, docker, kubectl and helm, and that you have authenticated to AWS from the CLI. Later steps also use jq, wget and envsubst.
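
Before going further, it can help to confirm that these CLIs are actually on your PATH. Here is a minimal sketch; the helper name missing_tools is my own and is not part of any of these tools.

```shell
# Print the name of every given tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools aws eksctl git docker kubectl helm)
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All prerequisite tools found"
fi
```

This only checks that the binaries exist; it does not verify versions or AWS authentication.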

Set up the demo app

Let's clone a simple Spring Boot app from GitHub, and switch to the complete directory, where the finished application code is present.

git clone https://github.com/spring-guides/gs-rest-service.git

cd gs-rest-service/complete

We can add a logger to the controller class, so that the app writes a log line for every request.

cat > src/main/java/com/example/restservice/GreetingController.java <<EOF 
package com.example.restservice;

import java.util.concurrent.atomic.AtomicLong;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@RestController
public class GreetingController {

        private static final String template = "Hello, %s!";
        private final AtomicLong counter = new AtomicLong();

        private final Logger logger = LoggerFactory.getLogger(GreetingController.class);

        @GetMapping("/greeting")
        public Greeting greeting(@RequestParam(value = "name", defaultValue = "World") String name) {
                logger.info("Received GET greeting request for {}", name);
                return new Greeting(counter.incrementAndGet(), String.format(template, name));
        }
} 
EOF

Build the Docker image

Add a Dockerfile to containerize this application; it also downloads the OTel agent with curl. Note that I'm using Maven for the build. The app has both Maven and Gradle specs, so you can build it with Gradle too by changing ./mvnw clean package to ./gradlew clean build in the Dockerfile; in that case the COPY path needs adjusting as well, since Gradle writes its jars to build/libs instead of target.

cat > Dockerfile <<EOF
# Build
FROM eclipse-temurin:17-jdk-jammy AS build
WORKDIR /usr/app
COPY . .
RUN ./mvnw clean package

# Run
FROM eclipse-temurin:17-jre-jammy
WORKDIR /usr/app
COPY --from=build /usr/app/target/*.jar ./app.jar
RUN curl -O -L https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
RUN useradd appuser && chown -R appuser /usr/app
USER appuser
HEALTHCHECK --interval=30s --timeout=5s --start-period=5s --retries=3 CMD curl -f http://localhost:8080/greeting || exit 1
ENTRYPOINT ["java", "-javaagent:/usr/app/opentelemetry-javaagent.jar", "-jar", "/usr/app/app.jar"]
EOF
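
The Dockerfile above pulls the latest agent release, so two builds made on different days may embed different agent versions. For reproducible images you may prefer a pinned release. Below is a minimal sketch that builds the download URL, assuming GitHub's usual releases/download/<tag>/ layout with tags of the form v<version>. The function name agent_url and the version 2.10.0 are illustrative only, so check the project's releases page for the version you want.

```shell
# Build the download URL of a specific OTel Java agent release.
# Assumption: release tags look like v<version>, e.g. v2.10.0.
agent_url() {
  version=$1
  echo "https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v${version}/opentelemetry-javaagent.jar"
}

# Illustrative version only; pick one from the project's releases page.
agent_url 2.10.0
```

The printed URL can replace the latest/download URL in the RUN curl line of the Dockerfile.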

We have added a HEALTHCHECK instruction and a non-root user to the Dockerfile above, so that it passes common security checks. We can scan it with Trivy as follows.

docker run --rm -v "$PWD/Dockerfile":/tmp/Dockerfile aquasec/trivy config /tmp/Dockerfile

Add a .dockerignore file so that we do not copy unwanted files into the image.

cat > .dockerignore <<EOF
Dockerfile
manifest.yml
build
target
EOF

We can build the image now.

export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

export REGION=us-east-1

docker build -t $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/otel-auto-java-demo .
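
The image name above carries no tag, so Docker tags it latest. If you would rather use explicit tags, the ECR image reference can be composed as in this sketch. The helper name ecr_image and the account ID 123456789012 are placeholders of mine.

```shell
# Compose an ECR image reference:
#   <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
ecr_image() {
  account_id=$1
  region=$2
  repo=$3
  tag=${4:-latest}   # Docker also falls back to "latest" when no tag is given
  echo "${account_id}.dkr.ecr.${region}.amazonaws.com/${repo}:${tag}"
}

# Placeholder account ID, for illustration only.
ecr_image 123456789012 us-east-1 otel-auto-java-demo

# With real values (not run here):
# docker build -t "$(ecr_image "$ACCOUNT_ID" "$REGION" otel-auto-java-demo v1)" .
```

If you do switch to a tag other than latest, remember to use the same tag when pushing and in the deployment manifest later on.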

Push the image to ECR

The image we built can be pushed to ECR, for which we have to first create a repo there.

aws ecr create-repository --repository-name otel-auto-java-demo --region $REGION

Then, get the ECR password and log in with it from Docker.

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

You should get Login Succeeded.

The image can now be pushed to ECR with the docker cli.

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/otel-auto-java-demo

Create EKS cluster

I'm going to create an EKS cluster with the eksctl cli.

eksctl create cluster --name otel-auto-java-demo --region $REGION --zones=us-east-1a,us-east-1b

When this is done, your kubeconfig should have been updated and your current context should point to the new cluster. If you ever need to switch back to it, keep this command handy: aws eks update-kubeconfig --region $REGION --name otel-auto-java-demo.

Set up helmfile

We can deploy our tool stack with helmfile. First check the machine architecture, for example with uname -m, and then install the matching helmfile package. In my case the architecture was x86_64, hence I am installing the linux_amd64 package.

uname -m
wget https://github.com/helmfile/helmfile/releases/download/v1.0.0-rc.5/helmfile_1.0.0-rc.5_linux_amd64.tar.gz

tar -xvf helmfile_1.0.0-rc.5_linux_amd64.tar.gz 

rm LICENSE README.md README-zh_CN.md 

sudo mv helmfile /usr/bin/helmfile
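
If you are not on x86_64, the package name changes. The sketch below maps the uname -m output to the architecture suffix and prints the matching download URL. It assumes Linux and assumes that the arm64 package follows the same naming pattern as the amd64 one used above; helmfile_arch is a helper name I made up.

```shell
# Map the output of `uname -m` to the architecture suffix in helmfile package names.
helmfile_arch() {
  case "$1" in
    x86_64) echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Version taken from the commands above.
HELMFILE_VERSION=1.0.0-rc.5

if arch=$(helmfile_arch "$(uname -m)"); then
  echo "https://github.com/helmfile/helmfile/releases/download/v${HELMFILE_VERSION}/helmfile_${HELMFILE_VERSION}_linux_${arch}.tar.gz"
else
  echo "No package mapping for this machine, check the helmfile releases page"
fi
```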

Note that you would also need helm as helmfile is a wrapper around helm.

Our helmfile should look as follows.

cat > helmfile.yaml <<EOF
repositories:
- name: open-telemetry
  url: https://open-telemetry.github.io/opentelemetry-helm-charts
- name: prometheus-community
  url: https://prometheus-community.github.io/helm-charts
- name: grafana
  url: https://grafana.github.io/helm-charts

releases:
- name: otel-collector
  chart: open-telemetry/opentelemetry-collector
  namespace: otel-auto-java-demo
  values:
  - otel-collector-values.yaml
- name: prometheus
  chart: prometheus-community/prometheus
  namespace: otel-auto-java-demo
  values:
  - prometheus-values.yaml
- name: loki
  chart: grafana/loki
  namespace: otel-auto-java-demo
  values:
  - loki-values.yaml
- name: tempo
  chart: grafana/tempo-distributed
  namespace: otel-auto-java-demo
  values:
  - tempo-values.yaml
- name: grafana
  chart: grafana/grafana
  namespace: otel-auto-java-demo
  values:
  - grafana-values.yaml
EOF

Set up Helm values

This section shows the different values files that we are using in the helmfile.

OTEL collector

Let's begin with the OTel Collector values. Some sections, such as receivers, are omitted because they come from the chart's default values file anyway; Helm merges our values on top of those defaults.

cat > otel-collector-values.yaml <<EOF
image:
  repository: "otel/opentelemetry-collector-contrib"
mode: deployment
resources:
  limits:
    cpu: 250m
    memory: 512Mi
config:
  processors:
    batch:
      send_batch_max_size: 500
      send_batch_size: 50
      timeout: 5s
  exporters:
    otlphttp/prometheus:
      endpoint: "http://prometheus-server/api/v1/otlp"
    loki:
      endpoint: "http://loki-write:3100/loki/api/v1/push"
    otlp:
      endpoint: "http://tempo-distributor:4317"
      tls:
        insecure: true
  service:
    pipelines:
      metrics:
        receivers: [otlp]
        exporters: [otlphttp/prometheus]
      logs:
        exporters: [loki]
      traces:
        receivers: [otlp]
        exporters: [otlp]
EOF

Prometheus

We are disabling the components of the Prometheus chart that we do not need for this exercise, and we are enabling the OTLP write receiver, since the OTel Collector sends metrics to Prometheus via OTLP over HTTP.

cat > prometheus-values.yaml <<EOF
alertmanager:
  enabled: false
prometheus-pushgateway:
  enabled: false
prometheus-node-exporter:
  enabled: false
kube-state-metrics:
  enabled: false
server:
  extraFlags:
  - "enable-feature=exemplar-storage"
  - "enable-feature=otlp-write-receiver"
EOF

Loki

We are installing Loki in the simple scalable mode, and have disabled some components such as the chunks cache, results cache and gateway to keep it simple.

cat > loki-values.yaml <<EOF
# https://grafana.com/docs/loki/latest/operations/caching/
chunksCache:
  enabled: false
resultsCache:
  enabled: false
gateway:
  enabled: false

loki:
  auth_enabled: false
  limits_config:
    allow_structured_metadata: true
  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
  tracing:
    enabled: true
  querier:
    # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    max_concurrent: 4

deploymentMode: SimpleScalable

backend:
  replicas: 3
read:
  replicas: 3
write:
  replicas: 3

# Enable minio for storage
minio:
  enabled: true

# Zero out replica counts of other deployment modes
singleBinary:
  replicas: 0

ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
EOF

Tempo

These are our Tempo values. Note that we are using the tempo-distributed chart, i.e. the microservices mode.

cat > tempo-values.yaml <<EOF 
gateway:
  enabled: true
traces:
  otlp:
    grpc:
      enabled: true
EOF

Grafana

And finally grafana, with datasources provisioned.

cat > grafana-values.yaml <<EOF
service:
  type: LoadBalancer
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-server
    - name: Loki
      type: loki
      url: http://loki-read:3100
    - name: Tempo
      type: tempo
      url: http://tempo-gateway
EOF

Set up prerequisites for provisioning volumes

We need to set up a few things on AWS so that the cluster can provision the volumes that some of our tools need. First, we have to install the EBS CSI driver on the cluster.

eksctl create addon --name aws-ebs-csi-driver --cluster otel-auto-java-demo --region $REGION

We then have to grant the node group's IAM role the appropriate permissions, for which we first have to find the role. Note that the role suffix will be different in your case.

$ aws iam list-roles | jq -r '.Roles[].RoleName' | grep  otel-auto-java | grep node
eksctl-otel-auto-java-demo-nodegr-NodeInstanceRole-tUmFVlpwtUfc

We can then attach the appropriate policy to the role.

$ aws iam attach-role-policy   --role-name eksctl-otel-auto-java-demo-nodegr-NodeInstanceRole-tUmFVlpwtUfc   --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
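
To avoid copy-pasting the role name, the lookup can be captured in a variable. This is a sketch under the assumption that eksctl names the role with both the cluster name prefix and NodeInstanceRole in it, as in the output above; pick_node_role is a helper name of my own. The AWS calls are left as comments so the snippet has no side effects.

```shell
# Pick the first role name that looks like this cluster's node instance role.
# Reads one role name per line on stdin.
pick_node_role() {
  grep 'otel-auto-java' | grep 'NodeInstanceRole' | head -n 1
}

# Usage against AWS (requires credentials):
# ROLE_NAME=$(aws iam list-roles --query 'Roles[].RoleName' --output text | tr '\t' '\n' | pick_node_role)
# aws iam attach-role-policy --role-name "$ROLE_NAME" \
#   --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
```

If the cluster has more than one node group, more than one role may match, so it is worth printing ROLE_NAME before attaching the policy.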

And then we can set the gp2 storage class as the default. This is because the volume claims created by the charts look for the default storage class.

$ kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
storageclass.storage.k8s.io/gp2 patched

The word default should now appear in the storage class.

$ kubectl get storageclass
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  20h

Deploy the tool stack

Alright, with the heavy lifting done, we can install our tool stack now.

helmfile sync

Once the installation is complete, you can check the release and pod status with commands such as helm ls -A and kubectl get po -A.

Deploy the demo app

We should now deploy the demo app with the following manifest, which has both deployment and service configuration.

cat > demoapp.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-auto-java-demo
  labels:
    app: otel-auto-java-demo
  namespace: otel-auto-java-demo
spec:
  selector:
    matchLabels:
      app: otel-auto-java-demo
  template:
    metadata:
      labels:
        app: otel-auto-java-demo
    spec:
      containers:
        - name: otel-auto-java-demo
          image: $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/otel-auto-java-demo:latest
          # https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/
          env:
          - name: OTEL_EXPORTER_OTLP_ENDPOINT
            value: "http://otel-collector-opentelemetry-collector:4317"
          - name: OTEL_EXPORTER_OTLP_PROTOCOL
            value: grpc
          imagePullPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: otel-auto-java-demo
  namespace: otel-auto-java-demo
spec:
  type: ClusterIP
  selector:
    app: otel-auto-java-demo
  ports:
    - port: 80
      targetPort: 8080
EOF

Note that the manifest references the ACCOUNT_ID and REGION environment variables in the image name, so we pipe it through envsubst before applying it.

envsubst < demoapp.yaml | kubectl apply -f -
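
One thing to watch for: envsubst replaces unset variables with empty strings, so running it from a fresh shell where ACCOUNT_ID or REGION is not exported would silently produce an invalid image name. A small guard can catch that before applying. This is only a sketch, and require_env is a hypothetical helper name.

```shell
# Return non-zero (and name the culprit) if any given variable is not exported
# or is empty. envsubst only sees exported variables, hence printenv.
require_env() {
  for var in "$@"; do
    if [ -z "$(printenv "$var")" ]; then
      echo "not exported: $var" >&2
      return 1
    fi
  done
}

# Usage (not run here):
# require_env ACCOUNT_ID REGION && envsubst < demoapp.yaml | kubectl apply -f -
```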

Make some api calls to the app

We can port-forward the service, so that we can send some calls to it on localhost.

kubectl port-forward svc/otel-auto-java-demo -n otel-auto-java-demo 8080:80

Make a few api calls with curl, from a different terminal.

$ curl localhost:8080/greeting; echo
{"id":8,"content":"Hello, World!"}

$ curl "localhost:8080/greeting?name=Earth"; echo
{"id":13,"content":"Hello, Earth!"}

$ curl "localhost:8080/greeting?name=Universe"; echo
{"id":14,"content":"Hello, Universe!"}
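
To generate a bit more telemetry while the port-forward is still running, the calls can be wrapped in a small loop. A minimal sketch, where send_greetings is a helper name I made up and the names passed to it are assumed to be URL-safe (no spaces or special characters):

```shell
# Call the greeting endpoint once per name and print each response on its own line.
# $1 is the base URL, the remaining arguments are the names to greet.
send_greetings() {
  base_url=$1
  shift
  for name in "$@"; do
    curl -s "$base_url/greeting?name=$name"
    echo
  done
}

# Usage (needs the port-forward from above to be running):
# send_greetings http://localhost:8080 Earth Mars Universe
```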

When done, you may stop the port forwarding with Ctrl+C.

Monitor the app

We can now monitor our app from Grafana, which is reachable at its LoadBalancer URL.

kubectl get svc grafana -n otel-auto-java-demo
NAME      TYPE           CLUSTER-IP     EXTERNAL-IP                                                               PORT(S)        AGE
grafana   LoadBalancer   10.100.69.33   acdbcf3ef71674a0b9a8be554d62db7c-2016095508.us-east-1.elb.amazonaws.com   80:32283/TCP   73m
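
If you prefer not to read the hostname off the table, it can be extracted with a jsonpath query. A sketch, assuming the load balancer exposes a hostname (as it does in the output above) rather than an IP; grafana_url is a helper name of my own.

```shell
# Print the Grafana URL built from the LoadBalancer hostname of the service.
grafana_url() {
  host=$(kubectl get svc grafana -n otel-auto-java-demo \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
  echo "http://$host"
}

# Usage (requires access to the cluster):
# grafana_url
```

The hostname can take a minute or two to resolve after the service is created.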

Just open the URL above with http:// in the browser; the URL will be different in your case. The username is admin, and the password can be obtained from the secret as shown below. The password will be different in your case too.

$ kubectl get secret grafana -n otel-auto-java-demo -o jsonpath={.data.admin-password} | base64 -d ; echo
3cOFbY8Q5H5Qz4CHHJvJZ2a2UDvSeQnmRrFIQlsv

Go to the Explore tab, and let's try one query each for metrics, logs and traces.

Metrics preview:

Logs preview:

Traces preview:

Note that we only added two env vars to our demo app; we haven't added any extra vars such as resource attributes or service name. This is just to show that the agent picks the service name up from the artifact ID if it is not provided. As shown in the image previews above, for metrics (Prometheus) it becomes the job label, for logs (Loki) it becomes the job and service_name labels, and for traces (Tempo) it is the resource.service.name tag.

$ cat pom.xml
--TRUNCATED--
<groupId>com.example</groupId>
    <artifactId>rest-service-complete</artifactId>
--TRUNCATED--
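
If you would rather set the service name explicitly than rely on the detected one, the standard OpenTelemetry SDK environment variables OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES can be added to the deployment. Here is a minimal sketch as a strategic merge patch. The file name, the service name greeting-service and the attribute value are made-up examples, not something from the setup above.

```shell
# Write a patch that adds two env vars to the existing container.
# Containers and env entries are matched by name, so the existing
# OTEL_EXPORTER_* variables stay in place.
cat > service-name-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: otel-auto-java-demo
          env:
          - name: OTEL_SERVICE_NAME
            value: greeting-service
          - name: OTEL_RESOURCE_ATTRIBUTES
            value: deployment.environment=demo
EOF

# Apply it (requires access to the cluster, not run here):
# kubectl patch deployment otel-auto-java-demo -n otel-auto-java-demo --patch-file service-name-patch.yaml
```

After the pods restart, the labels and tags mentioned above should show the explicit name instead of the artifact ID.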

So, we were able to get the telemetry data from our Spring Boot Java app all the way to Grafana. With this as a base, we could now build custom dashboards as we like, set up correlation between the signals by modifying the datasource config, and so on.

Cleanup checklist

Delete the helm releases: helmfile destroy
Delete the PVCs: kubectl delete pvc --all -n otel-auto-java-demo
Delete the demo app (optional): kubectl delete -f demoapp.yaml
Delete the cluster: eksctl delete cluster --name otel-auto-java-demo --region $REGION
Delete the ECR repo: aws ecr delete-repository --repository-name otel-auto-java-demo --region $REGION --force

Thank you for reading :)
