How to monitor and alert on Nginx ingress in Kubernetes

Brian Neville-O'Neill - Feb 10 '23 - Dev Community

In this blog post, we will discuss how to set up monitoring and alerting for Nginx ingress in a Kubernetes environment.

We will cover the installation and configuration of ingress-nginx, Prometheus, and Grafana, and the setup of alerts for key Ingress metrics.

Prerequisites:

  • A Kubernetes cluster
  • Helm v3

Install Prometheus and Grafana

In this step, we will install Prometheus to collect metrics, and Grafana to visualize and create alerts based on those metrics.

Let’s install the kube-prometheus-stack Helm chart by running the commands below in your terminal. This sets up Grafana, Prometheus, and other monitoring components.

# Add and update the prometheus-community helm repository.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

cat <<EOF | helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--create-namespace -n monitoring -f -

grafana:
  enabled: true
  adminPassword: "admin"
  persistence:
    enabled: true
    accessModes: ["ReadWriteOnce"]
    size: 1Gi
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.localdev.me
EOF

Let’s check that the installed components are up and running:

kubectl get pods -n monitoring

NAME                                             READY   STATUS    RESTARTS   AGE
kube-prometheus-stack-grafana-7bb55544c9-qwkrg   3/3     Running   0          3m38s
prometheus-kube-prometheus-stack-prometheus-0    2/2     Running   0          3m14s
...

Looks great, let’s proceed to the next section.

Install & configure Ingress Nginx

In this step, we will install and configure the Nginx ingress controller and enable the metrics endpoint so that Prometheus can scrape it.

  1. Let’s install ingress-nginx into our cluster using the command below:
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set controller.metrics.enabled=true \
  --set controller.metrics.serviceMonitor.enabled=true \
  --set controller.metrics.serviceMonitor.additionalLabels.release="kube-prometheus-stack"

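Here we set serviceMonitor.additionalLabels to release: kube-prometheus-stack so that Prometheus can discover the ServiceMonitor and automatically scrape metrics from it. By default, the Prometheus instance deployed by kube-prometheus-stack only selects ServiceMonitors labelled with its own Helm release name.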

  2. Once the chart is installed, deploy a sample app, podinfo, into the default namespace.
helm install --wait podinfo --namespace default \
oci://ghcr.io/stefanprodan/charts/podinfo
  3. Now, create an Ingress for the podinfo deployment:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: podinfo-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: podinfo.localdev.me
  defaultBackend:
    service:
      name: podinfo
      port:
        number: 9898
EOF

Let’s walk through the Ingress config above:

  • We’re using ingress-nginx as our ingress controller, hence the ingress class is set to nginx.
  • The host for this Ingress is podinfo.localdev.me.
  • The wildcard DNS name *.localdev.me resolves to 127.0.0.1, so it can be used for local testing without adding entries to the /etc/hosts file.
  • The podinfo app serves its HTTP API on port 9898, so that is the backend port: traffic arriving for the host podinfo.localdev.me is forwarded to port 9898 of the podinfo service.
  4. Next, port-forward the ingress-nginx service so that you can send traffic from your local terminal.
kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80 > /dev/null &
  • Port 80 on the host is a privileged port, so we avoid it. Instead, we bind port 80 of the ingress-nginx service to port 8080 on the host machine. You can pick any free port of your choice.

Note: If you’re running this in a cloud environment, port-forwarding is usually not required, because the ingress-nginx service is of type LoadBalancer by default and a load balancer is created for it automatically.

  5. Now, run the curl request below against the podinfo endpoint, which should respond with:
> curl http://podinfo.localdev.me:8080

"hostname": "podinfo-59cd496d88-8dcsx"
"message": "greetings from podinfo v6.2.2"
  6. You can also view a prettier version in the browser at http://podinfo.localdev.me:8080/
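
Because ingress-nginx chooses the backend from the HTTP Host header, the same check can be scripted without depending on DNS at all. The following is a minimal sketch, assuming the port-forward above is still listening on 127.0.0.1:8080; the helper names build_ingress_request and fetch are illustrative and not part of any tool used in this post.

```python
import urllib.request


def build_ingress_request(host: str, port: int = 8080, path: str = "/") -> urllib.request.Request:
    """Build a request to the local port-forward that carries the Ingress host."""
    # Connect to the forwarded port on localhost, but present the Ingress
    # hostname in the Host header, which is what the controller routes on.
    url = f"http://127.0.0.1:{port}{path}"
    return urllib.request.Request(url, headers={"Host": host})


def fetch(request: urllib.request.Request, timeout: float = 5.0) -> tuple:
    """Send the request and return (status_code, body). Needs the port-forward running."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

With the port-forward running, fetch(build_ingress_request("podinfo.localdev.me")) should return the same JSON body as the curl command. Changing the host argument to a name that no Ingress rule matches is a quick way to see how the controller handles unmatched hosts.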

Configure Grafana Dashboards for Ingress Nginx monitoring

To access Grafana, open http://grafana.localdev.me:8080/ in your browser and log in with the credentials admin:admin.

Copy the nginx.json from here and paste it into http://grafana.localdev.me:8080/dashboard/import to import the dashboard.

Once imported, the dashboard should look like this:

Alerting over SLI metrics

Now that we have the dashboard and metrics ready in Grafana, it’s time to set up alerting on important SLIs such as error rate and latency.

Generate sample loads

To generate traffic for our podinfo application, we’ll use vegeta as the load testing tool. Please install it from here.

Let’s generate some sample HTTP 4xx traffic. To do that, run the command below, which sends requests at a rate of 10 RPS for 10 minutes.

echo "GET http://podinfo.localdev.me:8080/status/400" | vegeta attack -duration=10m -rate=10/s
  • To test 5xx errors, change the status code in the URL from 400 to 500 and run the command again.

For latency tests, I’ve used the command below, since GET /delay/{seconds} waits for the specified period before responding:

echo "GET http://podinfo.localdev.me:8080/delay/3" | vegeta attack -duration=10m -rate=100/s

Note: You can read more on the endpoints available in podinfo app from here.
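
To exercise the error-rate alerts with a mix of successes and failures rather than 100% errors, vegeta can read several targets from a file passed with -targets and, as far as I know, cycles through them in order. The following is a minimal sketch that builds such a file, assuming podinfo's root path answers with 200 and /status/{code} answers with the given code; the function names and the default 2-in-20 ratio (10% errors) are illustrative choices.

```python
def build_targets(base_url: str, error_code: int,
                  errors_per_cycle: int = 2, cycle_length: int = 20) -> str:
    """Return vegeta targets text with `errors_per_cycle` error requests per cycle."""
    if not 0 <= errors_per_cycle <= cycle_length:
        raise ValueError("errors_per_cycle must be between 0 and cycle_length")
    ok_line = f"GET {base_url}/"
    error_line = f"GET {base_url}/status/{error_code}"
    # One target per line; successful requests first, then the failing ones.
    lines = [ok_line] * (cycle_length - errors_per_cycle) + [error_line] * errors_per_cycle
    return "\n".join(lines) + "\n"


def write_targets(path: str, text: str) -> None:
    """Write the targets text to a file that vegeta can read."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

After write_targets("targets.txt", build_targets("http://podinfo.localdev.me:8080", 500)), the attack can be started with vegeta attack -targets=targets.txt -duration=10m -rate=10/s.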

Grafana Alerting

Recent Grafana versions ship with their own alerting engine, which keeps all configuration, rules, and even firing alerts in one place. Let’s configure alerts for common SLIs.

4xx Error Rate

  1. Let’s create an alert by going to http://grafana.localdev.me:8080/alerting/new
  2. We can use the following formula to get the 4xx error rate as a percentage:

(total number of 4xx requests / total number of requests) * 100

  3. Add the expression below for the query:
(sum(rate(nginx_ingress_controller_requests{status=~'4..'}[1m])) by (ingress) / sum(rate(nginx_ingress_controller_requests[1m])) by (ingress)) * 100 > 5

  4. In Expression B, use the Reduce operation with the function Mean for input A.

  5. In Alert Details, name the alert as you like; I’ve named it Ingress_Nginx_4xx.

  6. For Summary, keep it short and display just the Ingress name using the label {{ $labels.ingress }}.

Ingress High Error Rate : 4xx on *{{ $labels.ingress }}*
  7. For Description, I’ve used printf "%0.2f" to display the percentage value with two decimal places.
4xx : High Error rate : `{{ printf "%0.2f" $values.B.Value }}%` on *{{ $labels.ingress }}*.
  8. Overall, the alert should look similar to the snapshot below:

  9. Finally, you can add a custom label such as severity: critical.
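
The arithmetic behind the alert expression can be checked by hand. Here is a small sketch that applies the same formula to per-second request rates keyed by status code; the function name and the sample numbers are invented for illustration and do not come from a real cluster.

```python
import re


def error_rate_percent(rates_by_status: dict, pattern: str = "4..") -> float:
    """Percentage of the total request rate whose status code matches `pattern`."""
    total = sum(rates_by_status.values())
    if total == 0:
        # No traffic at all: report 0% here (PromQL would produce NaN for 0/0).
        return 0.0
    # Prometheus regex matchers are fully anchored, so use fullmatch.
    matching = sum(rate for status, rate in rates_by_status.items()
                   if re.fullmatch(pattern, status))
    return matching / total * 100


# Hypothetical per-second rates for one ingress over the last minute.
sample_rates = {"200": 90.0, "400": 4.0, "404": 6.0}
percent = error_rate_percent(sample_rates, "4..")
should_fire = percent > 5
```

With these sample rates, 10 of every 100 requests are 4xx responses, so the value is above the 5% threshold and the alert condition holds.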

5xx Error Rate

Similar to the 4xx alert config, the 5xx error rate can be queried with the expression below. Both sides of the division are grouped by the same label so that the series match:

(sum(rate(nginx_ingress_controller_requests{status=~'5..'}[1m])) by (ingress) / sum(rate(nginx_ingress_controller_requests[1m])) by (ingress)) * 100 > 5

Note: I’ve configured the alerts to trigger when the 4xx or 5xx percentage is greater than 5%. You can set the threshold according to your error budget.

High Latency (p95)

To calculate the 95th percentile of request durations over the last 15m, we can use the nginx_ingress_controller_request_duration_seconds_bucket metric.

It records the request processing time in seconds (hence the _seconds suffix), and since it is a histogram exposed as buckets, we can apply the histogram_quantile function to it.

Create an alert in the same way as the examples above and use the query below:

histogram_quantile(0.95,sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[15m])) by (le,ingress)) > 1.5

I’ve set the threshold to 1.5 seconds; adjust it to match your SLO.
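
histogram_quantile estimates a percentile by finding the bucket that contains the target rank and interpolating linearly inside it. The following is a simplified Python sketch of that idea, assuming cumulative bucket counts in the form Prometheus exposes them. It skips several edge cases that the real function handles, and the bucket values are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    # The first bucket is assumed to start at 0.
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts: 50 requests <= 0.5s, 80 <= 1s, 98 <= 2.5s, 100 total.
sample_buckets = [(0.5, 50), (1.0, 80), (2.5, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample_buckets)
```

For these sample buckets the 95th request falls in the 1s to 2.5s bucket, and interpolation gives about 2.25 seconds, which is above the 1.5-second threshold. The estimate can only be as precise as the bucket boundaries allow, so it is worth checking that the configured buckets bracket your latency SLO.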

High request rate

To get the request rate per second (RPS), we can use the below query :

sum(rate(nginx_ingress_controller_requests[5m])) by (ingress) > 2000

The above query will trigger an alert when the request rate is greater than 2000 RPS.
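
rate() turns an ever-increasing counter into a per-second average over the window. The following is a simplified sketch of the idea. It handles counter resets but omits the extrapolation to the window boundaries that Prometheus also applies, and the sample values are invented.

```python
def simple_rate(samples: list) -> float:
    """Per-second increase across (timestamp_seconds, counter_value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero (for example after a
        # pod restart), so the whole current value counts as new increase.
        increase += current - previous if current >= previous else current
    return increase / elapsed


# Hypothetical request counter sampled over a 5-minute window.
window = [(0, 1000), (150, 301000), (300, 601000)]
rps = simple_rate(window)
```

In this sample window the counter grows by 600,000 requests in 300 seconds, which works out to 2000 requests per second, right at the boundary of the threshold used above.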

Other SLIs to consider

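Active connections: The number of connections currently open on the Nginx ingress controller; a sustained climb can point to problems with connection handling. This metric is a gauge reported per controller pod rather than per Ingress, so it is queried directly instead of being wrapped in rate():

sum(nginx_ingress_controller_nginx_process_connections{state="active"}) by (controller_pod)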

Upstream response time: The time it takes for the underlying service to respond to a request. This helps to identify issues with the service itself and not just the ingress.

histogram_quantile(0.95,sum(rate(nginx_ingress_controller_response_duration_seconds_bucket[15m])) by (le,ingress)) 

Slack Alert Template

To make alert messages meaningful, we can use alert templates in Grafana.

  1. To configure them, go to http://grafana.localdev.me:8080/alerting/notifications and create a new template named slack by pasting the code block below:
{{ define "alert_severity_prefix_emoji" -}}
    {{- if ne .Status "firing" -}}
        :white_check_mark:
    {{- else if eq .CommonLabels.severity "critical" -}}
        :fire:
    {{- else if eq .CommonLabels.severity "warning" -}}
        :warning:
    {{- end -}}
{{- end -}}

{{ define "slack.title" -}}
    {{ template "alert_severity_prefix_emoji" . }} {{- .Status | toUpper -}}{{- if eq .Status "firing" }} x {{ .Alerts.Firing | len -}}{{- end }} | {{ .CommonLabels.alertname -}}
{{- end -}}

{{- define "slack.text" -}}
{{- range .Alerts -}}
{{ if gt (len .Annotations) 0 }}
*Summary*: {{ .Annotations.summary}}
*Description*: {{ .Annotations.description }}
Labels: 
{{ range .Labels.SortedPairs }}{{ if or (eq .Name "ingress") (eq .Name "cluster") }}• {{ .Name }}: `{{ .Value }}`
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{ end }}
  2. Configure a new contact point of type Slack. For this, you need to create an incoming webhook in Slack. Refer to this doc for more detailed steps.
  3. Edit the contact point slack, scroll down, and select the option Optional Slack settings.
  4. In the Title, enter the below to specify the template to use:
{{ template "slack.title" . }}
  5. In the Text Body, enter the below and save it:
{{ template "slack.text" . }}
  6. Go to http://grafana.localdev.me:8080/alerting/routes and configure the Default contact point to be Slack.
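
The branching in the slack.title template can be easier to follow outside of Go template syntax. The following Python sketch mirrors only its decision logic (which emoji is chosen, and when the firing count is shown). It is not how Grafana renders templates, the function names are illustrative, and the exact whitespace of the real output may differ.

```python
def severity_emoji(status: str, severity: str) -> str:
    """Pick the prefix emoji the same way the alert_severity_prefix_emoji template does."""
    if status != "firing":
        return ":white_check_mark:"
    if severity == "critical":
        return ":fire:"
    if severity == "warning":
        return ":warning:"
    return ""  # firing, but with no recognised severity label


def slack_title(status: str, severity: str, alertname: str, firing_count: int) -> str:
    """Assemble a title: emoji, upper-cased status, firing count (if firing), alert name."""
    parts = [severity_emoji(status, severity), status.upper()]
    if status == "firing":
        parts.append(f"x {firing_count}")
    parts.append(f"| {alertname}")
    # Skip the emoji slot when it is empty so no stray space is emitted.
    return " ".join(part for part in parts if part)
```

For example, slack_title("firing", "critical", "Ingress_Nginx_4xx", 2) produces a title that starts with the fire emoji and includes the number of firing alerts, while a resolved alert gets the check mark and no count.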

Finally, the alert message arrives!

With all the steps configured, here are snapshots of how the alerts will look in Slack.

4xx Error Rate :

5xx Error Rate :

Latency P95 :

There is plenty of room to adapt this setup to your own requirements. For example, if you have multiple Kubernetes clusters, you can add a cluster label that helps identify the source cluster for each alert.


References

https://grafana.com/docs/grafana/latest/alerting/

https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/

Go microservice template for Kubernetes

https://github.com/stefanprodan/podinfo

The post How to monitor and alert on Nginx ingress in Kubernetes first appeared on Aviator Blog.