EKS & NGINX Load Balancer Monitor with Prometheus, Grafana, and Alerts

Ravindra Singh - Dec 14 - Dev Community

Introduction:

With the growing use of Kubernetes (EKS) and microservices architecture, monitoring your infrastructure has become crucial to ensure that everything runs smoothly.

  • In this blog, we'll walk through setting up monitoring for NGINX running on an Amazon EKS cluster, using Prometheus for metrics collection and Grafana for visualization, and implementing alerting to keep you informed about your system's health.

Why Monitor EKS and NGINX Load Balancers?

  • Early Issue Detection: Proactive monitoring allows you to identify potential problems before they impact your users, ensuring a seamless experience.

  • Performance Optimization: Monitoring key metrics helps you understand how your EKS cluster and NGINX load balancer are performing, allowing you to fine-tune your configuration for optimal efficiency.

  • Troubleshooting: When issues arise, having detailed metrics and logs readily available makes troubleshooting faster and more effective.

  • Capacity Planning: Analyzing historical data helps you predict future resource needs, ensuring your infrastructure scales seamlessly with your application.

EKS Cluster Metrics

  • Node resource utilization (CPU, memory, disk)
  • Pod health and status
  • Control plane health

NGINX Ingress Controller Metrics

  • Request rate and latency
  • Error rate
  • Upstream server health
  • Connection counts

Example Alerting Scenarios

  • High NGINX Error Rate: Alert when the NGINX error rate exceeds a certain threshold, indicating potential issues with upstream servers or configuration.

  • Node Resource Exhaustion: Alert when CPU or memory utilization on any node approaches critical levels, allowing you to proactively scale your cluster.

  • Pod Failures: Alert when pods repeatedly fail to start or crash, signaling potential application or configuration problems.

Let's Begin😎

🚀 Step-by-Step Guide to install Prometheus, Grafana, and Alertmanager
1️⃣ A running Kubernetes cluster: This can be a self-managed cluster or a managed service like Amazon EKS.

Refer to the video below to create the EKS cluster in AWS.
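If you prefer the command line over the video, a minimal sketch with eksctl looks roughly like this (the cluster name, region, and node settings below are illustrative placeholders, not values from the lesson):

# Minimal sketch: create an EKS cluster with a managed node group (placeholder values)
eksctl create cluster \
  --name demo-eks \
  --region ap-south-1 \
  --nodegroup-name default-ng \
  --node-type t3.medium \
  --nodes 2

# Confirm kubectl now points at the new cluster
kubectl get nodes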

2️⃣ NGINX Ingress on AWS EKS and Deploying Sample Applications

Refer to the video below for the setup in AWS.
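As a command-line alternative, the NGINX Ingress controller can be installed with Helm roughly as follows (the release name and namespace are assumptions; nginx.yaml is the values file from Lesson 2, also shown later in this post):

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Install the controller with the Lesson 2 values file
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  -f nginx.yaml

# The controller Service should receive an AWS NLB hostname
kubectl get svc -n ingress-nginx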

3️⃣ Clone the Repository

git clone https://github.com/ravindrasinghh/Kubernetes-Playlist.git
cd Kubernetes-Playlist/Lesson1/

4️⃣ Add the file below to install Prometheus, Grafana, and Alertmanager using Helm
👉🏻 prometheus.tf

resource "helm_release" "kube-prometheus-stack" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  version          = "56.21.3"
  namespace        = "monitoring"
  create_namespace = true

  values = [
    file("./prometheus.yaml")
  ]
}

👉🏻 prometheus.yaml

alertmanager:
  enabled: true
  alertmanagerSpec:
    retention: 24h #This setting specifies the time duration for which Alertmanager will retain alert data. 
    replicas: 1
    resources:
      limits:
        cpu: 600m
        memory: 1024Mi
      requests:
        cpu: 200m
        memory: 356Mi
  config:
    global:
      resolve_timeout: 1s # Time after which an alert is automatically marked as resolved if the alert source stops re-sending it. Set very low here; the Alertmanager default is 5m.
    route:
      group_wait: 20s
      group_interval: 1m
      repeat_interval: 30m
      receiver: "null"
      routes:
      - match:
          alertname: Watchdog
        receiver: "null"
      - match:
          severity: warning
        receiver: "slack-alerts"
        continue: true
      - match:
          severity: critical
        receiver: "slack-alerts"
        continue: true
    receivers:
      - name: "null"
      - name: "slack-alerts"
        slack_configs:
        - api_url: 'https://hooks.slack.com/services/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'
          channel: '#prometheus-alerts'
          send_resolved: true
          title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] Production Monitoring Event Notification'
          text: >-
            {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
              *Description:* {{ .Annotations.description }}
              *Details:*
              {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
              {{ end }}
            {{ end }}
    templates:
    - "/etc/alertmanager/config/*.tmpl"
additionalPrometheusRulesMap:
 custom-rules:
  groups:
  - name: NginxIngressController
    rules:
    - alert: NginxHighHttp4xxErrorRate
      annotations:
        summary: "High rate of HTTP 4xx errors (instance {{ $labels.ingress }})"
        description: "Too many HTTP requests with status 4xx (> 20 per second) in the last 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      expr: nginx_ingress_controller_requests{status="404", ingress="photoapp"} > 5
      for: 5m
      labels:
        severity: critical      
  - name: Nodes
    rules:
    - alert: KubernetesNodeReady
      expr: sum(kube_node_status_condition{condition="Ready", status="false"}) by (node) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Kubernetes node not ready (instance {{ $labels.instance }})
        description: "Node {{ $labels.node }} has been NotReady for more than a minute\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      # New alert for node deletion
    - alert: InstanceDown
      expr: up == 0
      labels:
        severity: critical
      annotations:
        summary: Kubernetes Node Deleted (instance {{ $labels.instance }})
        description: "Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"     
  - name: Pods 
    rules: 
      - alert: ContainerRestarted
        annotations: 
          summary: Container named {{$labels.container}} in {{$labels.pod}} in {{$labels.namespace}} was restarted 
          description: "\nCluster Name: {{$externalLabels.cluster}}\nNamespace: {{$labels.namespace}}\nPod name: {{$labels.pod}}\nContainer name: {{$labels.container}}\n" 
        expr: | 
          sum(increase(kube_pod_container_status_restarts_total{namespace!="kube-system",pod_template_hash=""}[1m])) by (pod,namespace,container) > 0 
        for: 0m 
        labels: 
          severity: critical            
prometheus:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - prometheus.codedevops.cloud
    paths:  
      - /
  prometheusSpec:
    retention: 48h
    replicas: 2
    resources:
      limits:
        cpu: 800m
        memory: 2000Mi
      requests:
        cpu: 100m
        memory: 200Mi
grafana:
  enabled: true
  adminPassword: admin@123
  replicas: 1
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.codedevops.cloud
    path: /

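With prometheus.tf and prometheus.yaml in place, apply the Terraform (this assumes your helm and kubernetes providers are already configured for the cluster, as in the repository):

terraform init
terraform plan
terraform apply -auto-approve

# Confirm the Helm release and the monitoring namespace
helm list -n monitoring
kubectl get all -n monitoring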

You can check that the Prometheus, Grafana, and Alertmanager pods are running, and inspect their logs, to verify that the setup was installed successfully.
👉🏻 kubectl get pods -n monitoring
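To look at the logs themselves, commands along these lines work; the workload names assume the default kube-prometheus-stack naming, so adjust them if your release name differs:

# Workload names below assume the default kube-prometheus-stack naming
kubectl logs -n monitoring statefulset/prometheus-kube-prometheus-stack-prometheus -c prometheus
kubectl logs -n monitoring statefulset/alertmanager-kube-prometheus-stack-alertmanager -c alertmanager
kubectl logs -n monitoring deployment/kube-prometheus-stack-grafana -c grafana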


👉🏻 Let's create a record in Route 53 to access Prometheus and Grafana via a custom domain.

  1. Go to the Route 53 service, select your hosted zone, and click Create Record.
  2. Choose Alias, select Alias to Network Load Balancer, pick the region and the load balancer created for the ingress controller, and click Create records (a CLI alternative is sketched below).
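If you prefer the AWS CLI, the alias record can be created roughly as follows; the hosted zone ID, the NLB DNS name, and the NLB's canonical hosted zone ID are placeholders you need to look up first, and the Service name assumes the default ingress-nginx chart naming:

# Find the NLB hostname created for the ingress-nginx controller Service
kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

# Placeholders: Z_HOSTED_ZONE_ID (your Route 53 zone), Z_NLB_ZONE_ID (the NLB's
# canonical hosted zone ID), NLB_DNS_NAME (the hostname printed above)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z_HOSTED_ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "prometheus.codedevops.cloud",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z_NLB_ZONE_ID",
          "DNSName": "NLB_DNS_NAME",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'

Repeat the same command with Name set to grafana.codedevops.cloud for the Grafana record.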


👉🏻 Once the Ingress is configured, you can access the Prometheus web interface by navigating to https://prometheus.codedevops.cloud.


👉🏻 Once the Ingress is configured, you can access the Grafana web interface by navigating to https://grafana.codedevops.cloud.


To log in:

  • Get the initial password for the admin user:
kubectl get secret kube-prometheus-stack-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 --decode

🚀 Step-by-Step Guide to integrate Slack with Prometheus alerts
Step 1: Create a Slack Incoming Webhook

Go to Slack Apps: open your Slack workspace and visit the Slack API Apps page.

👉🏻 Click on Add Apps at the bottom.


👉🏻 It will open this page, where you need to select Incoming Webhook.


👉🏻 Click on it and then click on Configuration.


👉🏻 Click on Add to Slack.


👉🏻 Choose a new or existing channel, then click Add Incoming WebHooks Integration. After that, copy the webhook URL and set it as the api_url in your Prometheus Helm chart values file (prometheus.yaml above).


👉🏻 Click Save Settings and test the configuration.
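Before relying on Alertmanager, it's worth confirming the webhook itself works with a plain curl (replace the URL with your own webhook):

# Send a test message to the channel behind the webhook
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Test alert from the Prometheus/Alertmanager setup"}' \
  https://hooks.slack.com/services/XXXX/YYYY/ZZZZ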

🚀 Step-by-Step Guide to monitor the NGINX Ingress load balancer

👉🏻 Here is the NGINX Ingress values file. You can refer to Lesson 2 or the previous video for more details.
https://github.com/ravindrasinghh/Kubernetes-Playlist/blob/master/Lesson2/nginx.yaml

controller:
  replicaCount: 2
  minAvailable: 1
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi
  autoscaling:
    enabled: true
    annotations: {}
    minReplicas: 2
    maxReplicas: 6
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 60
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  metrics:
    port: 10254
    enabled: true
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: "kube-prometheus-stack"
    service:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10254"
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
      service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: 'true'
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports: https
      service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: 'ELBSecurityPolicy-TLS13-1-2-2021-06'
      service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
    targetPorts:
      http: http
      https: http



In simpler terms:
This configuration tells the Ingress Controller to:

  1. Expose its internal metrics on port 10254.
  2. Create a ServiceMonitor object so that Prometheus can automatically find this metrics endpoint.
  3. Add annotations to the Ingress Controller's Service so Prometheus knows to scrape it.

Annotations:

  1. prometheus.io/scrape: "true": tells Prometheus that it should scrape metrics from this service.
  2. prometheus.io/port: "10254": specifies the port Prometheus should use to scrape this service, matching the metrics.port setting above.
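To confirm Prometheus actually discovered the controller, check that the ServiceMonitor exists and that the metrics endpoint answers on port 10254 (the Deployment and namespace names below assume the default ingress-nginx chart naming):

# The ServiceMonitor created by the chart (name/namespace depend on your install)
kubectl get servicemonitor -A | grep ingress-nginx

# Port-forward the controller and inspect the raw metrics
kubectl -n ingress-nginx port-forward deploy/ingress-nginx-controller 10254:10254 &
curl -s http://localhost:10254/metrics | grep nginx_ingress_controller_requests | head

The same target should also appear under Status → Targets in the Prometheus UI.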

🚀 Step-by-Step Guide to set up alerts for the NGINX Ingress load balancer
👉🏻 Add the configuration below inside additionalPrometheusRulesMap:

additionalPrometheusRulesMap:
 custom-rules:
  groups:
  - name: NginxIngressController
    rules:
    - alert: NginxHighHttp4xxErrorRate
      annotations:
        summary: "High rate of HTTP 4xx errors (instance {{ $labels.ingress }})"
        description: "Too many HTTP requests with status 4xx (> 20 per second) in the last 5 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      expr: nginx_ingress_controller_requests{status="404", ingress="photoapp"} > 5
      for: 5m
      labels:
        severity: critical 

👉🏻 Here is a valid URL:
https://myapp.codedevops.cloud/ping
👁️‍🗨️ Now try accessing the incorrect URL https://myapp.codedevops.cloud/pingtesting. You will see two alerts in the firing state, and they will also appear in Slack.
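To push the 404 count for the photoapp ingress over the threshold quickly, you can fire a short burst of requests at the bad path (the alert still waits for its for: 5m window before firing):

# Generate a burst of 404s against a path that doesn't exist
for i in $(seq 1 50); do
  curl -s -o /dev/null -w "%{http_code}\n" https://myapp.codedevops.cloud/pingtesting
done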


👉🏻 Error notification in Slack


👉🏻 Resolved notification in Slack


🚀 Step-by-Step Guide to monitor the Kubernetes cluster, nodes, and pods

The metrics-server is required to collect resource metrics like CPU and memory usage for Kubernetes nodes and pods. It provides the data behind kubectl top and enables the Horizontal Pod Autoscaler.

Add the file below to install the metrics-server using Helm.
👉🏻 metrics-server.tf

resource "helm_release" "metrics_server" {
  name       = "metrics-server"
  repository = "https://kubernetes-sigs.github.io/metrics-server/"
  chart      = "metrics-server"
  version    = "3.12.0"
  namespace  = "kube-system"
}
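After applying the Terraform again, resource metrics should start flowing within a minute or so and kubectl top should return data:

# Verify the metrics API is serving node and pod metrics
kubectl top nodes
kubectl top pods -A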

🚀 Step-by-Step Guide to set up alerts for nodes and pods
Node is not ready or terminating:

additionalPrometheusRulesMap:
 custom-rules:
  groups:  
  - name: Nodes
    rules:
    - alert: KubernetesNodeReady
      expr: sum(kube_node_status_condition{condition="Ready", status="false"}) by (node) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Kubernetes node not ready (instance {{ $labels.instance }})
        description: "Node {{ $labels.node }} has been NotReady for more than a minute\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      # New alert for node deletion
    - alert: InstanceDown
      expr: up == 0
      labels:
        severity: critical
      annotations:
        summary: Kubernetes Node Deleted (instance {{ $labels.instance }})
        description: "Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}" 
Pod is restarting:
additionalPrometheusRulesMap:
 custom-rules:
  groups:
  - name: Pods 
    rules: 
    - alert: ContainerRestarted
      annotations: 
        summary: Container named {{$labels.container}} in {{$labels.pod}} in {{$labels.namespace}} was restarted 
        description: "\nCluster Name: {{$externalLabels.cluster}}\nNamespace: {{$labels.namespace}}\nPod name: {{$labels.pod}}\nContainer name: {{$labels.container}}\n" 
      expr: | 
        sum(increase(kube_pod_container_status_restarts_total{namespace!="kube-system",pod_template_hash=""}[1m])) by (pod,namespace,container) > 0 
      for: 0m 
      labels: 
       severity: critical
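A quick way to exercise the ContainerRestarted rule is a throwaway pod that exits immediately; with the default restart policy it keeps restarting, which increments kube_pod_container_status_restarts_total and fires the alert (the pod name here is arbitrary, and remember to delete it afterwards):

# This pod exits immediately and is restarted repeatedly (CrashLoopBackOff)
kubectl run crash-test --image=busybox --restart=Always -- sh -c "exit 1"

# Watch the restart counter climb, then clean up
kubectl get pod crash-test -w
kubectl delete pod crash-test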

👉🏻 Error notification in Slack


👉🏻 Resolved notification in Slack


🚀 Step-by-Step Guide to set up dashboards in Grafana
  1. Cluster Monitoring: https://github.com/ravindrasinghh/Kubernetes-Playlist/blob/master/Lesson5/cluster_disk_monitoring.json


  2. Prometheus Alert: https://github.com/ravindrasinghh/Kubernetes-Playlist/blob/master/Lesson5/promethues_alert.json
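The dashboards can be imported through the Grafana UI (Import, then upload the JSON file) or pushed through Grafana's HTTP API; a rough sketch below, assuming the admin password from the Helm values and that the exported JSON has no unresolved __inputs (jq is used to set id to null and wrap the payload):

# Download the dashboard JSON from the repository
curl -sLO https://raw.githubusercontent.com/ravindrasinghh/Kubernetes-Playlist/master/Lesson5/cluster_disk_monitoring.json

# Wrap it in the payload expected by /api/dashboards/db and push it
jq '{dashboard: (. + {id: null}), overwrite: true, folderId: 0}' cluster_disk_monitoring.json \
  | curl -s -X POST -u admin:admin@123 \
      -H 'Content-Type: application/json' \
      -d @- https://grafana.codedevops.cloud/api/dashboards/db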

Troubleshooting
If you encounter any issues, refer to the Prometheus documentation or raise an issue in the source repository linked below.
🏴‍☠️ source link: https://github.com/ravindrasinghh/Kubernetes-Playlist/tree/master/Lesson5

If you prefer a video tutorial, refer to the accompanying video walkthrough of setting up EKS & NGINX load balancer monitoring with Prometheus, Grafana, and alerts.
