Introduction:
With the growing adoption of Kubernetes (particularly Amazon EKS) and microservices architectures, monitoring your infrastructure has become crucial to ensure that everything runs smoothly.
In this blog, we'll walk through setting up monitoring for NGINX running on an Amazon EKS cluster, using Prometheus and Grafana for metrics collection and visualization, and implementing alerting to keep you informed about your system's health.
Why Monitor EKS and NGINX Load Balancers?
Early Issue Detection: Proactive monitoring allows you to identify potential problems before they impact your users, ensuring a seamless experience.
Performance Optimization: Monitoring key metrics helps you understand how your EKS cluster and NGINX load balancer are performing, allowing you to fine-tune your configuration for optimal efficiency.
Troubleshooting: When issues arise, having detailed metrics and logs readily available makes troubleshooting faster and more effective.
Capacity Planning: Analyzing historical data helps you predict future resource needs, ensuring your infrastructure scales seamlessly with your application.
EKS Cluster Metrics
- Node resource utilization (CPU, memory, disk)
- Pod health and status
- Control plane health
NGINX Ingress Controller Metrics
- Request rate and latency
- Error rate
- Upstream server health
- Connection counts
Example Alerting Scenarios
High NGINX Error Rate: Alert when the NGINX error rate exceeds a certain threshold, indicating potential issues with upstream servers or configuration.
Node Resource Exhaustion: Alert when CPU or memory utilization on any node approaches critical levels, allowing you to proactively scale your cluster.
Pod Failures: Alert when pods repeatedly fail to start or crash, signaling potential application or configuration problems.
Let's Begin😎
🚀 Step-by-Step Guide to Install Prometheus, Grafana, and Alertmanager
1️⃣ A running Kubernetes cluster: This can be a self-managed cluster or a managed service like Amazon EKS.
Refer to the video below to create the EKS cluster in AWS.
2️⃣ NGINX Ingress on AWS EKS and Deploying Sample Applications
Refer to the video below to set this up in AWS.
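If you prefer the command line over the video, here is a minimal Helm sketch for installing the NGINX Ingress Controller; the release name and namespace are illustrative, and the detailed values file is covered later in this post.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace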
3️⃣ Clone the Repository
🧑🏻💻 git clone https://github.com/ravindrasinghh/Kubernetes-Playlist.git
👨🏻💻 cd Kubernetes-Playlist/Lesson1/
4️⃣ Add the two files below to install Prometheus, Grafana, and Alertmanager using Helm, then apply them with Terraform (the commands follow the second file).
👉🏻 prometheus.tf
resource "helm_release" "kube-prometheus-stack" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  version          = "56.21.3"
  namespace        = "monitoring"
  create_namespace = true
  values = [
    file("./prometheus.yaml")
  ]
}
👉🏻 prometheus.yaml
alertmanager:
  enabled: true
  alertmanagerSpec:
    retention: 24h # The time duration for which Alertmanager will retain alert data.
    replicas: 1
    resources:
      limits:
        cpu: 600m
        memory: 1024Mi
      requests:
        cpu: 200m
        memory: 356Mi
  config:
    global:
      resolve_timeout: 5m # If an alert is not explicitly resolved by its source within this window, Alertmanager automatically marks it as resolved.
    route:
      group_wait: 20s
      group_interval: 1m
      repeat_interval: 30m
      receiver: "null"
      routes:
        - match:
            alertname: Watchdog
          receiver: "null"
        - match:
            severity: warning
          receiver: "slack-alerts"
          continue: true
        - match:
            severity: critical
          receiver: "slack-alerts"
          continue: true
    receivers:
      - name: "null"
      - name: "slack-alerts"
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX' # replace with your Slack Incoming Webhook URL
            channel: '#prometheus-alerts'
            send_resolved: true
            title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] Production Monitoring Event Notification'
            text: >-
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
              *Description:* {{ .Annotations.description }}
              *Details:*
              {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
              {{ end }}
              {{ end }}
    templates:
      - "/etc/alertmanager/config/*.tmpl"
additionalPrometheusRulesMap:
  custom-rules:
    groups:
      - name: NginxIngressController
        rules:
          - alert: NginxHighHttp4xxErrorRate
            annotations:
              summary: "High rate of HTTP 4xx errors (ingress {{ $labels.ingress }})"
              description: "The photoapp ingress has recorded more than 5 HTTP 404 responses (condition held for 5 minutes)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
            expr: nginx_ingress_controller_requests{status="404", ingress="photoapp"} > 5
            for: 5m
            labels:
              severity: critical
      - name: Nodes
        rules:
          - alert: KubernetesNodeReady
            expr: sum(kube_node_status_condition{condition="Ready", status="false"}) by (node) > 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes node not ready (instance {{ $labels.instance }})
              description: "Node {{ $labels.node }} has been NotReady for more than a minute\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
          # New alert for node deletion
          - alert: InstanceDown
            expr: up == 0
            labels:
              severity: critical
            annotations:
              summary: Kubernetes Node Deleted (instance {{ $labels.instance }})
              description: "Target {{ $labels.instance }} is down (up == 0); the node may have been terminated or deleted\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - name: Pods
        rules:
          - alert: ContainerRestarted
            annotations:
              summary: Container {{$labels.container}} in pod {{$labels.pod}} in namespace {{$labels.namespace}} was restarted
              description: "\nCluster Name: {{$externalLabels.cluster}}\nNamespace: {{$labels.namespace}}\nPod name: {{$labels.pod}}\nContainer name: {{$labels.container}}\n"
            expr: |
              sum(increase(kube_pod_container_status_restarts_total{namespace!="kube-system",pod_template_hash=""}[1m])) by (pod,namespace,container) > 0
            for: 0m
            labels:
              severity: critical
prometheus:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - prometheus.codedevops.cloud
    paths:
      - /
  prometheusSpec:
    retention: 48h
    replicas: 2
    resources:
      limits:
        cpu: 800m
        memory: 2000Mi
      requests:
        cpu: 100m
        memory: 200Mi
grafana:
  enabled: true
  adminPassword: admin@123
  replicas: 1
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.codedevops.cloud
    path: /
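With both files in place, initialize and apply the Terraform configuration. This assumes the cloned Lesson directory already contains the provider configuration (AWS, kubernetes, helm) pointing at your EKS cluster.
terraform init
terraform plan
terraform apply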
You can also check the logs of the Prometheus, Grafana, and Alertmanager pods to verify that the setup has been successfully installed.
👉🏻 kubectl get pods -n monitoring
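For example, to check the logs of the main components (the resource names below follow the chart's default naming and may differ in your setup):
kubectl logs statefulset/prometheus-kube-prometheus-stack-prometheus -n monitoring
kubectl logs statefulset/alertmanager-kube-prometheus-stack-alertmanager -n monitoring
kubectl logs deployment/kube-prometheus-stack-grafana -n monitoring -c grafana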
👉🏻 Let's create a record in Route 53 to access Prometheus and Grafana via a custom domain.
- Go to the Route 53 service, select the hosted zone, and click Create Record.
- Choose Alias, then select the region and the Load Balancer ARN, and click Create.
👉🏻 Once the Ingress is configured, you can access the Prometheus web interface at https://prometheus.codedevops.cloud.
👉🏻 Once the Ingress is configured, you can access the Grafana web interface at https://grafana.codedevops.cloud.
To log in:
- Get the initial password for the admin user:
kubectl get secret kube-prometheus-stack-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 --decode
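If the Route 53 records are not in place yet, you can also reach Grafana through a port-forward; the service name below follows the chart's default naming and may differ in your setup.
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
# then open http://localhost:3000 and log in as admin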
🚀 Step-by-Step Guide to Integrate Slack with Prometheus Alerts
Step 1: Create a Slack Incoming Webhook
Go to Slack Apps: open your Slack workspace and visit the Slack API Apps page.
👉🏻 Click on Add Apps located at the bottom.
👉🏻 It will open this page, where you need to select Incoming Webhook.
👉🏻 Click on it, then click Configuration.
👉🏻 Click Add to Slack.
👉🏻 Choose a new or existing channel, then click Add Incoming Webhooks Integration. After that, copy the Webhook URL and configure it in your Prometheus Helm chart values file (prometheus.yaml).
👉🏻 Click Save Settings and test the configuration.
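Before relying on Alertmanager to deliver alerts, you can sanity-check the webhook itself with a quick curl; the URL below is a placeholder for your own webhook.
# Send a test message straight to the Slack channel
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Test message: Slack webhook is working"}' \
  https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX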
🚀 Step-by-Step Guide to Monitor the NGINX Ingress Load Balancer
👉🏻 Here is the NGINX Ingress YAML file. You can refer to Lesson 2 or the previous video for more details.
https://github.com/ravindrasinghh/Kubernetes-Playlist/blob/master/Lesson2/nginx.yaml
controller:
  replicaCount: 2
  minAvailable: 1
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 256Mi
  autoscaling:
    enabled: true
    annotations: {}
    minReplicas: 2
    maxReplicas: 6
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 60
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  metrics:
    port: 10254
    enabled: true
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: "kube-prometheus-stack"
    service:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10254"
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
      service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: 'true'
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports: https
      service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: 'ELBSecurityPolicy-TLS13-1-2-2021-06'
      service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
    targetPorts:
      http: http
      https: http
In simpler terms:
This configuration tells the Ingress Controller to:
- Expose its internal metrics on port 10254.
- Create a ServiceMonitor object so that Prometheus can automatically find this metrics endpoint.
- Add annotations to the Ingress Controller's Service so Prometheus knows to scrape it.
Annotations:
- prometheus.io/scrape: "true": This annotation tells Prometheus that it should scrape the metrics from this service.
- prometheus.io/port: "10254": Specifies the port that Prometheus should use to scrape metrics from this service, matching the metrics.port setting above (see the quick check below).
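To confirm the controller is actually exposing these metrics before Prometheus scrapes them, you can port-forward to the controller and query the endpoint directly. This is a sketch; the namespace and deployment name depend on how the ingress-nginx chart was installed.
# In one terminal: forward the metrics port locally
kubectl -n ingress-nginx port-forward deploy/ingress-nginx-controller 10254:10254
# In another terminal: look for request metrics
curl -s http://localhost:10254/metrics | grep nginx_ingress_controller_requests
# Confirm the ServiceMonitor created by the chart exists
kubectl get servicemonitor -A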
🚀 Step-by-Step Guide to Set Up Alerts for the NGINX Ingress Load Balancer
👉🏻 Add the configuration below under additionalPrometheusRulesMap:
additionalPrometheusRulesMap:
  custom-rules:
    groups:
      - name: NginxIngressController
        rules:
          - alert: NginxHighHttp4xxErrorRate
            annotations:
              summary: "High rate of HTTP 4xx errors (ingress {{ $labels.ingress }})"
              description: "The photoapp ingress has recorded more than 5 HTTP 404 responses (condition held for 5 minutes)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
            expr: nginx_ingress_controller_requests{status="404", ingress="photoapp"} > 5
            for: 5m
            labels:
              severity: critical
👉🏻 Here is the correct URL: https://myapp.codedevops.cloud/ping
👁️🗨️ Try accessing the incorrect URL https://myapp.codedevops.cloud/pingtesting. You will see two alerts in the firing state, and the alerts will also appear in Slack.
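If you want to trigger the rule deliberately, generate a burst of 404s against the sample app; this is a sketch, with myapp.codedevops.cloud being the sample application from the earlier lesson.
# Send 30 requests to a path that does not exist and print the status codes
for i in $(seq 1 30); do
  curl -s -o /dev/null -w "%{http_code}\n" https://myapp.codedevops.cloud/pingtesting
done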
👉🏻 Error notification in Slack
👉🏻 Resolved notification in Slack
🚀 Step-by-Step Guide to Monitor the Kubernetes Cluster, Nodes, and Pods
The metrics-server is required to collect resource metrics such as CPU and memory usage for Kubernetes nodes and pods. It provides the data behind kubectl top and enables features like the Horizontal Pod Autoscaler.
Add the file below to install metrics-server using Helm (a quick verification follows the snippet).
👉🏻 metrics-server.tf
resource "helm_release" "metrics_server" {
  name       = "metrics-server"
  repository = "https://kubernetes-sigs.github.io/metrics-server/"
  chart      = "metrics-server"
  version    = "3.12.0"
  namespace  = "kube-system"
}
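After applying the Terraform, give the deployment a minute and confirm that resource metrics are available. This is a sketch; the deployment name follows the chart's default naming.
kubectl -n kube-system rollout status deployment/metrics-server
kubectl top nodes
kubectl top pods -n monitoring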
🚀 Step-by-Step Guide to Set Up Alerts for Nodes and Pods
👉🏻 Node is NotReady or has been terminated:
additionalPrometheusRulesMap:
  custom-rules:
    groups:
      - name: Nodes
        rules:
          - alert: KubernetesNodeReady
            expr: sum(kube_node_status_condition{condition="Ready", status="false"}) by (node) > 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: Kubernetes node not ready (instance {{ $labels.instance }})
              description: "Node {{ $labels.node }} has been NotReady for more than a minute\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
          # New alert for node deletion
          - alert: InstanceDown
            expr: up == 0
            labels:
              severity: critical
            annotations:
              summary: Kubernetes Node Deleted (instance {{ $labels.instance }})
              description: "Target {{ $labels.instance }} is down (up == 0); the node may have been terminated or deleted\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
👉🏻 Pod is restarting (you can simulate this with the test shown after the rule):
additionalPrometheusRulesMap:
  custom-rules:
    groups:
      - name: Pods
        rules:
          - alert: ContainerRestarted
            annotations:
              summary: Container {{$labels.container}} in pod {{$labels.pod}} in namespace {{$labels.namespace}} was restarted
              description: "\nCluster Name: {{$externalLabels.cluster}}\nNamespace: {{$labels.namespace}}\nPod name: {{$labels.pod}}\nContainer name: {{$labels.container}}\n"
            expr: |
              sum(increase(kube_pod_container_status_restarts_total{namespace!="kube-system",pod_template_hash=""}[1m])) by (pod,namespace,container) > 0
            for: 0m
            labels:
              severity: critical
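To see this alert fire, you can force a container restart in a throwaway pod. This is a sketch; the pod name and image are arbitrary, and since the rule above ignores kube-system, the default namespace works.
# Start a test pod and wait for it to become Ready
kubectl run restart-test --image=nginx
kubectl wait --for=condition=Ready pod/restart-test --timeout=60s
# Kill PID 1 inside the container; the kubelet restarts it and the restart counter increments
kubectl exec restart-test -- /bin/sh -c "kill 1"
# RESTARTS should now show 1, and the ContainerRestarted alert should reach Slack shortly
kubectl get pod restart-test
# Clean up
kubectl delete pod restart-test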
👉🏻 Error notification in Slack
👉🏻 Resolved notification in Slack
🚀 Step-by-Step Guide to Set Up Dashboards in Grafana
- Cluster Monitoring: https://github.com/ravindrasinghh/Kubernetes-Playlist/blob/master/Lesson5/cluster_disk_monitoring.json
- Prometheus Alerts: https://github.com/ravindrasinghh/Kubernetes-Playlist/blob/master/Lesson5/promethues_alert.json
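You can import these through Grafana's Import option in the UI, or push them with Grafana's HTTP API. Here is a sketch, assuming the repository is cloned locally, jq is installed, and the admin password comes from the secret shown earlier.
GRAFANA_PASSWORD=$(kubectl get secret kube-prometheus-stack-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 --decode)
# Wrap the dashboard JSON in the import payload and POST it to Grafana
jq '{dashboard: (del(.id)), overwrite: true, folderId: 0}' Kubernetes-Playlist/Lesson5/cluster_disk_monitoring.json \
  | curl -s -X POST -u "admin:${GRAFANA_PASSWORD}" \
      -H "Content-Type: application/json" \
      -d @- https://grafana.codedevops.cloud/api/dashboards/db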
Troubleshooting
If you encounter any issues, refer to the Prometheus documentation or raise an issue in this repository.
🏴☠️ source link: https://github.com/ravindrasinghh/Kubernetes-Playlist/tree/master/Lesson5
If you prefer a video tutorial, the walkthrough below guides you through setting up EKS and NGINX load balancer monitoring with Prometheus, Grafana, and alerts.