Using Grafana & Prometheus

Nikolai Main - Nov 2

Seeing as Grafana and Prometheus go hand in hand, I'll include them in the same post and keep each section relatively short. I'll provide a brief overview of each tool along with some basic getting started instructions. The finer details can, of course, be found in their respective documentation. Additionally, I'll briefly cover setting up alerts with Grafana and Slack.



Prometheus

Prometheus is a monitoring and alerting tool that provides a wealth of insights into almost any system. It records time-series data: simply put, a series of changes to a given metric over a given time period. Common measurements include request times, active connections, resource usage, etc.

Prometheus' main components are:

  • Prometheus Server: Responsible for 'scraping' and storing time-series data.
  • Client Libraries: Used to instrument application code.
  • Alertmanager: Handles alerts raised by Prometheus, routing them to receivers like email or Slack.

Prometheus supports both machine-based metrics and dynamic, service-oriented architectures. Since each Prometheus server is standalone, you can rely on it to diagnose issues with your other systems even when they're experiencing outages.

Using Prometheus

Collecting metrics is done via a process called 'scraping'. A metric source, or 'instance', usually corresponds to a single process. Once you link an instance to your Prometheus server, it will begin collecting data based on a configuration you define in a YAML file.

An example Prometheus configuration could look like:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - rules.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node_exporter
    static_configs:
      - targets: ["localhost:9100"]
```

Notes:

  • Global config
    • Contains settings like scrape_interval, evaluation_interval, etc.
  • Alerting
    • States where alerts are sent for further handling (here, an Alertmanager instance on port 9093).
  • Rule files
    • Files containing the metric thresholds and conditions that trigger alerts (an example follows below).
  • Scrape configs
    • A list of every location from which metrics are gathered.

There are several other configuration settings that can be defined; however, I won't cover them in this post.
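For reference, here's a minimal sketch of what a rules file like rules.yml might contain. The alert name, condition, and labels are illustrative assumptions, not values from a real setup:

```yaml
groups:
  - name: example-alerts
    rules:
      # Fire if a scrape target has been unreachable for 5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```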

Metric Types

Prometheus supports four different metric types (there's a short code sketch covering all four after this list):

  • Counter: A single value that can only increase or be reset to zero. A simple metric, useful for counting things like network requests.
  • Gauge: A single value that can both increase and decrease. Commonly used for values like the number of pods in a Kubernetes cluster or messages in an SQS queue.
  • Histogram: Collects data in several cumulative 'buckets' defined by the user. Each value recorded into the histogram increments every bucket whose upper bound is greater than or equal to that value. For example, say you have the buckets 0.1, 0.5, 1 and 5. If a value of 0.3 is recorded, the 0.5, 1 and 5 buckets are all incremented. From here you can deduce the distribution of values by subtracting the count of the previous bucket from that of the current one:
    • 0.1 has a count of 0 and no preceding bucket, therefore it holds no recorded values.
    • 0.5 has a count of 1 and the preceding bucket has a count of 0, a difference of 1.
    • 1 has a count of 1 and the preceding bucket a count of 1, a difference of 0.
    • 5 has a count of 1 and the preceding bucket also 1, a difference of 0. As such, 100% of all values fall at or below 0.5.

Now an example with a single value doesn't really demonstrate a histogram's utility, but take an organization required to serve 95% of all network requests within 300ms. You could set up a bucket with an upper bound of 0.3 seconds and an alert that triggers if the fraction of requests falling within that bucket drops below 0.95, enabling you to investigate the issue and notify the relevant parties.

  • Summary: Similar in function to a histogram, but data isn't collected in buckets; instead, quantiles are calculated and exposed directly by the instrumented client rather than by the Prometheus server.
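To make these concrete, here's a minimal sketch using the official Python client library, prometheus_client. The metric names and port are placeholder assumptions of mine; the histogram reuses the 0.1/0.5/1/5 buckets from the example above:

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: only ever goes up (or resets to zero on process restart).
requests_total = Counter("app_requests_total", "Total HTTP requests handled")

# Gauge: can go up and down, e.g. items currently in a queue.
queue_depth = Gauge("app_queue_depth", "Items currently waiting in the queue")

# Histogram: cumulative buckets, matching the 0.1/0.5/1/5 example above.
request_latency = Histogram(
    "app_request_duration_seconds",
    "Request duration in seconds",
    buckets=(0.1, 0.5, 1, 5),
)

# Summary: quantiles are computed client-side rather than in buckets.
payload_size = Summary("app_payload_bytes", "Size of request payloads")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        requests_total.inc()
        queue_depth.set(random.randint(0, 20))
        request_latency.observe(0.3)  # lands in the 0.5, 1 and 5 buckets
        payload_size.observe(random.randint(100, 1000))
        time.sleep(1)
```

With this running, you'd add localhost:8000 as another target under scrape_configs and Prometheus would start collecting all four metrics.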

General Usage

  1. Configure metric collection locations.
    1. If using Kubernetes, Prometheus can automatically discover and scrape clusters to retrieve resource-type metrics.
    2. If you wish to collect application-specific metrics like HTTP request failures, you need to configure instrumentation within the application using the relevant client library.
  2. Define the scrape config and any other parameters (as described above) in the prometheus.yml file.
  3. Access your Prometheus dashboard (localhost:9090 by default). From here you can run queries and view the data as simple graphs (an example query follows).
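To give a taste of the query language (PromQL), here's one way to express the '95% of requests within 300ms' check from the histogram section. It assumes a histogram with a 0.3-second bucket configured and uses the hypothetical metric name from the earlier sketch:

```promql
# Fraction of requests over the last 5 minutes that completed within 0.3s.
# You'd alert if this expression drops below 0.95.
sum(rate(app_request_duration_seconds_bucket{le="0.3"}[5m]))
  /
sum(rate(app_request_duration_seconds_count[5m]))
```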

Prometheus also provides the ability to create alerts; however, the process of creating and handling alerts is much more intuitive and streamlined in Grafana.


Grafana

Grafana is a monitoring and visualization tool that can ingest data from a wide range of sources: from databases like MySQL, MongoDB and Postgres to Kubernetes, AWS, Docker and seemingly everything else.

Grafana offers several key features:

  • Panels and Visualizations: These are composed of a query - a specific metric from a data source - and a visualization. Grafana offers several types of visualization, from standard time-series line graphs to heatmaps.
  • Dashboards: A set of one or more panels that provide a quick overview of the information related to a specific metric or data source.
  • Alerting: Define alerts based on metrics from your data sources and send them via email or to other messaging solutions like Slack.
  • Querying: Since Grafana can ingest data from several different sources, many of which use different query languages, it offers a powerful query engine that allows you to create custom, complex queries.


Security

Authentication & Authorization

Firstly, there are three roles in Grafana:

  • Viewer: Has read-only access to dashboards and panels.
  • Editor: Can create, edit and delete dashboards. Also has access to annotations and alerting.
  • Admin: Full access to the Grafana instance: all of the above, plus user management.

Grafana provides the ability to configure basic authentication and authorization within the app. You can create individual users with custom username/password credentials.

Alternatively, you can use an existing identity federation or implement one of several OAuth providers like Google or GitHub.
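As a rough sketch, enabling GitHub OAuth is a matter of a few lines in grafana.ini. The client ID and secret are placeholders for the values from the OAuth app you register with GitHub, and the organization name is an assumption:

```ini
[auth.github]
enabled = true
allow_sign_up = true
client_id = YOUR_GITHUB_CLIENT_ID
client_secret = YOUR_GITHUB_CLIENT_SECRET
# Optionally restrict logins to members of your organization.
allowed_organizations = your-org
```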

If you have certain teams within your organization that should only see data relevant to them, you can create groups to enable this.

Best Practices

If you are running other services on the same server as Grafana, it's best to incorporate additional measures to secure your applications.

  • Configure Grafana to only allow data from trusted sources.
  • Restrict Grafana's communication with other services running on the server.
  • Configure a proxy server to filter all network traffic.
  • Users with the Viewer role can query any data source in an organization, so it's important to be selective with the data you expose.
  • Disallow anonymous access.
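A couple of these can be enforced directly in grafana.ini; a minimal sketch, assuming TLS is terminated in front of Grafana:

```ini
[auth.anonymous]
# Disallow anonymous access.
enabled = false

[security]
# Only send the session cookie over HTTPS.
cookie_secure = true
```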

Basic Usage

  1. Start the Grafana server. Assuming you're self-hosting and not using Grafana Cloud, access the console at localhost:3000.
  2. Add a new data source. If you're using Grafana for the first time you should see a panel with some getting-started buttons. Otherwise, open the menu on the left-hand side > Connections > Data sources (or provision the data source from a file, as sketched after this list).
    1. Enter the connection details. If using Prometheus with default settings, enter http://localhost:9090.
  3. Create a dashboard.
    1. In the query search box, find the relevant metric.
    2. Click 'Run queries' and your data should show up in the panel above.
  4. Once done, click save.
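If you'd rather skip the UI, data sources can also be provisioned from a file under Grafana's provisioning/datasources directory. A minimal sketch, assuming Prometheus is running locally (file name is my own choice):

```yaml
# provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```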

Alerting

As mentioned above, Grafana also provides alerting capabilities similar to Prometheus. The process, however, is much simpler, as everything can be done within Grafana and doesn't require configuring another component. I'll briefly explain how to send alerts to Slack:

Slack

  1. Create a Slack workspace.
  2. Go to the Slack API site (https://api.slack.com/) and create an app.
  3. Click on Incoming Webhooks and enable them.
    1. Click to add a new webhook, then select your workspace and the relevant channel.
    2. Copy the webhook URL (you can test it as shown below).
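Before wiring the webhook into Grafana, you can sanity-check it with curl. The URL below is a placeholder for the one you copied:

```bash
curl -X POST \
  -H 'Content-type: application/json' \
  --data '{"text": "Webhook test from the command line"}' \
  https://hooks.slack.com/services/YOUR/WEBHOOK/URL
```

If everything is set up correctly, the message appears in your chosen channel.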

Grafana

  1. Go to the Grafana console and, from the menu on the left, select Alerting > Contact points.
    1. Create a contact point > select Slack under integrations > enter the webhook URL.
  2. From the same menu, select Alerting > Alert rules and create a new alert rule.
    1. Enter a name.
    2. Define the query and alert condition.
      1. Select the metric you wish to alert on.
      2. Create an expression, for example, your metric value crossing a certain threshold.
      3. Set it as the alert condition.
    3. Set the evaluation behavior.
      1. If no rule folders exist, create one.
      2. Similarly, if no evaluation groups exist, create one.
      3. Set the pending period. For the purposes of this example, set it to none.
    4. Configure labels and notifications.
      1. For simplicity, just select the Slack channel as the contact point.
      2. Don't bother with labels for now.
    5. Enter a summary and description for the alert if necessary.

Now trigger your alert and you should see a notification in your Slack channel.

Final Notes

Keeping track of all types of metrics within your infrastructure, applications, pipeline, etc., is a vital practice in ensuring your entire system stays healthy. As such, the pairing of Prometheus and Grafana provides an excellent solution to this problem. Prometheus excels in collecting and storing time-series data, offering powerful querying capabilities, while Grafana enhances this by providing intuitive and customizable dashboards for visualizing these metrics.
