Devopsdays NYC 2020 Demo, Open Space Recap & More

Avthar Sewrathan - Mar 18 '20 - Dev Community

Learn about the latest devopsdays event, get our demo, answers to community questions, and more.

(This post was originally published on the Timescale Blog on March 13, 2020.)

We recently attended the NYC installment of the devopsdays event series (thank you to the local organizers and volunteers!), where we met with community members interested in all things monitoring, infrastructure, software development, and CI/CD.

Given the cancellation of many industry events to ensure public safety and mitigate COVID-19’s spread (check out our blog post if you’re interested in monitoring it yourself), we’re sharing a bit about our recent experience – what we learned, what we demoed, and what we spoke about – to bring the event experience to the wider community.

The Demo

During the event, I demoed how to use TimescaleDB as a long-term store for Prometheus metrics - combining Prometheus, TimescaleDB, and Grafana to monitor a piece of critical infrastructure (in this case, a database). This sort of create-your-own flexibility and customization is becoming more and more common in the conversations I have with developers, and this demo allows you to create a monitoring stack that suits your needs, without adding significant costs.

Why this scenario? I was inspired by one of our Timescale Cloud customers, who uses TimescaleDB to store and analyze their Prometheus metrics. They told us how it not only saves them money and disk space, but it also allows them to keep their data around and see trends over longer time periods.

See the demo in action below:

You’ll notice a Grafana dashboard visualizing metrics, with TimescaleDB as the data source powering the dashboard. I focused on the below basic monitoring metrics, but if you try it yourself, you can customize and add more metrics that give you more insight (e.g., query latency, queries per second, open locks, cache hits, etc.):

  • CPU usage
  • Service status
  • % of Disk used
  • # of Database connections
  • % Memory used
  • Network Status

To replicate the demo, follow these tutorials on how to store Prometheus metrics in Timescale and how to use Timescale as a datasource to power Grafana dashboards.

Open Space: DevOps & Data

Mat leading Open Space on DevOps and Data

Devopsdays “Open Spaces” are a (wonderful) concept similar to an unconference format: there’s a block of time scheduled for any attendees to discuss topics of their choosing with other interested attendees. Simply propose a topic to the audience that you’d like to discuss for 30 mins and other attendees can pick and choose which sessions they’d like to attend.

Fellow Timescaler Matvey Arye and I hosted an Open Space session about DevOps and Data. Other sessions ranged from negotiating pay and other soft skills to DevOps in small companies and DevOps in a particular ecosystem (AWS, Microsoft Azure, Google Cloud, etc.).

In our session, we heard stories, best practices, and the ways developers from all industries and areas think about the DevOps data they collect.

A few highlights and commonalities

Teams are moving away from managing infrastructure themselves and toward managed services (as one person put it: “One of the key criteria when we select a new tool is that we want one less thing to manage”).

DevOps at certain companies can be a lonely and isolating job. To remedy that, folks mentioned that they’d joined (and recommend!) a few Slack workspaces: O11y.slack.com, HangOps and Coffee Ops.

Data is becoming increasingly central in how teams fuel their post-mortem problem analysis. Developers collect data about critical incidents, search for patterns in what’s causing them, and correlate this information with how it impacts clients or users.

One team’s best practice and advice (they manage a massive consumer messaging app): Take snapshots of high load periods. This way, you get more detailed information to use for planning and to calibrate for the following years. In this team’s case, the New Year’s Eve timeframe is when they see the highest number of messages sent across their global user base.

Kubernetes, as always, was a hot topic. Two common pain points stood out (and are things that we can relate to as we build our Kubernetes deployment and multi-node offerings):

  1. Visibility about what’s happening inside clusters and pods. Someone summed it up with, “I don’t just want to know my pod is offline, I want to know what was going on inside it.” We couldn’t agree more.
  2. Aggregating observability data across clusters, to simplify things for Ops teams who handle metrics from multiple application teams.

Questions & Conversations

To me, the best part of any conference is the hallway conversations and hearing what community members are keen to learn. As a company, we’re help-first, so, in the spirit of helping, here are a few questions I heard again and again that may be relevant as you get up and running, or do more advanced things with TimescaleDB:

How does TimescaleDB perform at scale?

TimescaleDB scales up well within a single node, and also offers scale-out capabilities if you use our multi-node beta.

In our internal benchmarks on standard cloud VMs, we regularly test TimescaleDB to 10+ billion rows, while sustaining insert rates of 100-200k rows per second (1-2 million metric inserts / second). While running on more powerful hardware, we’ve seen users scale a single-node setup to 500 billion rows of data, while sustaining 400k row inserts per second. To learn more about how TimescaleDB is architected to achieve this scale, see this blog explainer.

And, in our internal tests, a multi-node beta setup with 9 nodes achieved an insert rate of over 12 million metrics per second (and you can read more about our multi-node benchmarking here).
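To put these figures in perspective, here is a back-of-the-envelope sketch in Python. It uses only the numbers quoted above. The metrics-per-row ratio is inferred from the paired figures (100-200k rows vs. 1-2 million metrics per second), and the even per-node split is a simplifying assumption for illustration, not a benchmark result.

```python
# Back-of-the-envelope arithmetic using only the figures quoted in this post.

rows_per_sec_low, rows_per_sec_high = 100_000, 200_000
metrics_per_sec_low = 1_000_000

# Inferred ratio: how many metric values each inserted row carries.
metrics_per_row = metrics_per_sec_low // rows_per_sec_low

# Time needed to ingest 10 billion rows at the sustained single-node rates.
target_rows = 10_000_000_000
hours_at_low = target_rows / rows_per_sec_low / 3600
hours_at_high = target_rows / rows_per_sec_high / 3600

# Multi-node test: 12 million metrics/s across 9 nodes.
# Assumption: load is spread evenly across the nodes.
multi_node_metrics_per_sec = 12_000_000
per_node = multi_node_metrics_per_sec / 9

print(f"metrics per row: {metrics_per_row}")
print(f"10B rows takes {hours_at_high:.1f} to {hours_at_low:.1f} hours")
print(f"per-node rate in the 9-node test: {per_node:,.0f} metrics/s")
```

In other words, at the quoted single-node rates, a 10-billion-row dataset corresponds to roughly half a day to a bit over a day of continuous ingest.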

What’s the role of a long-term data store? What types of things does this allow me to do?

To keep Prometheus simple and easy to operate, its creators intentionally left out some of the scaling features developers typically need. Prometheus stores data locally within the instance, and that data is not replicated. While having both compute and data storage on one node makes it easier to operate, it also makes it harder to scale and ensure high availability.

More specifically, this means Prometheus data isn’t arbitrarily scalable or durable in the face of disk or node outages.

Simply put, Prometheus isn’t designed to be a long-term metrics store. However, its creators also made Prometheus extremely extensible, and, thus, you can use TimescaleDB to store metrics for longer periods of time, which helps with capacity planning and system calibration. This combination also enables high availability and provides advanced capabilities and features, such as full SQL, joins and replication (things not available in Prometheus). To learn more, see why use TimescaleDB and Prometheus.

How do I use TimescaleDB and Prometheus? Do I have to use any special connectors?

Check out the demo :). I suggest using TimescaleDB as a remote read and write endpoint for Prometheus metrics, whether they come from the infrastructure of an internal system or from your public-facing eCommerce website. Since TimescaleDB extends Postgres, you install the pg_prometheus extension for Postgres and our prometheus_postgresql_adapter, and you’re ready to get started.
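As a rough illustration of the Prometheus side of this setup, the Python sketch below renders the remote_write and remote_read stanzas you would add to prometheus.yml. It rests on assumptions you should verify against your adapter version: that the adapter listens on port 9201 and serves /write and /read paths. The hostname adapter.example.internal and the helper name remote_storage_config are hypothetical.

```python
# Sketch: render the remote storage stanzas for prometheus.yml.
# Assumptions (verify against your adapter's documentation): the adapter
# listens on port 9201 and exposes /write and /read endpoints.

def remote_storage_config(adapter_host: str, port: int = 9201) -> str:
    """Return YAML text pointing Prometheus at the adapter."""
    base = f"http://{adapter_host}:{port}"
    return (
        "remote_write:\n"
        f'  - url: "{base}/write"\n'
        "remote_read:\n"
        f'  - url: "{base}/read"\n'
    )

# "adapter.example.internal" is a hypothetical hostname.
config_text = remote_storage_config("adapter.example.internal")
print(config_text)
```

With both stanzas in place, Prometheus forwards incoming samples to the adapter and can read older data back through it.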

Whatever works with Postgres works with TimescaleDB, so, if you want to connect to viz tools (like Grafana or Tableau), ingest data from places like Kafka or insert and analyze data using your favorite programming language (like Python or Go), just use one of the many connectors and libraries in the Postgres ecosystem.
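For example, from Python you might build a parameterized query like the one sketched below and run it with any Postgres driver. This is a minimal sketch under stated assumptions: a view named metrics with time, name, and value columns (the layout I would expect from pg_prometheus; check your own schema), node_load1 as an example metric name, and a hypothetical helper called bucketed_average_query. time_bucket is TimescaleDB's function for grouping rows into fixed time intervals.

```python
# Sketch: build a parameterized query that downsamples one metric.
# Assumptions: a "metrics" view with time, name, and value columns, and
# "node_load1" as an example metric name. Adjust both to your schema.

def bucketed_average_query(metric_name, bucket="5 minutes", lookback="1 day"):
    """Return (sql, params) averaging a metric per time bucket."""
    sql = (
        "SELECT time_bucket(%s::interval, time) AS bucket, "
        "avg(value) AS avg_value "
        "FROM metrics "
        "WHERE name = %s AND time > now() - %s::interval "
        "GROUP BY bucket ORDER BY bucket"
    )
    return sql, (bucket, metric_name, lookback)

sql, params = bucketed_average_query("node_load1")
print(sql)
print(params)

# With a Postgres driver such as psycopg2 installed (the DSN is hypothetical):
#   import psycopg2
#   with psycopg2.connect("postgresql://user:pass@localhost:5432/metrics_db") as conn:
#       with conn.cursor() as cur:
#           cur.execute(sql, params)
#           rows = cur.fetchall()
```

Passing the bucket width, metric name, and lookback as parameters, rather than formatting them into the string, lets the driver handle quoting.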

Want to learn more?

Thank you again to the devopsdays NYC team for your work to pull off such an interactive, fun, and community-first event! We’ll definitely be attending as future events are announced (virtually or otherwise).

In the meantime, those resources once more:

...and, in the event you’d like to see an advanced version of this demo and/or are keen to join some #remote-friendly events, you can join me on March 25 at 12 ET for “How to Analyze Your Prometheus Data in SQL: 3 Queries You Need to Know.”

  • I’ll focus on code and showing vs. telling: You’ll learn how to write custom SQL queries to analyze infrastructure monitoring metrics and create Grafana visualizations to see trends, and I’ll answer any questions that you may have.
  • Interested? Sign up here. You’ll receive the recording and resources shortly following the session, so register even if you can’t attend live.