11 Monitoring ScyllaDB

This chapter covers

Configuring the Scylla monitoring stack against your cluster
Using Prometheus to collect metrics
Viewing dashboards and visualizations of your cluster’s performance using Grafana
Load-testing via cassandra-stress
Diagnosing and remediating common incidents

To run a database in production, you need to know if it’s actually running. The rest of the book is about using Scylla in a way that minimizes the chance of an alert ever firing; this chapter is about monitoring your cluster. Not monitoring a database is a great way to never get paged in the middle of the night, but it’s also frowned upon by users, managers, and just about every best practice out there. Here, you’ll learn how to monitor Scylla, observe its performance, and generate alerts to clue you in on problems in your cluster.

Ideally, your cluster never has a problem, and you’re never paged. To get closer to that ideal, you’ll learn how to load-test Scylla to determine how much traffic your database can handle, compare that against your expected traffic volume, and size the cluster appropriately. Additionally, a load test is a great way to see the monitoring tools in action: it generates load on the cluster that you can watch in the dashboards.

11.1 The monitoring stack

11.1.1 Deploying monitoring

11.1.2 Prometheus

11.1.3 Grafana

11.1.4 Alertmanager

11.1.5 Other monitoring needs

11.2 Causing stress with cassandra-stress

11.2.1 Setting up cassandra-stress

11.2.2 Examining performance

11.3 Common incidents

11.3.1 A hot partition

11.3.2 An overwhelmed database

11.3.3 Failing to meet consistency requirements

Summary