Infrastructure design

Why Auto Scaling?
Prometheus has no built-in high-availability solution, apart from the Prometheus Operator and Thanos, which are used with Kubernetes. Also, since Prometheus is a pull-based monitoring system, it is easy and well suited to run inside an Auto Scaling Group (ASG).
How do we backup running Prometheus data?
As Prometheus fundamentally runs on one machine, some may wish to take backups of their data. With Prometheus 1.x this was a slow and disruptive process, requiring Prometheus to be completely restarted. The good news is that, due to its new storage engine, Prometheus 2.1 has a much better way of doing this.
To use it, you must enable the Admin API endpoints when running Prometheus:
$ ./prometheus --storage.tsdb.path=data/ --web.enable-admin-api
Then you can use a simple HTTP POST request to ask for a snapshot:
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
A few seconds later it returns the name of the new snapshot in a JSON object. If you look under the snapshots directory of your data directory you'll see this snapshot:
$ cd data/snapshots
$ ls
20180119T172548Z-78ec94e1b5003cb
Then copy the snapshot to S3 using the AWS CLI. This process runs every 5 minutes via a crontab entry.
In the ASG AMI we have set up another script that copies the data back from S3 when the server starts.
Prometheus then restores the backup by itself, because the startup script places the snapshot contents in the data directory passed to the following flag:
--storage.tsdb.path
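As a rough sketch of both steps (the bucket name, paths and schedule here are placeholders, not the exact scripts we run):
# crontab entry (every 5 minutes): take a snapshot, then sync all snapshots to S3
*/5 * * * * curl -s -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot && aws s3 sync data/snapshots/ s3://my-prometheus-backups/snapshots/
# restore step baked into the ASG AMI, run once at boot before starting Prometheus:
# copy the latest snapshot back into the TSDB directory, then start Prometheus on it
$ aws s3 sync s3://my-prometheus-backups/snapshots/ /tmp/restore/
$ cp -r /tmp/restore/$(ls -1 /tmp/restore | sort | tail -n 1)/. data/
$ ./prometheus --storage.tsdb.path=data/ --web.enable-admin-api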
Challenges:
- Pointing Pushgateway metrics at the replacement instance.
- Avoiding alert conflicts when more than one instance is running.
Solutions:
- Set up a DNS name for the Pushgateway, so clients do not need to know which instance is active.
- Use Prometheus Alertmanager to deduplicate notifications and avoid false alerts.
SLAs, SLOs, SLIs word soup
There's a lot already written about these topics. If you are not familiar with the terms, I would strongly recommend first reading the chapter on Service Level Objectives from Google's SRE book.
In summary:
- SLAs: Service Level Agreement
- What service you commit to provide to users, with possible penalties if you are not able to meet it.
- Example: "99.5%" availability.
- Keyword: contract
- SLOs: Service Level Objective
- What you have internally set as a target, driving your measuring threshold (for example, on dashboards and alerting). In general, it should be stricter than your SLA.
- Example: "99.9%" availability (the so called "three 9s").
- Keyword: thresholds
- SLIs: Service Level Indicators
- What you actually measure, to ascertain whether your SLOs are on/off-target.
- Example: error ratios, latency
- Keyword: metrics
SLOs are about time
So what does 99% availability mean? It's not a 1% error ratio (percentage of failed HTTP responses), but rather the percentage of time, over a predefined period, during which the service has been available.

In the dashboard above, the service went above a 0.1% error ratio (0.001 on the y-axis) for 1 hour (the small red horizontal segment on top of the error spike), thus giving 99.4% availability over a 7-day period (1 hour out of 168: 1 - 1/168 ≈ 99.4%):

A key factor in this result is the time span you choose for measuring availability (7 days in the example above). Shorter periods are typically used as checkpoints for the engineering teams involved (for example, SRE and SWE) to track how the service is doing, while longer periods are usually used for review purposes by the organization or wider team.
For example, if you set a 99.9% SLO, then the total time the service can be down would be the following:
- during 30 days: 43 min (about 3/4 of an hour)
- during 90 days: 129 min (~2 hours)
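These figures follow directly from the error-budget arithmetic:

\[ \text{allowed downtime} = (1 - \text{SLO}) \times \text{period} \]
\[ (1 - 0.999) \times 30 \times 24 \times 60\,\text{min} = 43.2\,\text{min}, \qquad (1 - 0.999) \times 90 \times 24 \times 60\,\text{min} \approx 129.6\,\text{min} \]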
Another simple numbers fact is that each extra 9 added to the SLO cuts the allowed downtime by a factor of ten. See the following downtime allowances for a total period of 1 year:
- 2×9s: 99%: 5250min (87hrs or 3.64days)
- 3×9s: 99.9%: 525min (8.7hrs)
- 4×9s: 99.99%: 52.5min
- 5×9s: 99.999%: 5min <- rule of approximation: 5× 9s -> 5 mins (per year)
Enter error budgets
The above numbers for the allowed time a service can be down may be thought of as an error budget, which is consumed by events such as the following:
- planned maintenance
- failed upgrades
- unexpected outages
The practical outcome is that any of the above consumes error budget from your service; for example, an unexpected outage may deplete it to the point of blocking further maintenance work for the rest of that time period.
SLIs are about metrics
From the above, it's clear that we must have service metrics to tell us when the service is considered (un)available. There are several approaches for this:
- RED: Rate, Errors, Duration (introduced by @tom_wilkie)
- USE: Utilization, Saturation, and Errors (introduced by @brendangregg)
Example SLO implementation
Let's take a specific example, following the RED method (as the metrics we already have available are a better match for this approach): create alerts and dashboards to support a target SLO for the Kubernetes API, via tools commonly used for monitoring purposes: [Prometheus] and [Grafana].
Additionally we'll use [jsonnet] to build our rules and dashboards files, taking advantage of existing library helpers.
Rather than explaining how to signal when your service is outside its thresholds, the rest of this article focuses on how to record the time the service has been in that condition ("time out of SLO") using Prometheus rules, based on thresholds for specific metrics (SLIs), as discussed in the "SLOs are about time" section.
Define the SLO target and metrics thresholds
Let's define a simple target:
- SLO: 99%, from the following:
- SLIs:
- error ratio under 1%
- latency under 200ms for 90th percentile of requests
Writing the above spec as jsonnet (see [spec-kubeapi.jsonnet]):
slo:: {
  target: 0.99,
  error_ratio_threshold: 0.01,
  latency_percentile: 90,
  latency_threshold: 200,
},
Finding the SLIs
The Kubernetes API exposes several metrics we can use as SLIs, using the Prometheus rate() function over a short period (here we choose 5min; this should be a few times your scrape interval):
- apiserver_request_count: counts all the requests by verb, code, and resource; e.g., to get the total error ratio for the last 5 minutes:
sum(rate(apiserver_request_count{code=~"5.."}[5m]))
/
sum(rate(apiserver_request_count[5m]))
- The formula above discards all metric labels (for example, HTTP verb, code). If you want to keep some labels, you'd need to do something like the following:
sum by (verb, code) (rate(apiserver_request_count{code=~"5.."}[5m]))
/ ignoring (verb, code) group_left
sum (rate(apiserver_request_count[5m]))
- apiserver_request_latencies_bucket: latency histogram by verb. For example, to get the 90th latency percentile in milliseconds (note that the le "less or equal" label is special, as it sets the histogram bucket intervals; see [Prometheus histograms and summaries][promql-histogram]):
histogram_quantile(
  0.90,
  sum by (le, verb, instance) (
    rate(apiserver_request_latencies_bucket[5m])
  )
) / 1e3
Writing Prometheus rules to record the chosen SLIs
PromQL is a very powerful language, although as of October 2018 it doesn't yet support nested subqueries over ranges (see Prometheus issue 1227 for details), a feature we would need in order to compute the ratio of time that the error ratio or latency spends outside its threshold.
Also, as a good practice to lower query-time Prometheus resource usage, it is recommended to add recording rules that precompute expressions such as sum(rate(...)) anyway.
As an example of how to do this, the following set of recording rules, taken from our [bitnami-labs/kubernetes-grafana-dashboards] repository, captures the above time ratio:
- Create a new kubernetes:job_verb_code_instance:apiserver_requests:rate5m metric to record request rates:
record: kubernetes:job_verb_code_instance:apiserver_requests:rate5m
expr: |
  sum by(job, verb, code, instance) (rate(apiserver_request_count[5m]))
- Using the above metric, create a new kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m metric for the request ratios (over the total):
record: kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m
expr: |
  kubernetes:job_verb_code_instance:apiserver_requests:rate5m
    / ignoring(verb, code) group_left()
  sum by(job, instance) (
    kubernetes:job_verb_code_instance:apiserver_requests:rate5m
  )
- Using the above ratio metrics (for every HTTP code and verb), create a new one to capture the error ratios:
record: kubernetes:job:apiserver_request_errors:ratio_rate5m
expr: |
  sum by(job) (
    kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m
      {code=~"5..",verb=~"GET|POST|DELETE|PATCH"}
  )
- Using the above error ratio (together with a similarly created kubernetes::job:apiserver_latency:pctl90rate5m rule that records the 90th percentile latency over the past 5 minutes, not shown above for simplicity), finally create a boolean metric to record our SLO compliance:
record: kubernetes::job:slo_kube_api_ok
expr: |
  kubernetes:job:apiserver_request_errors:ratio_rate5m < bool 0.01
    *
  kubernetes::job:apiserver_latency:pctl90rate5m < bool 200
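Because this final metric is 1 when both SLIs are within their thresholds and 0 otherwise, averaging it over a window gives the fraction of time the service has been within SLO. For example, a query sketch (not part of the repository's rule files) to get the 7-day availability discussed earlier:
# fraction of time within SLO over the last 7 days, e.g. 0.994 for 99.4%
avg_over_time(kubernetes::job:slo_kube_api_ok[7d])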
Writing Prometheus alerting rules
The above kubernetes::job:slo_kube_api_ok metric is very useful for dashboards and for accounting SLO compliance, but we should alert on whichever of the above metrics is driving the SLO off-target, as shown in the following Prometheus alerting rules:
- Alert on high API error ratio:
alert: KubeAPIErrorRatioHigh
expr: |
  sum by(instance) (
    kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m
      {code=~"5..",verb=~"GET|POST|DELETE|PATCH"}
  ) > 0.01
for: 5m
- Alert on high API latency:
alert: KubeAPILatencyHigh
expr: |
  max by(instance) (
    kubernetes:job_verb_instance:apiserver_latency:pctl90rate5m
      {verb=~"GET|POST|DELETE|PATCH"}
  ) > 200
for: 5m
Note that the Prometheus rules above are taken from the already manifested jsonnet output, which can be found in [our sources][bitnami-labs/kubernetes-grafana-dashboards], and that the thresholds are evaluated from $.slo.error_ratio_threshold and $.slo.latency_threshold respectively.
Programmatically creating Grafana dashboards
Creating Grafana dashboards is usually done by interacting with the UI. This is fine for simple or "standard" dashboards (for example, those downloaded from https://grafana.com/dashboards), but it becomes cumbersome if you want to follow best DevOps practices, especially for GitOps workflows.
The community is addressing this issue via efforts such as Grafana libraries for jsonnet, Python, and JavaScript. Given our jsonnet implementation, we chose grafonnet-lib.
One very useful outcome of using jsonnet to set our SLO thresholds and to code our Prometheus rules is that we can reuse both to build our Grafana dashboards, without having to copy and paste them; that is, we keep a single source of truth for these.
For example:
- referring to $.slo.error_ratio_threshold in our Grafana dashboards to set the graph panel's thresholds property, as we did above for our Prometheus alerting rules.
- referring to the created Prometheus recording rules via jsonnet; in the excerpt from [spec-kubeapi.jsonnet] below, note the use of metric.rules.requests_ratiorate_job_verb_code.record (instead of the verbatim string 'kubernetes:job_verb_code_instance:apiserver_requests:ratio_rate5m'):
// Graph showing all requests ratios
req_ratio: $.grafana.common {
  title: 'API requests ratios',
  formula: metric.rules.requests_ratiorate_job_verb_code.record,
  legend: '{{ verb }} - {{ code }}',
},
You can read our implementation at dash-kubeapi.jsonnet, the following is a screenshot of the resulting dashboard:

Putting it all together
We implemented the above ideas in our bitnami-labs/kubernetes-grafana-dashboards repository, under the jsonnet folder.
Our built Prometheus rules and Grafana dashboard files are produced from the jsonnet sources as follows:

- [spec-kubeapi.jsonnet]: as much data-only specification as possible (thresholds, rules and dashboard formulas)
- rules-kubeapi.jsonnet: outputs Prometheus recording rules and alerts
- dash-kubeapi.jsonnet: outputs Grafana dashboards, using grafonnet-lib via our opinionated bitnami_grafana.libsonnet.
Since we started this project, many other useful Prometheus rules have been created by the community. Check srecon17_americas_slides_wilkinson.pdf for more information on this. If we had to start from scratch again, we'd likely be using the kubernetes-mixin together with jsonnet-bundler.
Alert Manager
The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.
The following describes the core concepts the Alertmanager implements. Consult the configuration documentation to learn how to use them in more detail.
Grouping
Grouping categorizes alerts of similar nature into a single notification. This is especially useful during larger outages when many systems fail at once and hundreds to thousands of alerts may be firing simultaneously.
Example: Dozens or hundreds of instances of a service are running in your cluster when a network partition occurs. Half of your service instances can no longer reach the database. Alerting rules in Prometheus were configured to send an alert for each service instance if it cannot communicate with the database. As a result hundreds of alerts are sent to Alertmanager.
As a user, one only wants to get a single page while still being able to see exactly which service instances were affected. Thus one can configure Alertmanager to group alerts by their cluster and alertname so it sends a single compact notification.
Grouping of alerts, timing for the grouped notifications, and the receivers of those notifications are configured by a routing tree in the configuration file.
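For illustration, a minimal routing snippet in alertmanager.yml that groups by cluster and alert name could look roughly like this (the receiver name and timing values are placeholders, not taken from any particular setup):
route:
  receiver: 'team-pager'            # placeholder receiver name
  group_by: ['alertname', 'cluster']
  group_wait: 30s                   # how long to buffer alerts of the same group before the first notification
  group_interval: 5m                # how long to wait before sending an updated notification for a group
  repeat_interval: 4h               # how long to wait before re-sending a still-firing notification
receivers:
  - name: 'team-pager'
    # email_configs, pagerduty_configs or opsgenie_configs would go here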
Inhibition
Inhibition is a concept of suppressing notifications for certain alerts if certain other alerts are already firing.
Example: An alert is firing that informs that an entire cluster is not reachable. Alertmanager can be configured to mute all other alerts concerning this cluster if that particular alert is firing. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual issue.
Inhibitions are configured through the Alertmanager's configuration file.
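A sketch of the cluster example above, as it might appear in alertmanager.yml (the alert name, severity values and the cluster label are illustrative):
inhibit_rules:
  - source_match:
      alertname: 'ClusterUnreachable'   # while this alert fires...
      severity: 'critical'
    target_match:
      severity: 'warning'               # ...mute these alerts...
    equal: ['cluster']                  # ...but only for the same cluster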
Silences
Silences are a straightforward way to mute alerts for a given time. A silence is configured based on matchers, just like the routing tree. Incoming alerts are checked whether they match all the equality or regular expression matchers of an active silence. If they do, no notifications will be sent out for that alert.
Silences are configured in the web interface of the Alertmanager.
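They can also be created from the command line with amtool; for example (the alert name, duration and URL here are illustrative):
$ amtool silence add alertname=KubeAPILatencyHigh \
    --comment="planned API server upgrade" --duration=2h \
    --alertmanager.url=http://localhost:9093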
Client behavior
The Alertmanager has special requirements for behavior of its client. Those are only relevant for advanced use cases where Prometheus is not used to send alerts.
High Availability
Alertmanager supports configuration to create a cluster for high availability. This can be configured using the --cluster.* flags.
It's important not to load balance traffic between Prometheus and its Alertmanagers, but instead, point Prometheus to a list of all Alertmanagers.
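For example, a two-node cluster could be started roughly like this (host names are placeholders), with Prometheus then pointed at both instances rather than at a load balancer in front of them:
# on alertmanager-1
$ ./alertmanager --config.file=alertmanager.yml \
    --cluster.listen-address=0.0.0.0:9094 \
    --cluster.peer=alertmanager-2:9094
# on alertmanager-2
$ ./alertmanager --config.file=alertmanager.yml \
    --cluster.listen-address=0.0.0.0:9094 \
    --cluster.peer=alertmanager-1:9094
# prometheus.yml: list every Alertmanager instance explicitly
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']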