Build monitoring system base on prometheus and grafana

source

prometheus

grafana

telegraf

install

alertmanager

1wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz -O alertmanager.tar.gz
2tar -zxvf alertmanager.tar.gz
3cd alertmanager
4#you will see the alertmanager binary and a simple yaml config
5./alertmanager -h
6#show usage of alertmanager

alertmanager.yml

 1route:
 2  group_by: ["alertname"]
 3  group_wait: 30s
 4  group_interval: 5m
 5  repeat_interval: 1h
 6  receiver: "web.hook"
 7receivers:
 8  - name: "web.hook"
 9    webhook_configs:
10      - url: "http://127.0.0.1:5001/"
11inhibit_rules:
12  - source_match:
13      severity: "critical"
14    target_match:
15      severity: "warning"
16    equal: ["alertname", "dev", "instance"]

./alertmanater start alertmanager server on port 9093

node exporter

Prometheus collects variable metrics from variable kinds of exporter

1wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz -O node_exporter.tar.gz
2tar -zxvf node_exporter.tar.gz
3cd node_exporter
4#you will see the node exporter binary
5./node_exporter
6#start node exporter on port 9100

prometheus

1wget https://github.com/prometheus/prometheus/releases/download/v2.36.2/prometheus-2.36.2.linux-amd64.tar.gz -O prometheus.tar.gz
2tar -zxvf prometheus
3cd prometheus
4#you will see the prometheus binary and a simple yaml config
5./prometheus -h
6#show usage of prometheus

modify your prometheus.yml prometheus.yml

 1# my global config
 2global:
 3  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
 4  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
 5  # scrape_timeout is set to the global default (10s).
 6# Alertmanager configuration
 7alerting:
 8  alertmanagers:
 9    - static_configs:
10        - targets:
11            - localhost:9093
12# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
13rule_files:
14  - "first_rules.yml"
15  # - "second_rules.yml"
16# A scrape configuration containing exactly one endpoint to scrape:
17# Here it's Prometheus itself.
18scrape_configs:
19  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
20  - job_name: "prometheus"
21    # metrics_path defaults to '/metrics'
22    # scheme defaults to 'http'.
23    static_configs:
24      - targets: ["localhost:9090"]
25  - job_name: "node"
26    static_configs:
27      - targets: ["localhost:9100"]

first_rules.yml

 1groups:
 2  - name: high_cpu
 3    rules:
 4      - alert: HostHighCpuLoad
 5        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 1
 6        for: 0m
 7        labels:
 8          severity: warning
 9        annotations:
10          summary: Host high CPU load (instance {{ $labels.instance }})
11          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

start a simple web server receives the webhook

1from flask import Flask, request
2app = Flask(__name__)
3@app.route("/", methods=["POST"])
4def index():
5    """docstring for index"""
6    print(request.data, "data")
7    print(request.args, "args")
8    print(request.form, "form")
9    return "ok"

grafana

1sudo apt-get install -y adduser libfontconfig1
2wget https://dl.grafana.com/oss/release/grafana_9.0.0_amd64.deb
3sudo dpkg -i grafana_9.0.0_amd64.deb
4sudo systemctl enable --now grafana-server.service

open you browser open grafana dashboard login use admin/admin and change your default password open setting datasource add prometheus datasource find node dashboard in grafana dashboard market import dashboard by id 1860

screenshot