marksugar — 675 posts, 140 comments received.
Search results: 84 posts matching "surveillance system".
2022-03-03
linuxea: Testing Promscale and TimescaleDB as Prometheus remote storage
Promscale is an open-source observability backend for metrics and traces powered by SQL, built on the robust, high-performance foundation of PostgreSQL and TimescaleDB. It natively supports Prometheus metrics and OpenTelemetry traces and, through the OpenTelemetry Collector, many other formats such as StatsD, Jaeger and Zipkin; it is 100% PromQL-compatible. Its full SQL support lets developers correlate metrics, traces and business data, yielding insights that are impossible when data is siloed across different systems. It integrates easily with Grafana and Jaeger for visualizing metrics and traces, and it inherits from PostgreSQL and TimescaleDB rock-solid reliability, up to 90% native compression, continuous aggregates, and the operational maturity of a system running on millions of instances worldwide. Promscale can serve as a Prometheus data source for visualization tools such as Grafana and PromLens.

Promscale consists of two components:

- Promscale Connector: a stateless service that provides the ingestion interface for observability data, processes it, and stores it in TimescaleDB. It also exposes an interface for querying the data with PromQL. The Connector automatically sets up the data structures in TimescaleDB and handles changes to those structures when upgrading to a new Promscale version.
- TimescaleDB: the Postgres-based database that stores all the observability data. It provides a full SQL interface for querying, plus advanced features such as analytical functions, columnar compression and continuous aggregates. TimescaleDB also gives you the flexibility to store business and other kinds of data, which you can then correlate with the observability data.

The Promscale Connector ingests Prometheus metrics, metadata and OpenMetrics exemplars via the Prometheus remote_write interface, and OpenTelemetry traces via the OpenTelemetry protocol (OTLP). Using the OpenTelemetry Collector, it can also ingest metrics and traces in other formats, processing and forwarding them over remote_write and OTLP; for example, you can feed Jaeger traces and StatsD metrics into Promscale. For Prometheus metrics, the Connector exposes the Prometheus API endpoints for running PromQL queries and reading metadata, so tools that speak the Prometheus API (such as Grafana) can connect to Promscale directly. You can also send queries to Prometheus and have it read back from Promscale over the remote_read interface. Finally, you can query metrics and traces in Promscale with SQL, which opens it up to the many visualization tools that integrate with PostgreSQL; Grafana, for example, supports SQL queries against Promscale out of the box through its PostgreSQL data source.

I will try this out with containers, so first install docker and docker-compose:

```bash
yum install -y yum-utils device-mapper-persistent-data lvm2
yum-config-manager --add-repo \
    https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/centos/docker-ce.repo
yum install -y docker-ce docker-ce-cli containerd.io docker-compose
```

Orchestration: following the official docker configuration, I put together a compose file for testing:

```yaml
version: '2.2'
services:
  timescaledb:
    image: timescaledev/promscale-extension:latest-ts2-pg13
    container_name: timescaledb
    restart: always
    hostname: "timescaledb"
    network_mode: "host"
    environment:
      - csynchronous_commit=off
      - POSTGRES_PASSWORD=123
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /data/prom/timescaledb/data:/var/lib/postgresql/data:rw
    mem_limit: 512m
    user: root
    stop_grace_period: 1m
  promscale:
    image: timescale/promscale:0.10
    container_name: promscale
    restart: always
    hostname: "promscale"
    network_mode: "host"
    environment:
      - PROMSCALE_DB_PASSWORD=123
      - PROMSCALE_DB_PORT=5432
      - PROMSCALE_DB_NAME=postgres
      - PROMSCALE_DB_HOST=127.0.0.1
      - PROMSCALE_DB_SSL_MODE=allow
    volumes:
      - /etc/localtime:/etc/localtime:ro
      # - /data/prom/postgresql/data:/var/lib/postgresql/data:rw
    mem_limit: 512m
    user: root
    stop_grace_period: 1m
  grafana:
    image: grafana/grafana:8.3.7
    container_name: grafana
    restart: always
    hostname: "grafana"
    network_mode: "host"
    #environment:
    #  - GF_INSTALL_PLUGINS="grafana-clock-panel,grafana-simple-json-datasource"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /data/grafana/plugins:/var/lib/grafana/plugins
    mem_limit: 512m
    user: root
  prometheus:
    image: prom/prometheus:v2.33.4
    container_name: prometheus
    restart: always
    hostname: "prometheus"
    network_mode: "host"
    #environment:
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /data/prom/prometheus/data:/prometheus:rw   # NOTE: chown 65534:65534 /data/prometheus/
      - /data/prom/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /data/prom/prometheus/alert:/etc/prometheus/alert
      #- /data/prom/prometheus/ssl:/etc/prometheus/ssl
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention=45d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    mem_limit: 512m
    user: root
    stop_grace_period: 1m
  node_exporter:
    image: prom/node-exporter:v1.3.1
    container_name: node_exporter
    user: root
    privileged: true
    network_mode: "host"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
```

grafana: we run it as root here because the plugins need to be installed manually:

```
bash-5.1# grafana-cli plugins install grafana-clock-panel
✔ Downloaded grafana-clock-panel v1.3.0 zip successfully

Please restart Grafana after installing plugins. Refer to Grafana documentation for instructions if necessary.

bash-5.1# grafana-cli plugins install grafana-simple-json-datasource
✔ Downloaded grafana-simple-json-datasource v1.4.2 zip successfully

Please restart Grafana after installing plugins. Refer to Grafana documentation for instructions if necessary.
```

To configure grafana, dashboard templates can be found at https://grafana.com/grafana/dashboards/?pg=hp&plcmt=lt-box-dashboards&search=prometheus (see also "Visualize data in Promscale").

prometheus: we can now configure the remote storage; the parameters are documented on the official site:

```yaml
remote_write:
  - url: "http://127.0.0.1:9201/write"
remote_read:
  - url: "http://127.0.0.1:9201/read"
    read_recent: true
```

A tuned remote configuration looks like this:

```yaml
remote_write:
  - url: "http://127.0.0.1:9201/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '.*:.*'
        action: drop
    remote_timeout: 100s
    queue_config:
      capacity: 500000
      max_samples_per_send: 50000
      batch_send_deadline: 30s
      min_backoff: 100ms
      max_backoff: 10s
      min_shards: 16
      max_shards: 16
remote_read:
  - url: "http://127.0.0.1:9201/read"
    read_recent: true
```

The full prometheus.yml:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - '127.0.0.1:9093'
rule_files:
  - "alert/host.alert.rules"
  - "alert/container.alert.rules"
  - "alert/targets.alert.rules"
scrape_configs:
  - job_name: prometheus
    scrape_interval: 30s
    static_configs:
      - targets: ['127.0.0.1:9090']
      - targets: ['127.0.0.1:9093']
      - targets: ['127.0.0.1:9100']
remote_write:
  - url: "http://127.0.0.1:9201/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '.*:.*'
        action: drop
    remote_timeout: 100s
    queue_config:
      capacity: 500000
      max_samples_per_send: 50000
      batch_send_deadline: 30s
      min_backoff: 100ms
      max_backoff: 10s
      min_shards: 16
      max_shards: 16
remote_read:
  - url: "http://127.0.0.1:9201/read"
    read_recent: true
```

After restarting, check the logs:

```
ts=2022-03-03T01:35:28.123Z caller=main.go:1128 level=info msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
ts=2022-03-03T01:35:28.137Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Starting WAL watcher" queue=797d34
ts=2022-03-03T01:35:28.138Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Starting scraped metadata watcher"
ts=2022-03-03T01:35:28.277Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Replaying WAL" queue=797d34
ts=2022-03-03T01:35:38.177Z caller=main.go:1165 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=10.053377011s db_storage=1.82µs remote_storage=13.752341ms web_handler=549ns query_engine=839ns scrape=10.038744417s scrape_sd=44.249µs notify=41.342µs notify_sd=6.871µs rules=30.465µs
ts=2022-03-03T01:35:38.177Z caller=main.go:896 level=info msg="Server is ready to receive web requests."
ts=2022-03-03T01:35:53.584Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Done replaying WAL" duration=25.446317635s
```

To look at the data, enter the TimescaleDB container:

```
[root@localhost data]# docker exec -it timescaledb sh
/ # su - postgres
timescaledb:~$ psql
psql (13.4)
Type "help" for help.
```
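As a rough sanity check on the queue_config numbers above, here is a back-of-the-envelope sketch. This is my own simplification, not Prometheus's actual sharding logic: assume each shard flushes at most max_samples_per_send samples and that, in the worst case, every batch waits the full batch_send_deadline before being sent.

```python
# Back-of-the-envelope throughput floor for a Prometheus remote_write queue.
# Assumption (mine, not from the Prometheus source): each shard flushes at
# most max_samples_per_send samples per batch_send_deadline in the worst case.
def min_sustained_throughput(shards: int, max_samples_per_send: int,
                             batch_send_deadline_s: float) -> float:
    """Samples/second the queue can drain even if every batch waits for the deadline."""
    return shards * max_samples_per_send / batch_send_deadline_s

# Values from the remote_write config above: 16 shards, 50k samples, 30s deadline.
floor = min_sustained_throughput(shards=16, max_samples_per_send=50_000,
                                 batch_send_deadline_s=30.0)
print(f"{floor:.0f} samples/s")  # 26667 samples/s
```

In practice batches are usually sent well before the deadline, so real throughput is bounded more by network round-trip latency and Promscale's ingest speed than by this floor.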
We can now query a disk I/O metric over the past five minutes:

```sql
SELECT * FROM node_disk_io_now WHERE time > now() - INTERVAL '5 minutes';
```

```
            time            | value | series_id |    labels     | device_id | instance_id | job_id 
----------------------------+-------+-----------+---------------+-----------+-------------+--------
 2022-03-02 21:03:58.373-05 |     0 |       348 | {51,140,91,3} |       140 |          91 |      3
 2022-03-02 21:04:28.373-05 |     0 |       348 | {51,140,91,3} |       140 |          91 |      3
 2022-03-02 21:04:58.373-05 |     0 |       348 | {51,140,91,3} |       140 |          91 |      3
 2022-03-02 21:03:58.373-05 |     0 |       349 | {51,252,91,3} |       252 |          91 |      3
 2022-03-02 21:04:28.373-05 |     0 |       349 | {51,252,91,3} |       252 |          91 |      3
 2022-03-02 21:03:58.373-05 |     0 |       350 | {51,253,91,3} |       253 |          91 |      3
 ...
```

Next, an aggregate query over label keys. Each label key is expanded into its own column, which stores a foreign-key identifier as its value. This allows JOINs, aggregation and filtering by label key and value. To retrieve the text represented by a label id, use the val(field_id) function; this lets you do things like aggregate across all series with a given label key. For example, to find the median of node_disk_io_now grouped by the job associated with it:

```sql
SELECT
    val(job_id) as job,
    percentile_cont(0.5) within group (order by value) AS median
FROM
    node_disk_io_now
WHERE
    time > now() - INTERVAL '5 minutes'
GROUP BY job_id;
```

```
    job     | median 
------------+--------
 prometheus |      0
(1 row)
```

Querying a metric's label set: the labels field in any metric row represents the full label set associated with the measurement, stored as an array of identifiers. To return the whole label set in JSON, use the jsonb() function:

```sql
SELECT time, value, jsonb(labels) as labels
FROM node_disk_io_now
WHERE time > now() - INTERVAL '5 minutes';
```

```
            time            | value |                                                labels                                                 
----------------------------+-------+--------------------------------------------------------------------------------------------------------
 2022-03-02 21:09:58.373-05 |     0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"}
 2022-03-02 21:10:28.373-05 |     0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"}
 2022-03-02 21:09:58.373-05 |     0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"}
 2022-03-02 21:09:58.373-05 |     0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"}
 ...
```

Querying node_disk_info:

```
postgres=# SELECT * FROM prom_series.node_disk_info;
 series_id |         labels         | device |    instance    |    job     | major | minor 
-----------+------------------------+--------+----------------+------------+-------+-------
       250 | {150,140,91,3,324,325} | dm-0   | 127.0.0.1:9100 | prometheus | 253   | 0
       439 | {150,253,91,3,508,325} | sda    | 127.0.0.1:9100 | prometheus | 8     | 0
       440 | {150,258,91,3,507,325} | sr0    | 127.0.0.1:9100 | prometheus | 11    | 0
       516 | {150,252,91,3,324,564} | dm-1   | 127.0.0.1:9100 | prometheus | 253   | 1
(4 rows)
```

The same query with labels:

```sql
SELECT jsonb(labels) as labels, value FROM node_disk_info WHERE time < now();
```

```
                                                              labels                                                              | value 
----------------------------------------------------------------------------------------------------------------------------------+-------
 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} |     1
 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} |     1
 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} |   NaN
 ...
```

The metric view itself can be inspected from psql:

```
postgres=# \d+ node_disk_info
                          View "prom_metric.node_disk_info"
   Column    |           Type           | Collation | Nullable | Default | Storage  
-------------+--------------------------+-----------+----------+---------+----------
 time        | timestamp with time zone |           |          |         | plain    
 value       | double precision         |           |          |         | plain    
 series_id   | bigint                   |           |          |         | plain    
 labels      | label_array              |           |          |         | extended 
 device_id   | integer                  |           |          |         | plain    
 instance_id | integer                  |           |          |         | plain    
 job_id      | integer                  |           |          |         | plain    
 major_id    | integer                  |           |          |         | plain    
 minor_id    | integer                  |           |          |         | plain    
View definition:
 SELECT data."time",
    data.value,
    data.series_id,
    series.labels,
    series.labels[2] AS device_id,
    series.labels[3] AS instance_id,
    series.labels[4] AS job_id,
    series.labels[5] AS major_id,
    series.labels[6] AS minor_id
   FROM prom_data.node_disk_info data
     LEFT JOIN prom_data_series.node_disk_info series ON series.id = data.series_id;
```

For more queries, see the tutorials on the official site.

Scheduled deletion: Promscale's retention policy in PG defaults to dropping data after 90 days. You can check it with:

```sql
SELECT * FROM prom_info.metric;
```

and adjust it. TimescaleDB includes a background job scheduling framework for automating data management tasks, such as simple data retention policies. A database administrator can create, remove or alter policies that automatically run drop_chunks on a defined schedule. To add such a policy on a hypertable, continuously dropping chunks older than 24 hours:

```sql
SELECT add_retention_policy('conditions', INTERVAL '24 hours');
```

To remove the policy again:

```sql
SELECT remove_retention_policy('conditions');
```

The scheduler framework also lets you inspect the scheduled jobs:

```sql
SELECT * FROM timescaledb_information.job_stats;
```

To create a retention policy that discards chunks older than six months:

```sql
SELECT add_retention_policy('conditions', INTERVAL '6 months');
```

Or, with an integer-based time column:

```sql
SELECT add_retention_policy('conditions', BIGINT '600000');
```

For the Prometheus data itself we can adjust the default; see the Data Retention documentation:

```
postgres=# SELECT
postgres-#     set_default_retention_period(180 * INTERVAL '1 day');
 set_default_retention_period 
------------------------------
 t
(1 row)
```

Retention is now 180 days, and both prometheus and grafana continue to work normally.

About load testing of the current version: on Slack, a user benchmarked Prometheus 2.30.0 with Promscale 0.10.0:

> Hello everyone, I am doing a promscale+timescaledb performance test with 1 promscale (8cpu 32GB memory), 1 timescaledb (postgres 12.9 + timescale 2.5.2 with 16cpu 32G mem), 1 prometheus (8cpu 32G mem), simulating 2500 node_exporters (1000 metrics/min * 2500 = 2.5 million metrics/min). But it seems not stable.

Warnings were logged on the ingest side:

```
level=info ts=2022-03-01T12:36:33.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=35000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z
level=info ts=2022-03-01T12:36:34.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=35000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z
level=warn ts=2022-03-01T12:36:34.482Z caller=watcher.go:101 msg="[WARNING] Ingestion is a very long time" duration=5m9.705407837s threshold=1m0s
level=info ts=2022-03-01T12:36:35.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=35000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z
level=info ts=2022-03-01T12:36:40.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=70000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z
```

and errors in prometheus:

```
Mar 01 20:38:55 localhost start-prometheus.sh[887]: ts=2022-03-01T12:38:55.288Z caller=dedupe.go:112 component=remote level=warn remote_name=ceec38 url=http://192.168.105.76:9201/write msg="Failed to send batch, retrying" err="Post \"http://192.168.105.76:9201/write\": context deadline exceeded"
```

"any way to increase the throughput at current configuration?" The recommendation was the tuned remote_write configuration shown earlier (pointed at http://promscale:9201/write). The user then upgraded PG to 14.2, applied the remote_write settings and increased promscale's memory to 32G, and it was still unstable. Keep in mind this was a 2500-node load test; at least for now, Promscale still looks like an early-stage project, and no official reliability benchmark has been published. We look forward to a stable release.
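To make the label-array representation shown in these queries concrete, here is a small Python sketch of what val() and jsonb() do: each series row stores an array of label ids, and the functions resolve the ids back to key/value text. The id-to-label mapping below is invented for illustration; Promscale keeps it in its own catalog tables.

```python
# Toy model of Promscale's label storage: a series row holds label ids,
# and val()/jsonb() resolve them through a label dictionary.
# The ids and label pairs below are made up for illustration.
LABELS = {
    51: ("__name__", "node_disk_io_now"),
    140: ("device", "dm-0"),
    91: ("instance", "127.0.0.1:9100"),
    3: ("job", "prometheus"),
}

def val(label_id: int) -> str:
    """Return the label *value* for an id, like Promscale's val()."""
    return LABELS[label_id][1]

def jsonb(label_ids: list) -> dict:
    """Expand an id array into a full label set, like Promscale's jsonb()."""
    return {LABELS[i][0]: LABELS[i][1] for i in label_ids}

print(val(3))                   # prometheus
print(jsonb([51, 140, 91, 3]))  # the full label set as a dict
```

This is why filtering by `device_id` is cheap (an integer comparison) while `jsonb(labels)` is meant for presentation.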
2022-03-03 · 1,238 reads · 0 comments · 0 likes
2021-12-30
linuxea: Clearing kube-prometheus historical data
In k8s, pods can be replaced at any time; across the whole environment we usually care about the state of the school of fish rather than any single fish, so monitoring data is not kept for long — its reference value is limited. But sometimes you really do want to delete some metrics from Prometheus, either because they are no longer needed or just to free some disk space. Time series in Prometheus can only be deleted through the admin HTTP API (disabled by default).

--web.enable-admin-api:

> As of Prometheus 2.0, the --web.enable-admin-api flag controls access to the administrative HTTP API which includes functionality such as deleting time series. This is disabled by default. If enabled, administrative and mutating functionality will be accessible under the /api/*/admin/ paths. The --web.enable-lifecycle flag controls HTTP reloads and shutdowns of Prometheus. This is also disabled by default. If enabled they will be accessible under the /-/reload and /-/quit paths.

To enable it, pass --web.enable-admin-api to Prometheus via the startup script or the docker-compose file, depending on how it was installed.

Deleting time series metrics: use the following syntax to delete all time series matching a label (the metric-name label is `__name__`):

```bash
curl -X POST \
    -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__="linuxea.com"}'
```

To delete time series matching some job or instance, run:

```bash
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="node_exporter"}'
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="192.168.0.1:9100"}'
```

To delete all data from Prometheus, run:

```bash
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'
```

Note that the API calls above do not remove the data immediately: the actual data still exists on disk and will be cleaned up in a future compaction. To control when old data is dropped, use the --storage.tsdb.retention option, e.g. --storage.tsdb.retention='365d' (by default Prometheus keeps data for 15 days). To completely remove the data deleted by delete_series, issue a clean_tombstones API call:

```bash
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
```

Deleting all historical data: we use kube-prometheus, where adding --web.enable-admin-api works differently. Open prometheus-prometheus.yaml and add the enableAdminAPI: true field:

```yaml
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: true
  ruleSelector:
    matchLabels:
      prometheus: k8s
```

Then kubectl apply -f prometheus-prometheus.yaml. Once the pod is up, check its args:

```yaml
  containers:
    - args:
        - '--web.console.templates=/etc/prometheus/consoles'
        - '--web.console.libraries=/etc/prometheus/console_libraries'
        - '--config.file=/etc/prometheus/config_out/prometheus.env.yaml'
        - '--storage.tsdb.path=/prometheus'
        - '--storage.tsdb.retention.time=15d'
        - '--web.enable-lifecycle'
        - '--storage.tsdb.no-lockfile'
        - '--web.route-prefix=/'
        - '--web.enable-admin-api'
      image: 'quay.io/prometheus/prometheus:v2.26.0'
```

We are ready to delete with:

```bash
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'
```

Check the size before deletion:

```
[root@linuxea.com prometheus-db]# ls
01FPZGK1S0BZ9FZMSVPYWA1F52  01FQPP5PJTEMT1BARN8ARJTBKB  01FR222TD3ZWYAZTNRJP5SCJ6M  01FR4VB8N4J0D2DY979PTHZQY1  wal
01FQ59ZR5Z9VNNGY3JSXP6F9AQ  01FQWFJAC72YCBKEN19F39PYCM  01FR46R88Q9NN7GR99MMGX13VG  01FR526TSKK8JTN493PTBQH0XC
01FQB3CCFVNSP890V5KE07YD3M  01FQYDBKTHH8KB8K5GT581BKGE  01FR4MFC9JX6NK81G4Q1W0CJNY  chunks_head
01FQGWS2M82Q23H0H2TD0JXK1Y  01FR0B54YSBJ0R54B5171P4Z7R  01FR4VB3HHRN0EE1GQCM6G0A77  queries.active
[root@linuxea.com nfs-k8s]# du -sh monitoring-prometheus-k8s-db-prometheus-k8s-0-pvc-ac911baf-f1f1-4bf3-a1f8-af2cb13c5d90/
7.1G    monitoring-prometheus-k8s-db-prometheus-k8s-0-pvc-ac911baf-f1f1-4bf3-a1f8-af2cb13c5d90/
```

Find the svc and try a curl:

```
[root@linuxea.com nfs-k8s]# kubectl -n monitoring get svc | grep prometheus-k8s
prometheus-k8s   NodePort   10.68.110.123   <none>   9090:30090/TCP   128d
[root@linuxea.com nfs-k8s]# curl 10.68.110.123:9090
<a href="/graph">Found</a>.
```

Execute the delete:

```
[root@linuxea.com manifests]# curl -X POST -g 'http://10.68.110.123:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'
```

Then check the size again:

```
[root@linuxea.com manifests]# kubectl -n monitoring exec -it prometheus-k8s-0 -- /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ du -sh ./
1004.4M ./
/prometheus $ ls
01FR46R88Q9NN7GR99MMGX13VG  01FR4VB8N4J0D2DY979PTHZQY1  chunks_head
01FR4MFC9JX6NK81G4Q1W0CJNY  01FR526TSKK8JTN493PTBQH0XC  queries.active
01FR4VB3HHRN0EE1GQCM6G0A77  01FR592J0H8ANAWMFP2VVQ3NX9  wal
/prometheus $ du -sh ./
1006.6M ./
```

Completely remove the data deleted by delete_series with a clean_tombstones call:

```
[root@linuxea.com manifests]# curl -X POST -g 'http://10.68.110.123:9090/api/v1/admin/tsdb/clean_tombstones'
```

Check the size once more:

```
[root@linuxea.com manifests]# kubectl -n monitoring exec -it prometheus-k8s-0 -- /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ du -sh ./
986.6M  ./
```

When the deletion is done, disable the admin API again by setting enableAdminAPI: false.

Further reading: tsdb-admin-apis, "Add flag to enable prometheus web admin API and fixes #1215", kube-prometheus, spec, prometheus-prometheus.yaml.
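When scripting these calls, the `match[]` selector must be URL-encoded. A small standard-library Python sketch of building the delete_series URL (the base URL is a placeholder for the example; the API path matches the one used above):

```python
import urllib.parse

# Build a Prometheus delete_series URL with a properly encoded match[] selector.
# The base URL is a placeholder; the API path is the one shown in the article.
def delete_series_url(base: str, selector: str) -> str:
    query = urllib.parse.urlencode({"match[]": selector})
    return f"{base}/api/v1/admin/tsdb/delete_series?{query}"

url = delete_series_url("http://localhost:9090", '{job="node_exporter"}')
print(url)
# http://localhost:9090/api/v1/admin/tsdb/delete_series?match%5B%5D=%7Bjob%3D%22node_exporter%22%7D
```

With the URL built, the actual call is a POST (e.g. `urllib.request.urlopen(urllib.request.Request(url, method="POST"))`); curl's `-g` flag in the article exists only to stop curl from interpreting the literal brackets, which encoding makes unnecessary.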
2021-12-30 · 1,759 reads · 0 comments · 1 like
2021-12-16
linuxea: On CPUThrottlingHigh in kube-prometheus
The scenario we ran into: the CPUThrottlingHigh alert fires as designed, yet the CPU of the affected object is not high, or even idle. This made us question whether the alert is meaningful. In many situations this alert gets modified or silenced, because the application is not latency-sensitive and works fine even when throttled, and because the alert is based on a cause rather than a symptom — which is why its severity is only info. That does not make it a false positive, though, and silencing it merely hides the real underlying problem. The issue is still being discussed, most intensely in kubernetes-mixin issue 108, with further discussion in kubernetes issue 67577.

The alert expression is:

```
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
  /
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
  > ( 25 / 100 )
```

The commonly suggested ways of dealing with it:

1. Raise the alert threshold, or disable the alert.
2. Remove or raise the CPU limits on the affected pods.
3. Run kernel 4.18 or newer.
4. Disable Kubernetes CFS quotas entirely (kubelet --cpu-cfs-quota=false).

We chose to raise the threshold:

```bash
kubectl -n monitoring edit PrometheusRule prometheus-k8s-rules
```

and changed the rule to:

```yaml
- alert: CPUThrottlingHigh
  annotations:
    description: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/cputhrottlinghigh
    summary: Processes experience elevated CPU throttling.
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
      /
    sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
      > ( 75 / 100 )
  for: 15m
  labels:
    severity: info
```

Other related references:

- https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108
- https://github.com/prometheus-operator/prometheus-operator/issues/2063
- https://github.com/kubernetes/kubernetes/issues/67577
- https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
- https://bugzilla.kernel.org/show_bug.cgi?id=198197
- https://github.com/torvalds/linux/commit/512ac999d2755d2b7109e996a76b6fb8b888631d
- https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
- https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
- https://github.com/prometheus-operator/kube-prometheus/issues/214
- https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/453
- https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/b71dd35c6a1d509a1ee902eebe7afe943d8ee4b0/alerts/resource_alerts.libsonnet#L13
- https://www.youtube.com/watch?v=UE7QX98-kO0
- https://github.com/prometheus-operator/kube-prometheus/issues/861
- https://github.com/prometheus-operator/kube-prometheus/blob/main/jsonnet/kube-prometheus/components/alertmanager.libsonnet#L26-L42
- https://devops.stackexchange.com/questions/6494/prometheus-alert-cputhrottlinghigh-raised-but-monitoring-does-not-show-it
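The expression boils down to the fraction of CFS enforcement periods in which the container was throttled, computed from the increase of two counters over 5 minutes. A minimal Python sketch of that arithmetic (the counter samples are invented for illustration):

```python
# Fraction of CFS periods in which a container was throttled, mirroring the
# CPUThrottlingHigh expression: increase(throttled_periods) / increase(periods).
# The counter values below are made up for illustration.
def throttle_ratio(periods_start: int, periods_end: int,
                   throttled_start: int, throttled_end: int) -> float:
    periods = periods_end - periods_start        # ~ increase(container_cpu_cfs_periods_total[5m])
    throttled = throttled_end - throttled_start  # ~ increase(container_cpu_cfs_throttled_periods_total[5m])
    return throttled / periods

ratio = throttle_ratio(periods_start=10_000, periods_end=13_000,
                       throttled_start=2_000, throttled_end=2_900)
print(f"{ratio:.0%} throttled")  # 30% throttled
print(ratio > 0.25)              # True -> fires at the default 25% threshold
```

This also shows why the alert can fire on an "idle" pod: a container can be throttled in many short periods (bursty workload against a tight limit) while its average CPU usage stays low.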
2021-12-16 · 1,851 reads · 0 comments · 0 likes
2019-05-25
linuxea: Monitoring a VIP and port with Simple check in zabbix 4.2
In practice we usually monitor some ports, for services such as nginx, mariadb and php. A basic port check is easy:

```
net.tcp.listen[80]
```

This returns the state of TCP port 80; then configure a trigger for alerting:

```
{HOST.NAME} nginx is not running    {nginx_80_port:net.tcp.listen[80].last()}<>1    Enabled
```

But that is not what this article is about. Most environments have load balancing and HA redundancy, which makes VIPs (virtual IPs) unavoidable: on failure, the VIP floats to another machine, moving the services bound to it. A VIP usually comes with a port, as with lvs, haproxy or nginx. Reading this article, you will learn how to monitor VIP:PORT — and notably, adding the template once on a single host is enough to check the VIP and its port.

Create the template:

1. Configuration -> Templates -> Create template, and enter a name, e.g. Template App Telnet VIP.
2. Create an application, e.g. Telnet VIP.
3. Create an item with type: Simple check.

Why Simple check instead of telnet? Compared with telnet, Simple check is much easier to use: in my view, no scripts are needed, and a little configuration meets the requirement. So we use net.tcp.service; the official service_check_details and simple_checks pages are recommended reading.

```
Key: net.tcp.service[tcp,10.10.195.99,9200]
```

This means: get the state of TCP port 9200 on 10.10.195.99. Select Telnet VIP under Applications.

Next, create the trigger. 0 means the service is down, 1 means it is running, so the trigger is:

```
{Template App Telnet VIP:net.tcp.service[tcp,10.10.195.99,9200].last(#3)}=0
```

which fires if the last three retrieved values are all 0 (see the screenshots). With that, simple IP-and-port monitoring with zabbix is done. For auto-discovery, see the zabbix 4.2 Discovery tutorial.

Further reading: service_check_details, simple_checks, more zabbix tutorials, docker tutorials, zabbix-complete-works, "linuxea: Zabbix basic installation and configuration (Zabbix-complete-works)", "linuxea: Testing the TimescaleDB data source, new in zabbix 4.2".
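What net.tcp.service[tcp,host,port] does is essentially a TCP connect that yields 1 (up) or 0 (down). A standard-library Python sketch of the same check (the host and port are the article's placeholder values, not something you can reach from here):

```python
import socket

# Minimal analogue of zabbix's net.tcp.service[tcp,<ip>,<port>]:
# returns 1 if a TCP connection succeeds within the timeout, else 0.
def tcp_service(host: str, port: int, timeout: float = 3.0) -> int:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:  # refused, unreachable, or timed out
        return 0

# Example: check the VIP:port pair from the article (placeholder values).
print(tcp_service("10.10.195.99", 9200))
```

The `.last(#3)}=0` trigger then corresponds to alerting only when three consecutive checks return 0, which avoids flapping on a single dropped connection.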
2019-05-25 · 3,392 reads · 0 comments · 0 likes
2019-05-07
linuxea: Testing the TimescaleDB data source, new in zabbix 4.2
Zabbix released version 4.2 with a batch of new features. There is a good overview on Zabbix's own website, but be sure to check the "What's new in Zabbix 4.2" section of the documentation, because it is more complete! One new feature is experimental support for TimescaleDB, a currently popular open-source time-series SQL database packaged as a PostgreSQL extension. Built on PostgreSQL, it provides automatic partitioning across time and space (the partitioning key) while retaining the standard PostgreSQL interface and full SQL support.

Preface: why use a time-series SQL database, what is it, and how do you configure it? It starts with database partitioning — and to understand partitioning, think about the zabbix server's history data. Zabbix's housekeeping is what removes outdated information from the database: a task scans all history and trend tables for data older than the configured retention (set in the frontend) and deletes it. This process gets slow, because it scans entire tables while other internal processes run concurrently, so deletion becomes slower still and cannot keep up. Now consider partitioning: suppose we keep three months of data, grouped by day into partitions 1, 2, 3, 4, and so on. If we only want to keep the most recent day, we simply drop partitions 1-3 instead of scanning the table. First, the performance problem disappears; second, it is faster; and the disk space is actually freed. TimescaleDB is exactly such a time-series database with automatic internal partitioning — it is not a database engine, but an extension to an existing SQL database.

Installation: the official Zabbix docker images in 4.2 have an option to switch TimescaleDB support on:

```
# ENABLE_TIMESCALEDB=true
```

I tried it in my environment and, as always, chose the docker route (with docker-compose); official containers exist, see the zabbix documentation and the GitHub repository. I provide a TimescaleDB-based installation in my own github repo — see the docker-compose there. Note: at present, zabbix-proxy does not support TimescaleDB. See https://github.com/marksugar/zabbix-complete-works

Quick deployment:

```bash
curl -Lk https://raw.githubusercontent.com/marksugar/zabbix-complete-works/master/zabbix_server/zabbix-install/install_zabbix_timescaledb.sh | bash
```

timescaledb: mount the data directory locally and pass two environment variables for the user and password:

```yaml
version: '3.5'
services:
  timescaledb:
    image: timescale/timescaledb:latest-pg11-oss
    container_name: timescaledb
    restart: always
    network_mode: "host"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /data/zabbix/postgresql/data:/var/lib/postgresql/data:rw
    user: root
    stop_grace_period: 1m
    environment:
      - POSTGRES_USER=zabbix
      - POSTGRES_PASSWORD=abc123
    logging:
      driver: "json-file"
      options:
        max-size: "1G"
```

zabbix: use the pgsql images zabbix/zabbix-server-pgsql:alpine-4.2-latest and zabbix/zabbix-web-nginx-pgsql:alpine-4.2-latest.

zabbix-server-pgsql: point the database connection at TimescaleDB and enable housekeeping frequency settings:

```yaml
    environment:
      - ENABLE_TIMESCALEDB=true
      - DB_SERVER_HOST=127.0.0.1
      - POSTGRES_DB=zabbix
      - POSTGRES_USER=zabbix
      - POSTGRES_PASSWORD=abc123
      - ZBX_HOUSEKEEPINGFREQUENCY=1
      - ZBX_MAXHOUSEKEEPERDELETE=100000
```

zabbix-web-nginx-pgsql: the environment variables need the same changes:

```yaml
    environment:
      - DB_SERVER_HOST=127.0.0.1
      - POSTGRES_DB=zabbix
      - POSTGRES_USER=zabbix
      - POSTGRES_PASSWORD=abc123
      - ZBX_SERVER_HOST=127.0.0.1
```

Once configured, the options mentioned in the documentation are enabled by default in the web UI. To manage history and trends with partitions, these options must be enabled; you can use TimescaleDB partitioning for trends only (by setting "Override item trend period") or for history only ("Override item history period"). You can verify this under Administration → General → Housekeeping, where both "Override item history period" and "Override item trend period" should be ticked.

With zabbix now running on TimescaleDB, the database benefits on both query and eviction: before TimescaleDB, housekeeping removed data with many DELETE queries, which certainly hurt overall performance. With TimescaleDB's chunked tables, outdated data is dropped as whole chunks, with far less performance burden.

Test: if history retention is set to one day, older data in the database will be removed. The storage looks like this after a few days of data:

```
[root@LinuxEA ~]# docker exec -it timescaledb bash
bash-4.4# psql -U zabbix
psql (11.2)
Type "help" for help.

zabbix=# \d+ history
                              Table "public.history"
 Column |     Type      | Collation | Nullable | Default | Storage | Stats target 
--------+---------------+-----------+----------+---------+---------+--------------
 itemid | bigint        |           | not null |         | plain   | 
 clock  | integer       |           | not null | 0       | plain   | 
 value  | numeric(16,4) |           | not null | 0.0000  | main    | 
 ns     | integer       |           | not null | 0       | plain   | 
Indexes:
    "history_1" btree (itemid, clock)
    "history_clock_idx" btree (clock DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON history FOR EACH ROW EXECUTE PROCEDURE _timescaledb_internal.insert_blocker()
Child tables: _timescaledb_internal._hyper_1_11_chunk,
              _timescaledb_internal._hyper_1_16_chunk,
              _timescaledb_internal._hyper_1_21_chunk,
              _timescaledb_internal._hyper_1_26_chunk,
              _timescaledb_internal._hyper_1_6_chunk

zabbix=# \d+ trends
                              Table "public.trends"
 Column |     Type      | Collation | Nullable | Default | Storage | Stats target 
--------+---------------+-----------+----------+---------+---------+--------------
 itemid | bigint
```
| | not null | | plain | | clock | integer | | not null | 0 | plain | | num | integer | | not null | 0 | plain | | value_min | numeric(16,4) | | not null | 0.0000 | main | | value_avg | numeric(16,4) | | not null | 0.0000 | main | | value_max | numeric(16,4) | | not null | 0.0000 | main | | Indexes: "trends_pkey" PRIMARY KEY, btree (itemid, clock) "trends_clock_idx" btree (clock DESC) Triggers: ts_insert_blocker BEFORE INSERT ON trends FOR EACH ROW EXECUTE PROCEDURE _timescaledb_internal.insert_blocker() Child tables: _timescaledb_internal._hyper_6_14_chunk, _timescaledb_internal._hyper_6_19_chunk, _timescaledb_internal._hyper_6_24_chunk, _timescaledb_internal._hyper_6_29_chunk, _timescaledb_internal._hyper_6_9_chunk我们修改history和trends为1天后进行清理试试看,我们现在即将进行删除操作,timescaledb中的数据看似是三天的,其实只有两天的数据量,包含一个最早一天的和当前一天的,以保留一天为例开始清理[root@LinuxEA ~]# docker exec -it zabbix-server-pgsql bash bash-4.4# zabbix_server -R config_cache_reload zabbix_server [260]: command sent successfully bash-4.4# zabbix_server -R housekeeper_execute zabbix_server [261]: command sent successfully在回到timescaledbzabbix=# \d+ history Table "public.history" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+---------------+-----------+----------+---------+---------+--------------+------------- itemid | bigint | | not null | | plain | | clock | integer | | not null | 0 | plain | | value | numeric(16,4) | | not null | 0.0000 | main | | ns | integer | | not null | 0 | plain | | Indexes: "history_1" btree (itemid, clock) "history_clock_idx" btree (clock DESC) Triggers: ts_insert_blocker BEFORE INSERT ON history FOR EACH ROW EXECUTE PROCEDURE _timescaledb_internal.insert_blocker() Child tables: _timescaledb_internal._hyper_1_21_chunk, _timescaledb_internal._hyper_1_26_chunkzabbix=# \d+ trends Table "public.trends" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------------+-----------+----------+---------+---------+--------------+------------- itemid | bigint | | not null | | plain | | clock | integer | | not null | 0 | plain | | num | integer | | not null | 0 | plain | | value_min | numeric(16,4) | | not null | 0.0000 | main | | value_avg | numeric(16,4) | | not null | 0.0000 | main | | value_max | numeric(16,4) | | not null | 0.0000 | main | | Indexes: "trends_pkey" PRIMARY KEY, btree (itemid, clock) "trends_clock_idx" btree (clock DESC) Triggers: ts_insert_blocker BEFORE INSERT ON trends FOR EACH ROW EXECUTE PROCEDURE _timescaledb_internal.insert_blocker() Child tables: _timescaledb_internal._hyper_6_24_chunk, _timescaledb_internal._hyper_6_29_chunk为了看的更明显,我们在web查看自动发现参考自动发现基于zabbix4.2-zabbix-Discovery教程延伸阅读zabbix TimescaleDB阅读更多zabbix监控教程docker教程zabbix-complete-workslinuxea:Zabbix-complete-works之Zabbix基础安装配置
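The chunk-based housekeeping shown above can be sketched as a toy model: instead of scanning tables and DELETEing rows, whole daily chunks outside the retention window are dropped in one cheap operation. This is an illustration only, not TimescaleDB's actual API; `chunks_to_drop` is a hypothetical helper name.

```python
from datetime import date, timedelta

def chunks_to_drop(chunk_days, keep_days, today):
    """Return the daily chunks older than the retention window.

    Dropping a whole chunk (child table) frees its disk space at once,
    which is why this is so much cheaper than row-by-row DELETEs.
    """
    cutoff = today - timedelta(days=keep_days)
    return [d for d in chunk_days if d < cutoff]

# Keep 1 day of history, as in the test above: older daily chunks go.
days = [date(2019, 4, 20), date(2019, 4, 21), date(2019, 4, 22)]
doomed = chunks_to_drop(days, keep_days=1, today=date(2019, 4, 22))
```

With a three-day retention nothing would be dropped; with one day, only the chunks before yesterday are selected.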
May 7, 2019
5,210 reads
2 comments
0 likes
2019-04-22
linuxea:Zabbix-complete-works之Zabbix基础安装配置
I spent some time putting together a set of Zabbix installation scripts to make deployment easier. It covers the installation and initial configuration of zabbix-server and zabbix-agent; docker-compose was added after 4.0, and the server side has been installed with Docker ever since. The latest update introduces elasticsearch 6.1.4. Git repository: https://github.com/marksugar/zabbix-complete-works. If you like this project, click ♥ Star or Fork in the top-right corner of zabbix-complete-works on GitHub.

I use a docker-compose file to orchestrate the server side; the zabbix-agent is installed with a script. For docker and docker-compose, follow the official docker installation guide and the official docker-compose installation guide.

先睹为快 (A quick look)

I really like the Graph feature of recent Zabbix versions: it lets me see a custom group, or a subset of hosts combined with certain items, in a single graph, which is very useful.

zabbix-server

The latest stable 4.2 is used, with elasticsearch added. Elasticsearch support is still under development and only certain versions are supported; I use 6.1.4 here. I mainly walk through the Zabbix parameters: since Docker is used here, they are worth understanding for a quick install. See the docker-compose.yaml file of the zabbix-complete-works project.

zabbix/zabbix-server-mysql:alpine-4.2-latest

As before, data is kept on the local host:

```yaml
volumes:
  - /etc/localtime:/etc/localtime:ro
  - /etc/timezone:/etc/timezone:ro
  - /data/zabbix/zbx_env/usr/lib/zabbix/alertscripts:/usr/lib/zabbix/alertscripts:ro
  - /data/zabbix/zbx_env/usr/lib/zabbix/externalscripts:/usr/lib/zabbix/externalscripts:ro
  - /data/zabbix/zbx_env/var/lib/zabbix/modules:/var/lib/zabbix/modules:ro
  - /data/zabbix/zbx_env/var/lib/zabbix/enc:/var/lib/zabbix/enc:ro
  - /data/zabbix/zbx_env/var/lib/zabbix/ssh_keys:/var/lib/zabbix/ssh_keys:ro
  - /data/zabbix/zbx_env/var/lib/zabbix/mibs:/var/lib/zabbix/mibs:ro
  - /data/zabbix/zbx_env/var/lib/zabbix/snmptraps:/var/lib/zabbix/snmptraps:rw
```

Environment variables:

```yaml
environment:
  - DB_SERVER_HOST=127.0.0.1
  - MYSQL_DATABASE=zabbix
  - MYSQL_USER=zabbix
  - MYSQL_PASSWORD=password
  - MYSQL_ROOT_PASSWORD=abc123
  - ZBX_HISTORYSTORAGEURL=http://127.0.0.1:9200   # elasticsearch
  - ZBX_HISTORYSTORAGETYPES=dbl,uint,str,log,text # storage types for elasticsearch
  - DebugLevel=5
  - HistoryStorageDateIndex=1
  - ZBX_STARTDISCOVERERS=10
```

These environment variables correspond to the configuration parameters of zabbix-server.conf, just with a ZBX_ prefix in front.

Note:
1. MYSQL_ROOT_PASSWORD is the database root password. When the root password is provided, zabbix-server automatically creates the user and imports the SQL; watch the logs for errors.
2. elasticsearch is used here. According to the official documentation, after the server is configured, the web side needs matching settings as well:

```yaml
  - ZBX_HISTORYSTORAGEURL=http://127.0.0.1:9200   # elasticsearch
  - ZBX_HISTORYSTORAGETYPES=dbl,uint,str,log,text
```

zabbix/zabbix-web-nginx-mysql:alpine-4.2-latest

Environment variables:

```yaml
environment:
  - DB_SERVER_HOST=127.0.0.1
  - MYSQL_DATABASE=zabbix
  - MYSQL_USER=zabbix
  - MYSQL_PASSWORD=password
  - ZBX_SERVER_HOST=127.0.0.1
  - ZBX_HISTORYSTORAGEURL=http://127.0.0.1:9200
  - ZBX_HISTORYSTORAGETYPES=['dbl','uint','str', 'text', 'log'] # uint,dbl,str,log,text
```

The password given here is the same one provided in the zabbix-server environment, i.e. the password the web frontend uses to connect to the database. The elasticsearch settings must match the server's. These variables end up substituted into the configuration file inside the container:

```yaml
  - ZBX_HISTORYSTORAGEURL=http://127.0.0.1:9200
  - ZBX_HISTORYSTORAGETYPES=['dbl','uint','str', 'text', 'log']
```

快速安装 (Quick install)

```bash
mkdir /data/zabbix -p
curl -Lk https://raw.githubusercontent.com/marksugar/zabbix-complete-works/master/zabbix_server/graphfont.TTF -o /data/zabbix/graphfont.ttf
wget https://raw.githubusercontent.com/marksugar/zabbix-complete-works/master/zabbix_server/docker_zabbix_server/docker-compose.yaml
docker-compose -f docker-compose.yaml up -d
```

> elasticsearch

Mind the permissions; in this docker-compose example you need:

```bash
chown -R 1000.1000 /data/elasticsearch/
```

I have prepared the index files; just create the indices (creating the indices is essential). You can also follow the official documentation. Normally you will see something like:

```
$ curl http://127.0.0.1:9200/_cat/indices?v
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   str                         MQWM2bNNRzOvBywM7ne-lw   5   1          0            0      1.1kb          1.1kb
yellow open   .monitoring-es-6-2019.04.20 tIfs0MkNQUCI4YuEHRmQ6g   1   1       1926          208    901.6kb        901.6kb
yellow open   dbl                         Y0992hqaR8KTin9iXKsljQ   5   1          0            0      1.1kb          1.1kb
yellow open   text                        s2XMyJtdQQ27b9rS3nWVfg   5   1          0            0      1.1kb          1.1kb
yellow open   log                         MAysNczpSKGZbjfjJXBvTg   5   1          0            0      1.1kb          1.1kb
yellow open   uint                        JA_8kyXlSLqawyHzo28Ggw   5   1          0            0      1.1kb          1.1kb
```

zabbix-agent

快速部署 (Quick deployment)

```bash
curl -Lk https://raw.githubusercontent.com/marksugar/zabbix-complete-works/master/zabbix_agent/install-agentd.sh | bash -s local IPADDR
```

The default items monitored by the zabbix-agent add-on scripts are:

- /root/.ssh/authorized_keys
- /etc/passwd
- /etc/zabbix/zabbix_agentd.conf
- OOM
- iptables
- disk io
- tcp
- nginx and php-fpm
- mariadb-galera

The configuration files and scripts are packed in a zabbix_agent_status.tar.gz archive.

自动发现 (Auto-discovery): see the Zabbix 4.2 Discovery tutorial.
阅读更多 (Read more): zabbix monitoring tutorials, docker tutorials, zabbix-complete-works

linuxea:zabbix4.2新功能之TimescaleDB数据源测试
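As noted above, the container environment variables map to zabbix_server.conf parameters simply by a ZBX_ prefix. A small sketch of that naming convention (the real mapping is done by the image's entrypoint script; `env_to_zabbix_param` is a hypothetical helper for illustration):

```python
def env_to_zabbix_param(env_name):
    """Derive the zabbix_server.conf parameter an environment
    variable overrides: strip the ZBX_ prefix if present.
    Variables without the prefix (e.g. DebugLevel=5 above)
    are passed through as-is."""
    prefix = "ZBX_"
    if env_name.startswith(prefix):
        return env_name[len(prefix):]
    return env_name

param = env_to_zabbix_param("ZBX_STARTDISCOVERERS")  # -> the StartDiscoverers setting
```

So `ZBX_STARTDISCOVERERS=10` above corresponds to `StartDiscoverers=10` in zabbix_server.conf (the entrypoint matches the name case-insensitively).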
April 22, 2019
5,720 reads
1 comment
0 likes
2018-12-25
linuxea:prometheus基于主机的自动发现(promcr)
Prometheus offers many service-discovery mechanisms ("scrape them") in the official documentation; here I use consul_sd_config together with registrator. registrator runs on each node, discovers containers, and registers the discovery information with consul. The topology is shown above.

In Prometheus, consul_sd_config is used for relabeling; the available meta labels are:

- __meta_consul_address: the address of the target
- __meta_consul_dc: the datacenter name for the target
- __meta_consul_metadata_<key>: each node metadata key value of the target
- __meta_consul_node: the node name defined for the target
- __meta_consul_service_address: the service address of the target
- __meta_consul_service_id: the service ID of the target
- __meta_consul_service_metadata_<key>: each service metadata key value of the target
- __meta_consul_service_port: the service port of the target
- __meta_consul_service: the name of the service the target belongs to
- __meta_consul_tags: the list of tags of the target joined by the tag separator

By default the target's IP and port are assembled as <__meta_consul_address>:<__meta_consul_service_port>. However, in some Consul setups the relevant address is in __meta_consul_service_address; in those cases the relabel feature can be used to replace the special __address__ label.

pcr

For convenience I created a pcr project on GitHub: https://github.com/marksugar/pcr. Here is a brief walkthrough of host-based auto-discovery; see the GitHub README for details.

registrator

registrator runs on every node. To register with consul, each node must have an independent IP address — in practice the host's IP. What we need is to obtain each host's IP address automatically, so it is worth re-packaging the gliderlabs/registrator:v7 container with a script that fetches it:

```sh
#!/bin/sh
# maintainer="linuxea.com"
NDIP=`ip a s ${NETWORK_DEVIDE:-eth0}|awk '/inet/{print $2}'|sed -r 's/\/[0-9]{1,}//'`
exec /bin/registrator -ip="${NDIP}" ${ND_CMD:--internal=false} consul://${NDIPSERVER_IP:-consul}:8500
```

Several variables must be passed to use this image:

- NETWORK_DEVIDE: NIC name
- NDIPSERVER_IP: consul server IP
- ND_CMD: extra arguments

Example:

```yaml
environment:
  - REGISTRATOR_BIND_INTERFACE=eth0
  - NETWORK_DEVIDE=eth0
  - NDIPSERVER_IP=172.25.250.249
  - ND_CMD=-internal=false
```

In the template the arguments look like this:

```yaml
- ND_CMD=-internal=false -retry-interval=30 -resync=180
```

With -retry-interval=30, registrator automatically retries contacting the CONSUL_SERVER.

compose:

```yaml
registrator:
  container_name: registrator
  image: marksugar/registrator:v7.1
  network_mode: "host"
  depends_on:
    - consul
    - cadvisor
    - node_exporter
    - alertmanager
    - grafana
    - prometheus
  volumes:
    - /var/run/docker.sock:/tmp/docker.sock
  environment:
    - REGISTRATOR_BIND_INTERFACE=eth0
    - NETWORK_DEVIDE=eth0
    - NDIPSERVER_IP=172.25.250.249
    - ND_CMD=-internal=false -retry-interval=30 -resync=360
  cpu_shares: 14
  mem_limit: 50m
  logging:
    driver: "json-file"
    options:
      max-size: "200M"
  labels:
    SERVICE_TAGS: prometheus
```

consul

consul can be a cluster or a single node; here it runs as a single node. Note the NIC bound with -bind '{{ GetInterfaceIP \"eth0\" }}' — change it if yours is not eth0:

```yaml
consul:
  container_name: consul
  image: consul:1.4.0
  network_mode: "host"
  ports:
    - 8500:8500
  command: "agent -server -ui -client=0.0.0.0 -dev -node=node0 -bind '{{ GetInterfaceIP \"eth0\" }}' -bootstrap-expect=1 -data-dir=/consul/data"
  labels:
    SERVICE_IGNORE: 'true'
  environment:
    - CONSUL_CLIENT_INTERFACE=eth0
  cpu_shares: 30
  mem_limit: 1024m
  logging:
    driver: "json-file"
    options:
      max-size: "200M"
  volumes:
    - ./consul/config:/consul/config
    - ./consul/data:/consul/data
```

Registered machines all show up in the web UI, roughly like this.

系统与容器 (System and containers)

prom/node-exporter:v0.16.0 is chosen deliberately: v0.16.0 discovers disks, while in my testing prom/node-exporter:v0.17.0 does not. cadvisor listens on port 18880, a port unlikely to be already taken.

```yaml
node_exporter:
  image: prom/node-exporter:v0.16.0
  container_name: node_exporter
  user: root
  privileged: true
  network_mode: "host"
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
  restart: unless-stopped
  ports:
    - 9100:9100
  cpu_shares: 14
  mem_limit: 50m
  logging:
    driver: "json-file"
    options:
      max-size: "200M"
  labels:
    - "SERVICE_TAGS=prometheus"

cadvisor:
  image: google/cadvisor:v0.32.0
  container_name: cadvisor
  network_mode: "host"
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:rw
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
  command: --listen_ip="0.0.0.0" --port=18880
  restart: unless-stopped
  ports:
    - 18880:18880
  cpu_shares: 14
  mem_limit: 50m
  logging:
    driver: "json-file"
    options:
      max-size: "200M"
  labels:
    SERVICE_TAGS: prometheus
```

prometheus

In prometheus I changed the retention time and adjusted the configuration:

```yaml
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention=45d'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
```

Each container type gets its own job; e.g. cadvisor:

```yaml
- job_name: 'cadvisor'
  metrics_path: /metrics
  scheme: http
  consul_sd_configs:
    - server: 127.0.0.1:8500
      services: ['cadvisor']
  relabel_configs:
    - source_labels: ['__meta_consul_service']
      regex: '(.*)'
      target_label: 'job'
      replacement: '$1'
    - source_labels: [__meta_consul_tags]
      target_label: tags
    - source_labels: ['__meta_consul_service_address']
      regex: '(.*)'
      target_label: 'instance'
      replacement: '$1'
    - source_labels: ['__meta_consul_service_address', '__meta_consul_service_port']
      regex: '(.*);(.*)'
      target_label: '__address__'
      replacement: '$1:$2'
    - source_labels: ['__meta_consul_tags']
      regex: ',(prometheus|app),'
      target_label: 'group'
      replacement: '$1'
```

In these relabel_configs:

- the service name discovered in __meta_consul_service is rewritten into job
- the value of __meta_consul_tags is rewritten into tags
- the IP obtained from __meta_consul_service_address is rewritten into instance
- __meta_consul_service_address and __meta_consul_service_port are the IP and port; $1 and $2 refer to the first and second captured groups

分组 (Grouping)

But this is not enough. Once targets are discovered, they need a label to tell them apart, otherwise we cannot use them flexibly. We need a label like this:

```yaml
- source_labels: ['__meta_consul_tags']
  regex: ',(prometheus|app),'
  target_label: 'group'
  replacement: '$1'
```

If __meta_consul_tags contains the field prometheus or app, it is rewritten into a group — that is what the rule above means. This also explains why every container carries a label:

```yaml
labels:
  SERVICE_TAGS: prometheus
```

We can use this label to group projects and divide them flexibly.

grafana

With this label in place, we can split views by different labels. A project group contains multiple groups, and each group several machines, divided here by IP. But this is still not enough: some containers do not run in k8s but directly on a host, so the template does not fit perfectly; after re-editing, it shows, per project and per container type within the project, the hosts under each container. There are further templates in the GitHub repo for reference. On how to group: first you need a label, or several different labels — split large groups by label (type, project, web, and so on), then combine each label with an IP or a container name into a series.

Project: https://github.com/marksugar/pcr — I use this tool myself and keep updating and maintaining it; if it helps you, please drop by and star it.

延伸阅读 (Further reading):
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
https://prometheus.io/blog/2015/06/01/advanced-service-discovery/
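The `__address__` relabel rule above can be simulated in a few lines: Prometheus joins the source labels with `;` (its default separator), matches `(.*);(.*)`, and writes `$1:$2` into the target label. This is a simplified sketch of that one rule, not Prometheus's real relabeling engine.

```python
import re

def relabel_address(labels):
    """Mimic the relabel rule: join __meta_consul_service_address and
    __meta_consul_service_port with ';', match '(.*);(.*)', and write
    '$1:$2' into __address__."""
    src = ";".join([labels["__meta_consul_service_address"],
                    labels["__meta_consul_service_port"]])
    m = re.fullmatch(r"(.*);(.*)", src)
    if m:
        labels["__address__"] = "%s:%s" % (m.group(1), m.group(2))
    return labels

labels = relabel_address({
    "__meta_consul_service_address": "172.25.250.249",
    "__meta_consul_service_port": "9100",
})
```

This is exactly why, in setups where the usable IP lives in `__meta_consul_service_address` rather than `__meta_consul_address`, the rule is needed at all.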
December 25, 2018
5,559 reads
0 comments
0 likes
2018-10-26
linuxea:zabbix4.0通过slack发送警报
After setting up alerting via Telegram earlier, I found it produced quite a few false positives; trying Slack instead turned out to be simple, clear, and rather pleasant to use. The code comes from GitHub; see this post. Note that neither the earlier Telegram setup nor this Slack one does alert convergence; there are ready-made alert-convergence projects on GitHub if you are interested.

zabbix配置 (zabbix configuration)

Download the slack.sh script and place it under /usr/lib/zabbix/alertscripts:

```bash
[root@DT_Node-172_25_250_249 ~]# curl -Lk https://raw.githubusercontent.com/ericoc/zabbix-slack-alertscript/master/slack.sh -o /usr/lib/zabbix/alertscripts/slack.sh
[root@DT_Node-172_25_250_249 /usr/lib/zabbix/alertscripts]# ll
total 52
-rw-r--r-- 1 root root 1580 Oct 25 10:10 slack.sh
```

Enable the setting AlertScriptsPath=/usr/lib/zabbix/alertscripts:

```bash
[root@DT_Node-172_25_250_249 /usr/lib/zabbix/alertscripts]# grep AlertScriptsPath /etc/zabbix/zabbix_server.conf
### Option: AlertScriptsPath
# AlertScriptsPath=${datadir}/zabbix/alertscripts
AlertScriptsPath=/usr/lib/zabbix/alertscripts
```

slack

Create a channel and use a webhook: in Slack create the channel, then on the webhook page select the channel you created and obtain the webhook URL. Write the URL into the script:

```bash
url='https://hooks.slack.com/services/TDP9T4YH4UDP/frkSC='
username='linuxea.com'
```

Test from the command line:

```bash
[root@DT_Node ~]# bash slack.sh '#linuxea-zabbix-monitor' PROBLEM '!'
ok
```

zabbix web配置 (zabbix web configuration)

Configure Media types, then configure an Action — keep the Default message short. In Operations set the user media to send to; do the same for Resolved. The alert messages sent then look roughly as follows.
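For reference, Slack incoming webhooks receive a JSON body with fields like `channel`, `username` and `text`. The sketch below builds such a payload the way a wrapper around `slack.sh '#channel' PROBLEM '!'` might; the exact fields slack.sh sends are an assumption here (only the Slack incoming-webhook field names are documented), and `slack_payload` is a hypothetical helper.

```python
import json

def slack_payload(channel, subject, message, username="linuxea.com"):
    """Build the JSON body an incoming-webhook POST would carry.
    channel/username/text follow Slack's incoming-webhook API."""
    return json.dumps({
        "channel": channel,
        "username": username,
        "text": "%s: %s" % (subject, message),
    })

body = slack_payload("#linuxea-zabbix-monitor", "PROBLEM", "!")
```

The resulting string is what gets POSTed to the `hooks.slack.com/services/...` URL configured in the script.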
October 26, 2018
4,227 reads
0 comments
0 likes
2018-10-25
linuxea:Zabbix4.0通过Telegram发送告警
zabbix configured with Zabbix-in-Telegram. If you are in Hong Kong or elsewhere and need Zabbix alerting through Telegram, this post is for you. Inside mainland China, DingTalk, WeChat, QQ and similar tools are recommended instead.

先决条件 (Prerequisites)

1. Enable in the zabbix configuration:

```
AlertScriptsPath=/usr/lib/zabbix/alertscripts
```

2. Request a Telegram bot. See: https://core.telegram.org/bots#creating-a-new-bot, then follow Zabbix-in-Telegram to configure it: https://github.com/ableev/Zabbix-in-Telegram

配置Zabbix-in-Telegram (Configure Zabbix-in-Telegram)

Clone the code:

```bash
[root@Linuxea_Node ~]# git clone https://github.com/ableev/Zabbix-in-Telegram.git
Cloning into 'Zabbix-in-Telegram'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 474 (delta 3), reused 1 (delta 0), pack-reused 465
Receiving objects: 100% (474/474), 169.39 KiB | 182.00 KiB/s, done.
Resolving deltas: 100% (269/269), done.
```

Install pip:

```bash
[root@Linuxea_Node ~]# yum install python-pip
```

Install the dependencies from requirements.txt:

```bash
[root@Linuxea_Node ~]# cd Zabbix-in-Telegram/
[root@Linuxea_Node ~/Zabbix-in-Telegram]# pip install -r requirements.txt
```

Copy zbxtg.py, zbxtg_settings.example.py and zbxtg_group.py to /usr/lib/zabbix/alertscripts/:

```bash
[root@Linuxea_Node ~/Zabbix-in-Telegram]# cp zbxtg.py /usr/lib/zabbix/alertscripts
[root@Linuxea_Node ~/Zabbix-in-Telegram]# cp zbxtg_settings.example.py /usr/lib/zabbix/alertscripts/
[root@Linuxea_Node ~/Zabbix-in-Telegram]# cp zbxtg_group.py /usr/lib/zabbix/alertscripts/
```

Then edit zbxtg_settings.py, changing mainly three settings:

```python
tg_key = "KEY"  # telegram bot api key

zbx_server = "http://www.linuxea.com/zabbix/"  # zabbix server full url
zbx_api_user = "Admin"
zbx_api_pass = "zabbix"
```

tg_key is generated when you request the bot. The zabbix credentials must belong to a user that can log in and has sufficient permissions; Admin works. You can test with ./zbxtg.py "group name And username" "test" --group (create the group first and add the bot to it).

配置zabbix-server-web (Configure the zabbix web frontend)

Create Media types: create the necessary Media types.

Create users: this part was added later, so the earlier screenshots differ slightly because the environments differ, but the steps are exactly the same.

Create a group: we create the users needed to send alerts, and for convenience we should create a group first. Administration → User groups → Create user group; fill in the name, then under Permissions select read, choose all via Select, click Add to add them to Permissions, and finally Add to create the user group.

Create a user: Administration → Users → Create user; fill in the name, and under Groups click Select and pick the telegram_group group just created, as in the figure. Then under Media click Add; in the dialog, for Type choose the Media type created earlier. "Send to", in this Telegram example, is the Telegram group name (Zabbix-in-Telegram). Here, only Disaster alerts are selected for sending.

创建 action (Create an action)

In the frontend go to Configuration → Actions → Triggers → Create action. In the action's New condition choose Trigger severity: select High and Disaster, so the action is triggered whenever Disaster or High occurs. Under Operations, fill in the message sent on trigger.

Default subject:

```
Alert host: {HOST.NAME}
Problem detail: {ITEM.NAME}:{ITEM.VALUE}
Alert time: {EVENT.DATE} {EVENT.TIME}
Severity: {TRIGGER.SEVERITY}
Alert info: {TRIGGER.NAME}
Item key: {TRIGGER.KEY1}
Current status: {TRIGGER.STATUS}.{ITEM.VALUE}
Event ID: {EVENT.ID}
zbxtg;graphs
zbxtg;graphs_period=10800
zbxtg;itemid:{ITEM.ID1}
zbxtg;title:{HOST.HOST} - {TRIGGER.NAME}
```

Then add the user permissions and media, as in the figure. Recovery operations work the same as Operations.

Default subject:

```
Recovered host: {HOST.NAME}
Problem detail: {ITEM.NAME}:{ITEM.VALUE}
Recovery time: {EVENT.DATE} {EVENT.TIME}
Severity: {TRIGGER.SEVERITY}
Recovered item: {TRIGGER.KEY1}
Current status: {TRIGGER.STATUS}.{ITEM.VALUE}
Event ID: {EVENT.ID}
zbxtg;graphs
zbxtg;graphs_period=10800
zbxtg;itemid:{ITEM.ID1}
zbxtg;title:{HOST.HOST} - {TRIGGER.NAME}
```

Then add the bot to the group and simulate a failure: the graph is sent to Telegram successfully.
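The `{MACRO}` placeholders in the Default subject are expanded by the Zabbix server before the message reaches the alert script. A toy re-implementation of that substitution, just to show what zbxtg.py ends up receiving (`render_macros` is a hypothetical helper; Zabbix itself performs this expansion):

```python
def render_macros(template, values):
    """Expand Zabbix {MACRO} placeholders in an action message."""
    out = template
    for name, value in values.items():
        out = out.replace("{%s}" % name, value)
    return out

msg = render_macros(
    "Alert host: {HOST.NAME} Severity: {TRIGGER.SEVERITY}",
    {"HOST.NAME": "Linuxea_Node", "TRIGGER.SEVERITY": "Disaster"},
)
```

Lines such as `zbxtg;graphs` contain no macros and pass through unchanged; zbxtg.py interprets them as its own directives.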
October 25, 2018
5,709 reads
0 comments
0 likes
2018-08-16
linuxea:logstash6和filebeat6配置笔记
Let me start with the filebeat configuration. Before that, you may want to look at the previous setup ([ELK 6.3.2 installation and configuration (cross-network forwarding)](https://www.linuxea.com/1889.html)); I have optimized the configuration again, simply because one of my directories holds several nginx logs.

配置filebeat (Configure filebeat)

Previously each log file had its own filter; now *.log matches every log ending in .log before sending to redis. With this filebeat configuration, all files under /data/wwwlogs/ ending in .log are collected under the variable %{[fields.list_id]} — in this example 172_nginx_access — and output to redis with the key name 172_nginx_access, error logs included:

```yaml
[root@linuxea-0702-DTNode01 ~]# cat /etc/filebeat/filebeat.yml
filebeat.prospectors:
- type: log
  enabled: true
  paths:
    - /data/wwwlogs/*.log
  fields:
    list_id: 172_nginx_access
  exclude_files:
    - ^access
    - ^error
    - \.gz$
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
setup.template.settings:
  index.number_of_shards: 3
output.redis:
  hosts: ["47.90.33.131:6379"]
  password: "OTdmOWI4ZTM4NTY1M2M4OTZh"
  db: 2
  timeout: 5
  key: "%{[fields.list_id]:unknow}"
```

Files can also be excluded like this:

```yaml
exclude_files: ["/var/wwwlogs/error.log"]
```

To improve performance, disable persistence on redis:

```
save ""
#save 900 1
#save 300 10
#save 60 10000
appendonly no
aof-rewrite-incremental-fsync no
```

logstash配置文件 (logstash configuration)

If you also installed logstash from rpm — what a coincidence, so did I. In logstash, tune pipeline.workers, the output workers and pipeline.batch.size; the worker count can match the number of cores, or be set somewhat higher if logstash runs on its own machine. The trimmed configuration file looks like this:

```yaml
[root@linuxea-VM-Node117 /etc/logstash]# cat logstash.yml
node.name: node1
path.data: /data/logstash/data
#path.config: *.yml
log.level: info
path.logs: /data/logstash/logs
pipeline.workers: 16
pipeline.output.workers: 16
pipeline.batch.size: 10000
pipeline.batch.delay: 10
```

pipelines 配置文件 (pipelines configuration)

The pipelines file lists all pipeline configuration files — where each pipeline lives and how many workers to start:

```yaml
[root@linuxea-VM-Node117 /etc/logstash]# cat pipelines.yml
# This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
#   https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html
- pipeline.id: 172_nginx_access
  pipeline.workers: 1
  path.config: "/etc/logstash/conf.d/172_nginx_access.conf"
- pipeline.id: 76_nginx_access
  pipeline.workers: 1
  path.config: "/etc/logstash/conf.d/76_nginx_access.conf"
```

jvm.options

In jvm.options set the initial and maximum heap size, depending on your hardware:

```
-Xms4g
-Xmx7g
```

Directory tree:

```
[root@linuxea-VM-Node117 /etc/logstash]# tree ./
./
|-- conf.d
|   |-- 172_nginx_access.conf
|   `-- 76_nginx_access.conf
|-- GeoLite2-City.mmdb
|-- jvm.options
|-- log4j2.properties
|-- logstash.yml
|-- patterns.d
|   |-- nginx
|   |-- nginx2
|   `-- nginx_error
|-- pipelines.yml
`-- startup.options

2 directories, 20 files
```

nginx配置文件 (nginx pipeline configuration)

The conf.d directory holds the individual pipeline files; there can be several. A single one looks roughly like this:

```
input {
  redis {
    host => "47.31.21.369"
    port => "6379"
    key => "172_nginx_access"
    data_type => "list"
    password => "OTdmOM4OTZh"
    threads => "5"
    db => "2"
  }
}
filter {
  if [fields][list_id] == "172_nginx_access" {
    grok {
      patterns_dir => [ "/etc/logstash/patterns.d/" ]
      match => { "message" => "%{NGINXACCESS}" }
      match => { "message" => "%{NGINXACCESS_B}" }
      match => { "message" => "%{NGINXACCESS_ERROR}" }
      match => { "message" => "%{NGINXACCESS_ERROR2}" }
      overwrite => [ "message" ]
      remove_tag => ["_grokparsefailure"]
      timeout_millis => "0"
    }
    geoip {
      source => "clent_ip"
      target => "geoip"
      database => "/etc/logstash/GeoLite2-City.mmdb"
    }
    useragent {
      source => "User_Agent"
      target => "userAgent"
    }
    urldecode {
      all_fields => true
    }
    mutate {
      gsub => ["User_Agent","[\"]",""]   # replace " in User_Agent with nothing
      convert => [ "response","integer" ]
      convert => [ "body_bytes_sent","integer" ]
      convert => [ "bytes_sent","integer" ]
      convert => [ "upstream_response_time","float" ]
      convert => [ "upstream_status","integer" ]
      convert => [ "request_time","float" ]
      convert => [ "port","integer" ]
    }
    date {
      match => [ "timestamp" , "dd/MMM/YYYY:HH:mm:ss Z" ]
    }
  }
}
output {
  if [fields][list_id] == "172_nginx_access" {
    elasticsearch {
      hosts => ["10.10.240.113:9200","10.10.240.114:9200"]
      index => "logstash-172_nginx_access-%{+YYYY.MM.dd}"
      user => "elastic"
      password => "dtopsadmin"
    }
  }
  stdout { codec => rubydebug }
}
```

The patterns referenced by the match fields live in /etc/logstash/patterns.d/:

```
patterns_dir => [ "/etc/logstash/patterns.d/" ]
match => { "message" => "%{NGINXACCESS}" }
match => { "message" => "%{NGINXACCESS_B}" }
match => { "message" => "%{NGINXACCESS_ERROR}" }
match => { "message" => "%{NGINXACCESS_ERROR2}" }
```

nginx日志grok字段 (grok patterns for the nginx access log):

```
[root@linuxea-VM-Node117 /etc/logstash]# cat patterns.d/nginx
NGUSERNAME [a-zA-Z\.\@\-\+_%]+
NGUSER %{NGUSERNAME}
NGINXACCESS %{IP:clent_ip} (?:-|%{USER:ident}) \[%{HTTPDATE:log_date}\] \"%{WORD:http_verb} (?:%{PATH:baseurl}\?%{NOTSPACE:params}(?: HTTP/%{NUMBER:http_version})?|%{DATA:raw_http_request})\" (%{IPORHOST:url_domain}|%{URIHOST:ur_domain}|-)\[(%{BASE16FLOAT:request_time}|-)\] %{NOTSPACE:request_body} %{QS:referrer_rul} %{GREEDYDATA:User_Agent} \[%{GREEDYDATA:ssl_protocol}\] \[(?:%{GREEDYDATA:ssl_cipher}|-)\]\[%{NUMBER:time_duration}\] \[%{NUMBER:http_status_code}\] \[(%{BASE10NUM:upstream_status}|-)\] \[(%{NUMBER:upstream_response_time}|-)\] \[(%{URIHOST:upstream_addr}|-)\]
```

Because a layer-4 proxy is used, some nginx logs still arrive in the compiled-in default log format, so there is a grok for that too:

```
[root@linuxea-VM-Node117 /etc/logstash]# cat patterns.d/nginx2
NGUSERNAME [a-zA-Z\.\@\-\+_%]+
NGUSER %{NGUSERNAME}
NGINXACCESS_B %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) (?:-|%{USER:ident}) \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:http_status_code} %{NOTSPACE:request_body} "%{GREEDYDATA:User_Agent}"
```

And the grok for the nginx error log:

```
[root@linuxea-VM-Node117 /etc/logstash]# cat patterns.d/nginx_error
NGUSERNAME [a-zA-Z\.\@\-\+_%]+
NGUSER %{NGUSERNAME}
NGINXACCESS_ERROR (?<time>\d{4}/\d{2}/\d{2}\s{1,}\d{2}:\d{2}:\d{2})\s{1,}\[%{DATA:err_severity}\]\s{1,}(%{NUMBER:pid:int}#%{NUMBER}:\s{1,}\*%{NUMBER}|\*%{NUMBER}) %{DATA:err_message}(?:,\s{1,}client:\s{1,}(?<client_ip>%{IP}|%{HOSTNAME}))(?:,\s{1,}server:\s{1,}%{IPORHOST:server})(?:, request: %{QS:request})?(?:, host: %{QS:client_ip})?(?:, referrer: \"%{URI:referrer})?
NGINXACCESS_ERROR2 (?<time>\d{4}/\d{2}/\d{2}\s{1,}\d{2}:\d{2}:\d{2})\s{1,}\[%{DATA:err_severity}\]\s{1,}%{GREEDYDATA:err_message}
```
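Grok patterns like NGINXACCESS are ultimately named-capture regular expressions. A much-simplified Python analogue, capturing only the client IP, timestamp, verb, path and status (the real pattern in patterns.d/nginx captures many more fields; this sketch is for illustration only):

```python
import re

# Simplified analogue of the NGINXACCESS grok pattern.
LINE = re.compile(
    r'(?P<client_ip>\S+) \S+ \[(?P<log_date>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<path>\S+) HTTP/(?P<version>[\d.]+)" '
    r'(?P<status>\d{3})'
)

def parse(line):
    """Return the captured fields as a dict, or None on no match
    (the grok equivalent of a _grokparsefailure tag)."""
    m = LINE.match(line)
    return m.groupdict() if m else None

rec = parse('1.2.3.4 - [16/Aug/2018:10:00:00 +0800] "GET /index.html HTTP/1.1" 200')
```

Each `%{PATTERN:field}` in a grok file expands to a named group like `(?P<field>...)` here, which is why multiple `match` entries can be tried in order until one fits.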
August 16, 2018
4,562 reads
0 comments
0 likes
1
2
...
9