2022-05-15
linuxea: kube-prometheus remote storage with VictoriaMetrics
We know that when running Prometheus, once the data volume reaches a certain scale, range queries become very slow, and a single heavy query can even crash Prometheus. Even if we don't need long retention, a cluster with enough pods still produces a lot of short-term data, which is already a real burden for Prometheus's own storage engine — so external storage becomes necessary. InfluxDB was popular early on, but its community was not friendly toward Prometheus, so it was abandoned early. I previously tried Promscale with TimescaleDB as Prometheus remote storage (see the article below), and the subsequent discussion suggested VictoriaMetrics is the better option. VictoriaMetrics also ships its own self-monitoring. In its official introduction, VictoriaMetrics takes a hard shot at TimescaleDB:

> It provides high data compression, so up to 70x more data points may be crammed into limited storage comparing to TimescaleDB and up to 7x less storage space is required compared to Prometheus, Thanos or Cortex.

VictoriaMetrics is one of the time-series databases usable as long-term remote storage for Prometheus monitoring data. Its GitHub page introduces it as follows (excerpt):

- Can be used directly as a Prometheus data source in Grafana
- High performance and good scalability for both ingestion and querying: up to 20x better than InfluxDB and TimescaleDB
- Optimized memory usage: up to 10x less RAM than InfluxDB and up to 7x less than Prometheus, Thanos or Cortex
- Optimized for storage with high-latency IO and low IOPS
- Provides a global query view: multiple Prometheus instances or any other data sources may ingest data into VictoriaMetrics
- Consists of a single small executable without external dependencies
- All configuration is done via explicit command-line flags with reasonable defaults
- All data is stored in the directory pointed to by the -storageDataPath command-line flag
- Easy and fast backups from instant snapshots to S3 or GCS object storage with the vmbackup/vmrestore tools
- Supports ingesting data from third-party time-series databases
- Thanks to its storage architecture, it protects stored data from corruption on unclean shutdown (OOM, hardware reset, or kill -9)
- Relabeling of metrics is supported as well

Note: VictoriaMetrics does not support Prometheus remote read. To keep alerting working, the developers recommend keeping 24 hours of data in Prometheus's local storage with --storage.tsdb.retention.time=24h, remote-writing everything to VictoriaMetrics, and viewing it through Grafana. The VictoriaMetrics wiki says remote read is unsupported because of the volume of data it would have to send, and that the remote_read API question really only matters for alerting — for which you can run a Prometheus instance with just a remote_read section and rules. In practice, keep local retention covering the duration of all configured alerting rules (usually 24 hours is enough: --storage.tsdb.retention.time=24h); Prometheus then evaluates alerting rules against local data while remote_write replicates everything to the configured URL as usual.

"Why doesn't VictoriaMetrics support the Prometheus remote read API?" is answered in the GitHub wiki: the remote read API requires transferring all the raw data for all requested metrics over the given time range. For instance, if a query covers 1000 metrics with 10K values each, the remote read API has to return 1000*10K = 10M metric values to Prometheus. This is slow and expensive. Prometheus's remote read API is not intended for querying foreign data — aka a global query view; see the linked issue for details. So just query VictoriaMetrics directly via vmui, the Prometheus Querying API, or the Prometheus data source in Grafana.

VictoriaMetrics internals

The introduction describes the storage engine as follows:

> VictoriaMetrics uses their modified version of LSM tree (Logging Structure Merge Tree). All the tables and indexes on the disk are immutable once created. When it's making the snapshot, they just create the hard link to the immutable files. VictoriaMetrics stores the data in MergeTree, which is from ClickHouse and similar to LSM. The MergeTree has particular design decision compared to canonical LSM. MergeTree is column-oriented. Each column is stored separately. And the data is sorted by the "primary key", and the "primary key" doesn't have to be unique. It speeds up the look-up through the "primary key", and gets the better compression ratio. The "parts" is similar to SSTable in LSM; it can be merged into bigger parts. But it doesn't have strict levels. The Inverted Index is built on "mergeset" (A data structure built on top of MergeTree ideas). It's used for fast lookup by given the time-series selector.

The key concepts are the LSM tree and MergeTree. VictoriaMetrics stores data in MergeTree, which comes from ClickHouse and is similar to LSM, with some specific design decisions compared to a canonical LSM. MergeTree is column-oriented: each column is stored separately, and data is sorted by a "primary key" that does not have to be unique, which speeds up lookups through the primary key and yields a better compression ratio. "Parts" are similar to SSTables in LSM and can be merged into bigger parts, but there are no strict levels. The inverted index is built on "mergeset" (a data structure built on top of MergeTree ideas) and is used for fast lookups by a given time-series selector. For more background, see LSM Tree 原理详解: https://www.jianshu.com/p/b43b856e09bb
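Since VictoriaMetrics speaks the Prometheus querying API, the "query it directly" advice above is easy to verify by hand. A minimal sketch, assuming a single-node instance reachable on 127.0.0.1:8428 (for example via a port-forward to the Service deployed later in this post):

```bash
# Instant query against VictoriaMetrics' Prometheus-compatible API.
curl -s 'http://127.0.0.1:8428/api/v1/query?query=up' | head

# Range query over the last hour -- same API shape as Prometheus itself.
curl -s 'http://127.0.0.1:8428/api/v1/query_range' \
  --data-urlencode 'query=count(up)' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'

# VictoriaMetrics-specific: export raw samples for a selector.
curl -s 'http://127.0.0.1:8428/api/v1/export' -d 'match[]=up' | head
```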
Applying this to kube-prometheus

Install the kube-prometheus release matching your Kubernetes version:

| kube-prometheus stack | Kubernetes 1.19 | 1.20 | 1.21 | 1.22 | 1.23 |
|---|---|---|---|---|---|
| release-0.7 | ✔ | ✔ | ✗ | ✗ | ✗ |
| release-0.8 | ✗ | ✔ | ✔ | ✗ | ✗ |
| release-0.9 | ✗ | ✗ | ✔ | ✔ | ✗ |
| release-0.10 | ✗ | ✗ | ✗ | ✔ | ✔ |
| main | ✗ | ✗ | ✗ | ✔ | ✔ |

Quickstart: check out the branch that matches your cluster. If you are on ACK, uninstall ack-arms-prometheus first.

Replace the hard-to-pull images:

```
k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1  ->  v5cn/prometheus-adapter:v0.9.1
k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.4.2  ->  bitnami/kube-state-metrics:2.4.2
quay.io/brancz/kube-rbac-proxy:v0.12.0                   ->  bitnami/kube-rbac-proxy:0.12.0
```

Deploy:

```bash
$ cd kube-prometheus
$ git checkout main
kubectl.exe create -f .\manifests\setup\
kubectl.exe create -f .\manifests
```

Configure ingress-nginx

```
> kubectl.exe -n monitoring get svc
NAME                    TYPE        CLUSTER-IP       PORT(S)
alertmanager-main       ClusterIP   192.168.31.49    9093/TCP,8080/TCP
alertmanager-operated   ClusterIP   None             9093/TCP,9094/TCP,9094/UDP
blackbox-exporter       ClusterIP   192.168.31.69    9115/TCP,19115/TCP
grafana                 ClusterIP   192.168.130.3    3000/TCP
kube-state-metrics      ClusterIP   None             8443/TCP,9443/TCP
node-exporter           ClusterIP   None             9100/TCP
prometheus-adapter      ClusterIP   192.168.13.123   443/TCP
prometheus-k8s          ClusterIP   192.168.118.39   9090/TCP,8080/TCP
prometheus-operated     ClusterIP   None             9090/TCP
prometheus-operator     ClusterIP   None             8443/TCP
```

ingress-nginx:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring-ui
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: local.grafana.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ui
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: local.prom.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
```
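A quick smoke test of the two Ingress hosts. The controller address 192.168.3.20 below is hypothetical — substitute your ingress-nginx IP, or add hosts entries for the two names:

```bash
# Resolve the virtual hosts to the ingress controller without touching DNS;
# a 200/302 confirms the Ingress routes to the right backend Service.
curl -s -o /dev/null -w '%{http_code}\n' -H 'Host: local.prom.com'    http://192.168.3.20/
curl -s -o /dev/null -w '%{http_code}\n' -H 'Host: local.grafana.com' http://192.168.3.20/
```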
"list", "watch"] - apiGroups: [""] resources: ["events"] verbs: ["create", "update", "patch"] --- kind: ClusterRoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: run-nfs-client-provisioner subjects: - kind: ServiceAccount name: nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default roleRef: kind: ClusterRole name: nfs-client-provisioner-runner apiGroup: rbac.authorization.k8s.io --- kind: Role apiVersion: rbac.authorization.k8s.io/v1 metadata: name: leader-locking-nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default rules: - apiGroups: [""] resources: ["endpoints"] verbs: ["get", "list", "watch", "create", "update", "patch"] --- kind: RoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: leader-locking-nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default subjects: - kind: ServiceAccount name: nfs-client-provisioner # replace with namespace where provisioner is deployed namespace: default roleRef: kind: Role name: leader-locking-nfs-client-provisioner apiGroup: rbac.authorization.k8s.iovm配置创建一个pvc-victoriametricsapiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: nfs-storage namespace: default provisioner: fuseim.pri/ifs # or choose another name, must match deployment's env PROVISIONER_NAME' parameters: archiveOnDelete: "false" # Supported policies: Delete、 Retain , default is Delete reclaimPolicy: Retain --- kind: PersistentVolumeClaim apiVersion: v1 metadata: name: pvc-victoriametrics namespace: monitoring spec: accessModes: - ReadWriteMany storageClassName: nfs-storage resources: requests: storage: 10Gi准备pvc[linuxea.com ~/victoriametrics]# kubectl apply -f pvc.yaml storageclass.storage.k8s.io/nfs-storage created persistentvolumeclaim/pvc-victoriametrics created [linuxea.com ~/victoriametrics]# kubectl get pv NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM ... pvc-97bea5fe-0131-4fb5-aaa9-66eee0802cb4 10Gi RWX Retain Bound monitoring/pvc-victoriametrics ... [linuxea.com ~/victoriametrics]# kubectl get pvc -A NAMESPACE NAME STATUS VOLUME CAPACITY ... 
```
monitoring   pvc-victoriametrics   Bound    pvc-97bea5fe-0131-4fb5-aaa9-66eee0802cb4   10Gi
```

Create VictoriaMetrics and mount the PVC above; the retention period is 1w (one week):

```yaml
# vm-grafana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: victoria-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: victoria-metrics
  template:
    metadata:
      labels:
        app: victoria-metrics
    spec:
      containers:
      - name: vm
        image: victoriametrics/victoria-metrics:v1.76.1
        imagePullPolicy: IfNotPresent
        args:
        - -storageDataPath=/var/lib/victoria-metrics-data
        - -retentionPeriod=1w
        ports:
        - containerPort: 8428
          name: http
        resources:
          limits:
            cpu: "1"
            memory: 2048Mi
          requests:
            cpu: 100m
            memory: 512Mi
        readinessProbe:
          httpGet:
            path: /health
            port: 8428
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /health
            port: 8428
          initialDelaySeconds: 120
          timeoutSeconds: 30
        volumeMounts:
        - mountPath: /var/lib/victoria-metrics-data
          name: victoriametrics-storage
      volumes:
      - name: victoriametrics-storage
        persistentVolumeClaim:
          claimName: pvc-victoriametrics   # the PVC created above
---
apiVersion: v1
kind: Service
metadata:
  name: victoria-metrics
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 8428
    protocol: TCP
    targetPort: 8428
  selector:
    app: victoria-metrics
  type: ClusterIP
```

Apply it:

```
[linuxea.com ~/victoriametrics]# kubectl apply -f vmctoriametrics.yaml
deployment.apps/victoria-metrics created
service/victoria-metrics created
[linuxea.com ~/victoriametrics]# kubectl -n monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   88         268d
blackbox-exporter-55c457d5fb-6rc8m     3/3     Running   114        260d
grafana-756dc9b545-b2skg               1/1     Running   38         260d
kube-state-metrics-76f6cb7996-j2hx4    3/3     Running   153        260d
node-exporter-4hxzp                    2/2     Running   120        316d
node-exporter-54t9p                    2/2     Running   124        316d
node-exporter-8rfht                    2/2     Running   120        316d
node-exporter-hqzzn                    2/2     Running   126        316d
prometheus-adapter-59df95d9f5-7shw5    1/1     Running   78         260d
prometheus-k8s-0                       2/2     Running   89         268d
prometheus-operator-7775c66ccf-x2wv4   2/2     Running   115        260d
promoter-66f6dd475c-fdzrx              1/1     Running   3          8d
victoria-metrics-56d47f6fb-qmthh       0/1     Running   0          15s
[linuxea.com ~/victoriametrics]# kubectl -n monitoring get svc
NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-main           NodePort    10.68.30.147    <none>        9093:30092/TCP               316d
alertmanager-operated       ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   316d
blackbox-exporter           ClusterIP   10.68.25.245    <none>        9115/TCP,19115/TCP           316d
etcd-k8s                    ClusterIP   None            <none>        2379/TCP                     316d
external-node-k8s           ClusterIP   None            <none>        9100/TCP                     315d
external-pve-k8s            ClusterIP   None            <none>        9221/TCP                     305d
external-windows-node-k8s   ClusterIP   None            <none>        9182/TCP                     316d
grafana                     NodePort    10.68.133.224   <none>        3000:30091/TCP               316d
kube-state-metrics          ClusterIP   None            <none>        8443/TCP,9443/TCP            316d
node-exporter               ClusterIP   None            <none>        9100/TCP                     316d
prometheus-adapter          ClusterIP   10.68.138.175   <none>        443/TCP                      316d
prometheus-k8s              NodePort    10.68.207.185   <none>        9090:30090/TCP               316d
prometheus-operated         ClusterIP   None            <none>        9090/TCP                     316d
prometheus-operator         ClusterIP   None            <none>        8443/TCP                     316d
promoter                    ClusterIP   10.68.26.69     <none>        8080/TCP                     11d
victoria-metrics            ClusterIP   10.68.225.139   <none>        8428/TCP                     18s
```

Next, modify Prometheus's remote storage configuration. The key changes are below; other parameters are documented upstream. First, remote-write to VM:

```yaml
  remoteWrite:
  - url: "http://victoria-metrics:8428/api/v1/write"
    queueConfig:
      capacity: 5000
    remoteTimeout: 30s
```

and set Prometheus's local retention to one day (`retention: 1d`). One day of local storage exists only to serve alerting; everything remote-written to VM is viewed through Grafana.
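Before pointing Prometheus at it, it's worth confirming the VictoriaMetrics pod answers on /health — the same path the probes in the Deployment above use. A quick check from outside the cluster:

```bash
# Forward the ClusterIP Service to localhost and hit the health endpoint.
kubectl -n monitoring port-forward svc/victoria-metrics 8428:8428 &
curl -s http://127.0.0.1:8428/health   # prints "OK" when the server is ready
```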
Prometheus-prometheus.yaml:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.35.0
  name: k8s
  namespace: monitoring
spec:
  retention: 1d
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
  enableFeatures: []
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.35.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/instance: k8s
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.35.0
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 1
  resources:
    requests:
      memory: 400Mi
  remoteWrite:
  - url: "http://victoria-metrics:8428/api/v1/write"
    queueConfig:
      capacity: 5000
    remoteTimeout: 30s
  ruleNamespaceSelector: {}
  ruleSelector: {}
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.35.0
```

Barring surprises, this is rendered into the running configuration (visible at URL/config):

```yaml
remote_write:
- url: http://victoria-metrics:8428/api/v1/write
  remote_timeout: 5m
  follow_redirects: true
  queue_config:
    capacity: 5000
    max_shards: 200
    min_shards: 1
    max_samples_per_send: 500
    batch_send_deadline: 5s
    min_backoff: 30ms
    max_backoff: 100ms
  metadata_config:
    send: true
    send_interval: 1m
```

Check the logs:

```
level=info ts=2022-04-28T15:26:12.047Z caller=main.go:944 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2022-04-28T15:26:12.053Z caller=dedupe.go:112 component=remote level=info remote_name=1a1964 url=http://victoria-metrics:8428/api/v1/write msg="Starting WAL watcher" queue=1a1964
ts=2022-04-28T15:26:12.053Z caller=dedupe.go:112 component=remote level=info remote_name=1a1964 url=http://victoria-metrics:8428/api/v1/write msg="Starting scraped metadata watcher"
ts=2022-04-28T15:26:12.053Z caller=dedupe.go:112 component=remote level=info remote_name=1a1964 url=http://victoria-metrics:8428/api/v1/write msg="Replaying WAL" queue=1a1964
....
```
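At this point the same PromQL should answer on both ends: Prometheus serves it from its 1-day local window, while VictoriaMetrics serves the remote-written copy. A sketch, assuming port-forwards to prometheus-k8s (9090) and victoria-metrics (8428) as above:

```bash
# Ask both sides the same instant query; once remote write is flowing,
# the series counts should converge.
curl -s 'http://127.0.0.1:9090/api/v1/query?query=count(up)'   # Prometheus
curl -s 'http://127.0.0.1:8428/api/v1/query?query=count(up)'   # VictoriaMetrics
```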
```
totalDuration=55.219178ms remote_storage=85.51µs web_handler=440ns query_engine=719ns scrape=45.6µs scrape_sd=1.210328ms notify=4.99µs notify_sd=352.209µs rules=47.503195ms
```

Back on the NFS server:

```
[root@Node-172_16_100_49 /data/nfs-k8s/monitoring-pvc-victoriametrics-pvc-97bea5fe-0131-4fb5-aaa9-66eee0802cb4]# ll
total 0
drwxr-xr-x 4 root root 48 Apr 28 22:37 data
-rw-r--r-- 1 root root  0 Apr 28 22:37 flock.lock
drwxr-xr-x 5 root root 71 Apr 28 22:37 indexdb
drwxr-xr-x 2 root root 43 Apr 28 22:37 metadata
drwxr-xr-x 2 root root  6 Apr 28 22:37 snapshots
drwxr-xr-x 3 root root 27 Apr 28 22:37 tmp
```

Modify the Grafana configuration

The data shown so far is still fetched from Prometheus; change Grafana to read from VM instead:

```yaml
  datasources.yaml: |-
    {
      "apiVersion": 1,
      "datasources": [
        {
          "access": "proxy",
          "editable": false,
          "name": "prometheus",
          "orgId": 1,
          "type": "prometheus",
          "url": "http://victoria-metrics:8428",
          "version": 1
        }
      ]
    }
```

While we're at it, change the time zone:

```yaml
stringData:
  # change the time zone
  grafana.ini: |
    [date_formats]
    default_timezone = CST
```

The full objects:

```yaml
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 8.5.0
  name: grafana-datasources
  namespace: monitoring
stringData:
  # point the datasource at VictoriaMetrics
  datasources.yaml: |-
    {
      "apiVersion": 1,
      "datasources": [
        {
          "access": "proxy",
          "editable": false,
          "name": "prometheus",
          "orgId": 1,
          "type": "prometheus",
          "url": "http://victoria-metrics:8428",
          "version": 1
        }
      ]
    }
type: Opaque
---
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 8.5.0
  name: grafana-config
  namespace: monitoring
stringData:
  # change the time zone
  grafana.ini: |
    [date_formats]
    default_timezone = CST
type: Opaque
# grafana:
#   sidecar:
#     datasources:
#       enabled: true
#       label: grafana_datasource
#       searchNamespace: ALL
#       defaultDatasourceEnabled: false
#   additionalDataSources:
#   - name: Loki
#     type: loki
#     url: http://loki-stack.loki-stack:3100/
#     access: proxy
#   - name: VictoriaMetrics
#     type: prometheus
#     url: http://victoria-metrics-single-server.victoria-metrics-single:8428
#     access: proxy
```

The datasource is now VM: Prometheus remote-writes to VM, Grafana reads from VM, while Prometheus itself still reads its own local data.

Monitoring VM

Dashboards are version-specific, see https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/dashboards. Add scraping for VM itself:

```yaml
# victoriametrics-metrics
apiVersion: v1
kind: Service
metadata:
  name: victoriametrics-metrics
  namespace: monitoring
  labels:
    app: victoriametrics-metrics
  annotations:
    prometheus.io/port: "8428"
    prometheus.io/scrape: "true"
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 8428
    targetPort: 8428
    protocol: TCP
  selector:
    # matches the victoria-metrics Deployment's labels
    app: victoria-metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: victoriametrics-metrics
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: metrics
    path: /metrics
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: victoriametrics-metrics
```

References:
- Prometheus远程存储Promscale和TimescaleDB测试
- victoriametrics
- LSM Tree原理详解
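The ServiceMonitor above scrapes VictoriaMetrics' own /metrics endpoint, which is also easy to eyeball by hand. A sketch run from a throwaway pod in the cluster (the curl image is just an example):

```bash
# VictoriaMetrics exposes self-monitoring metrics in Prometheus format.
kubectl -n monitoring run vmcheck --rm -it --image=curlimages/curl --restart=Never -- \
  sh -c "curl -s http://victoria-metrics:8428/metrics | grep -E '^vm_(rows|free_disk_space_bytes)' | head"
```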
2022-03-03
linuxea: Testing Prometheus remote storage with Promscale and TimescaleDB
Promscale is an open-source observability backend for metrics and traces powered by SQL, built on the robust, high-performance foundation of PostgreSQL and TimescaleDB. It natively supports Prometheus metrics and OpenTelemetry traces — and, via the OpenTelemetry Collector, many other formats such as StatsD, Jaeger and Zipkin — and is 100% PromQL-compatible. Its full SQL capability lets developers correlate metrics, traces and business data to derive valuable new insights that would be impossible while the data stays siloed in separate systems. It integrates easily with Grafana and Jaeger to visualize metrics and traces.

Built on PostgreSQL and TimescaleDB, it inherits rock-solid reliability, up to 90% native compression, continuous aggregates, and the operational maturity of a system running on millions of instances worldwide. Promscale can serve as a Prometheus data source for visualization tools such as Grafana and PromLens.

Promscale consists of two components:

- The Promscale connector: a stateless service that provides the ingest interface for observability data, processes it, and stores it in TimescaleDB. It also provides an interface for querying the data with PromQL. The connector automatically sets up the data structures in TimescaleDB and handles changes to them when upgrading to a new Promscale version.
- TimescaleDB: the Postgres-based database that stores all the observability data. It provides a full SQL interface for querying, plus advanced features such as analytic functions, columnar compression and continuous aggregates. TimescaleDB offers a lot of flexibility for storing business and other data alongside, which you can then correlate with the observability data.

The Promscale connector ingests Prometheus metrics, metadata and OpenMetrics exemplars via the Prometheus remote_write interface, and OpenTelemetry traces via the OpenTelemetry protocol (OTLP). With the OpenTelemetry Collector it can also ingest metrics and traces in other formats — for example Jaeger traces and StatsD metrics.

For Prometheus metrics, the connector exposes Prometheus API endpoints for running PromQL queries and reading metadata, so tools that speak the Prometheus API (such as Grafana) can connect to Promscale directly. You can also send queries to Prometheus and have it read data back from Promscale over the remote_read interface. And you can query metrics and traces in Promscale with SQL, which opens it to the many visualization tools that integrate with PostgreSQL — Grafana, for example, supports querying Promscale through its PostgreSQL data source out of the box.

I'll try this with containers; first install docker and docker-compose:

```bash
yum install -y yum-utils \
  device-mapper-persistent-data \
  lvm2
yum-config-manager \
  --add-repo \
  https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/centos/docker-ce.repo
yum install docker-ce docker-ce-cli containerd.io docker-compose -y
```

Composition

Following the official docker setup, I arranged a compose file for testing:

```yaml
version: '2.2'
services:
  timescaledb:
    image: timescaledev/promscale-extension:latest-ts2-pg13
    container_name: timescaledb
    restart: always
    hostname: "timescaledb"
    network_mode: "host"
    environment:
      - csynchronous_commit=off
      - POSTGRES_PASSWORD=123
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /data/prom/timescaledb/data:/var/lib/postgresql/data:rw
    mem_limit: 512m
    user: root
    stop_grace_period: 1m
  promscale:
    image: timescale/promscale:0.10
    container_name: promscale
    restart: always
    hostname: "promscale"
    network_mode: "host"
    environment:
      - PROMSCALE_DB_PASSWORD=123
      - PROMSCALE_DB_PORT=5432
      - PROMSCALE_DB_NAME=postgres
      - PROMSCALE_DB_HOST=127.0.0.1
      - PROMSCALE_DB_SSL_MODE=allow
    volumes:
      - /etc/localtime:/etc/localtime:ro
      # - /data/prom/postgresql/data:/var/lib/postgresql/data:rw
    mem_limit: 512m
    user: root
    stop_grace_period: 1m
  grafana:
    image: grafana/grafana:8.3.7
    container_name: grafana
    restart: always
    hostname: "grafana"
    network_mode: "host"
    #environment:
    #  - GF_INSTALL_PLUGINS="grafana-clock-panel,grafana-simple-json-datasource"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /data/grafana/plugins:/var/lib/grafana/plugins
    mem_limit: 512m
    user: root
  prometheus:
    image: prom/prometheus:v2.33.4
    container_name: prometheus
    restart: always
    hostname: "prometheus"
    network_mode: "host"
    #environment:
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /data/prom/prometheus/data:/prometheus:rw  # NOTE: chown 65534:65534 /data/prometheus/
      - /data/prom/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /data/prom/prometheus/alert:/etc/prometheus/alert
      #- /data/prom/prometheus/ssl:/etc/prometheus/ssl
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention=45d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    mem_limit: 512m
    user: root
    stop_grace_period: 1m
  node_exporter:
    image: prom/node-exporter:v1.3.1
    container_name: node_exporter
    user: root
    privileged: true
    network_mode: "host"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
```
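With the file saved as docker-compose.yml, bringing the stack up and checking it is the usual routine:

```bash
docker-compose up -d               # start timescaledb, promscale, grafana, prometheus, node_exporter
docker-compose ps                  # all services should show Up (they share the host network)
docker-compose logs -f promscale   # watch the connector initialize its schema in TimescaleDB
```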
grafana

We run Grafana as root here precisely because we need to install a couple of plugins by hand:

```
bash-5.1# grafana-cli plugins install grafana-clock-panel
✔ Downloaded grafana-clock-panel v1.3.0 zip successfully
Please restart Grafana after installing plugins. Refer to Grafana documentation for instructions if necessary.
bash-5.1# grafana-cli plugins install grafana-simple-json-datasource
✔ Downloaded grafana-simple-json-datasource v1.4.2 zip successfully
Please restart Grafana after installing plugins. Refer to Grafana documentation for instructions if necessary.
```

Configure Grafana. Some dashboard templates can be downloaded from https://grafana.com/grafana/dashboards/?pg=hp&plcmt=lt-box-dashboards&search=prometheus and from "Visualize data in Promscale".

prometheus

We can try the remote configuration; the parameters are documented upstream:

```yaml
remote_write:
  - url: "http://127.0.0.1:9201/write"
remote_read:
  - url: "http://127.0.0.1:9201/read"
    read_recent: true
```

The tuned remote configuration looks like this:

```yaml
remote_write:
  - url: "http://127.0.0.1:9201/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '.*:.*'
        action: drop
    remote_timeout: 100s
    queue_config:
      capacity: 500000
      max_samples_per_send: 50000
      batch_send_deadline: 30s
      min_backoff: 100ms
      max_backoff: 10s
      min_shards: 16
      max_shards: 16
remote_read:
  - url: "http://127.0.0.1:9201/read"
    read_recent: true
```

prometheus.yaml as a whole:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - '127.0.0.1:9093'
rule_files:
  - "alert/host.alert.rules"
  - "alert/container.alert.rules"
  - "alert/targets.alert.rules"
scrape_configs:
  - job_name: prometheus
    scrape_interval: 30s
    static_configs:
      - targets: ['127.0.0.1:9090']
      - targets: ['127.0.0.1:9093']
      - targets: ['127.0.0.1:9100']
remote_write:
  - url: "http://127.0.0.1:9201/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '.*:.*'
        action: drop
    remote_timeout: 100s
    queue_config:
      capacity: 500000
      max_samples_per_send: 50000
      batch_send_deadline: 30s
      min_backoff: 100ms
      max_backoff: 10s
      min_shards: 16
      max_shards: 16
remote_read:
  - url: "http://127.0.0.1:9201/read"
    read_recent: true
```

Restart and check the logs:

```
ts=2022-03-03T01:35:28.123Z caller=main.go:1128 level=info msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
ts=2022-03-03T01:35:28.137Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Starting WAL watcher" queue=797d34
ts=2022-03-03T01:35:28.138Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Starting scraped metadata watcher"
ts=2022-03-03T01:35:28.277Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Replaying WAL" queue=797d34
ts=2022-03-03T01:35:38.177Z caller=main.go:1165 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=10.053377011s db_storage=1.82µs remote_storage=13.752341ms web_handler=549ns query_engine=839ns scrape=10.038744417s scrape_sd=44.249µs notify=41.342µs notify_sd=6.871µs rules=30.465µs
ts=2022-03-03T01:35:38.177Z caller=main.go:896 level=info msg="Server is ready to receive web requests."
ts=2022-03-03T01:35:53.584Z caller=dedupe.go:112 component=remote level=info remote_name=797d34 url=http://127.0.0.1:9201/write msg="Done replaying WAL" duration=25.446317635s
```
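Once the WAL has been replayed, the connector should be ingesting. Promscale exposes its own telemetry in Prometheus format — the sketch below assumes the default /metrics endpoint on the connector's port 9201; adjust if your build differs:

```bash
# Ingest-side counters from the Promscale connector itself.
curl -s http://127.0.0.1:9201/metrics | grep -i ingest | head
```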
View the data:

```
[root@localhost data]# docker exec -it timescaledb sh
/ # su - postgres
timescaledb:~$ psql
psql (13.4)
Type "help" for help.

postgres=#
```

Query a metric — disk IO over the last five minutes:

```sql
SELECT * from node_disk_io_now WHERE time > now() - INTERVAL '5 minutes';
```

```
            time            | value | series_id |    labels     | device_id | instance_id | job_id
----------------------------+-------+-----------+---------------+-----------+-------------+--------
 2022-03-02 21:03:58.373-05 |     0 |       348 | {51,140,91,3} |       140 |          91 |      3
 2022-03-02 21:04:28.373-05 |     0 |       348 | {51,140,91,3} |       140 |          91 |      3
 2022-03-02 21:04:58.373-05 |     0 |       348 | {51,140,91,3} |       140 |          91 |      3
 ...
 2022-03-02 21:03:58.373-05 |     0 |       349 | {51,252,91,3} |       252 |          91 |      3
 ...
 2022-03-02 21:03:58.373-05 |     0 |       350 | {51,253,91,3} |       253 |          91 |      3
 ...
```

Run an aggregation, querying values by label key. Each label key is expanded into its own column, which stores a foreign-key identifier as its value. That allows JOINs, aggregation and filtering by label keys and values. To get back the text a label id represents, use the val(field_id) function; this enables operations such as aggregating across all series with a particular label key. For example, to find the median of node_disk_io_now grouped by the job associated with it:

```sql
SELECT
    val(job_id) as job,
    percentile_cont(0.5) within group (order by value) AS median
FROM
    node_disk_io_now
WHERE
    time > now() - INTERVAL '5 minutes'
GROUP BY job_id;
```

```
    job     | median
------------+--------
 prometheus |      0
(1 row)
```

Query a metric's label set. The labels field in any metric row represents the full label set associated with the measurement, stored as an array of identifiers. To return the whole label set in JSON format, use the jsonb() function:

```sql
SELECT time, value, jsonb(labels) as labels FROM node_disk_io_now WHERE time > now() - INTERVAL '5 minutes';
```

The output:
"127.0.0.1:9100"} 2022-03-02 21:10:58.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:28.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:58.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:28.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:58.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:13:28.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:13:58.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:14:28.373-05 | 0 | {"job": "prometheus", "device": "dm-0", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:09:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:28.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:28.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:28.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:13:28.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:13:58.373-05 | 0 | {"job": "prometheus", "device": "dm-1", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:09:58.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:28.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:10:58.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:28.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:11:58.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"} 2022-03-02 21:12:28.373-05 | 0 | {"job": "prometheus", "device": "sda", "__name__": "node_disk_io_now", "instance": "127.0.0.1:9100"}查询node_disk_infopostgres=# SELECT * FROM prom_series.node_disk_info; series_id | labels | device | instance | job | major | minor -----------+------------------------+--------+----------------+------------+-------+------- 250 | {150,140,91,3,324,325} | dm-0 | 127.0.0.1:9100 | prometheus | 253 | 0 439 | {150,253,91,3,508,325} | sda | 127.0.0.1:9100 | prometheus | 8 | 0 440 | {150,258,91,3,507,325} | sr0 | 127.0.0.1:9100 | prometheus | 
```
       440 | {150,258,91,3,507,325} | sr0    | 127.0.0.1:9100 | prometheus | 11    | 0
       516 | {150,252,91,3,324,564} | dm-1   | 127.0.0.1:9100 | prometheus | 253   | 1
(4 rows)
```

Query with labels:

```sql
SELECT jsonb(labels) as labels, value FROM node_disk_info WHERE time < now();
```

Results:

```
                                                              labels                                                               | value
-----------------------------------------------------------------------------------------------------------------------------------+-------
 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} |     1
 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} |     1
 ...
 {"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} |   NaN
 ...
```
{"job": "prometheus", "major": "253", "minor": "0", "device": "dm-0", "__name__": "node_disk_info", "instance": "127.0.0.1:9100"} | 1通过命令进行查看她的指标视图postgres=# \d+ node_disk_info View "prom_metric.node_disk_info" Column | Type | Collation | Nullable | Default | Storage | Description -------------+--------------------------+-----------+----------+---------+----------+------------- time | timestamp with time zone | | | | plain | value | double precision | | | | plain | series_id | bigint | | | | plain | labels | label_array | | | | extended | device_id | integer | | | | plain | instance_id | integer | | | | plain | job_id | integer | | | | plain | major_id | integer | | | | plain | minor_id | integer | | | | plain | View definition: SELECT data."time", data.value, data.series_id, series.labels, series.labels[2] AS device_id, series.labels[3] AS instance_id, series.labels[4] AS job_id, series.labels[5] AS major_id, series.labels[6] AS minor_id FROM prom_data.node_disk_info data LEFT JOIN prom_data_series.node_disk_info series ON series.id = data.series_id;更多的查询,可以查看官网的教程计划删除promscale里是pg里配删除计划的是90天删除通过SELECT * FROM prom_info.metric;查看我们可以通过调整来修改TimescaleDB 包括一个后台作业调度框架,用于自动化数据管理任务,例如启用简单的数据保留策略。为了添加这样的数据保留策略,数据库管理员可以创建、删除或更改导致drop_chunks根据某个定义的计划自动执行的策略。要在超表上添加这样的策略,不断导致超过 24 小时的块被删除,只需执行以下命令:SELECT add_retention_policy('conditions', INTERVAL '24 hours');随后删除该策略:SELECT remove_retention_policy('conditions');调度程序框架还允许查看已调度的作业:SELECT * FROM timescaledb_information.job_stats;创建数据保留策略以丢弃超过 6 个月的数据块:SELECT add_retention_policy('conditions', INTERVAL '6 months');复制使用基于整数的时间列创建数据保留策略:SELECT add_retention_policy('conditions', BIGINT '600000');我们可以调整prometheus的数据,参考Data RetentionSELECT set_default_retention_period(180 * INTERVAL '1 day')如下postgres=# SELECT postgres-# set_default_retention_period(180 * INTERVAL '1 day'); set_default_retention_period ------------------------------ t (1 row)已经修改为180天我们打开prometheus和grafana都可以正常的查看grafana关于当前版本的压测在slack上,有一个朋友做了压测Promscale版本 2.30.0promscale 0.10.0Hello everyone, I am doing a promscale+timescaledb performance test with 1 promscale(8cpu 32GB memory), 1 timescaledb(postgre12.9+timescale2.5.2 with 16cpu 32G mem), 1 prometheus(8cpu 32G mem), simulate 2500 node_exporters( 1000 metrics/min * 2500 = 2.5 million metrics/min ) . 
> But it seams not stable

That is: a promscale+timescaledb load test with one promscale (8 CPU, 32GB RAM), one timescaledb (postgres 12.9 + timescale 2.5.2, 16 CPU, 32GB RAM) and one prometheus (8 CPU, 32GB RAM), simulating 2500 node_exporters (1000 metrics/min × 2500 = 2.5 million metrics/min) — and it did not appear stable. The symptoms:

```
there are warninigs in prometheus:
level=info ts=2022-03-01T12:36:33.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=35000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z
level=info ts=2022-03-01T12:36:34.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=35000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z
level=warn ts=2022-03-01T12:36:34.482Z caller=watcher.go:101 msg="[WARNING] Ingestion is a very long time" duration=5m9.705407837s threshold=1m0s
level=info ts=2022-03-01T12:36:35.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=35000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z
level=info ts=2022-03-01T12:36:40.365Z caller=throughput.go:76 msg="ingestor throughput" samples/sec=70000 metrics-max-sent-ts=2022-03-01T11:21:48.129Z

and errors in prometheus:
Mar 01 20:38:55 localhost start-prometheus.sh[887]: ts=2022-03-01T12:38:55.288Z caller=dedupe.go:112 component=remote level=warn remote_name=ceec38 url=http://192.168.105.76:9201/write msg="Failed to send batch, retrying" err="Post \"http://192.168.105.76:9201/write\": context deadline exceeded"

any way to increase the thoughput at current configuration?
```

The recommendation was:

```yaml
remote_write:
  - url: "http://promscale:9201/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '.*:.*'
        action: drop
    remote_timeout: 100s
    queue_config:
      capacity: 500000
      max_samples_per_send: 50000
      batch_send_deadline: 30s
      min_backoff: 100ms
      max_backoff: 10s
      min_shards: 16
      max_shards: 16
```

He then moved to pg 14.2 with the remote_write settings above and increased promscale's memory to 32G — still unstable. Note that this was a 2500-node load test. So at least for now, this test suggests the current Promscale releases are still at an early stage, and there is no official reliability benchmark. We look forward to a stable release.
2021-12-30
linuxea: Purging historical data from kube-prometheus
In k8s, pods are replaceable at any time: we usually care about the state of the whole school of fish rather than any one fish, so monitoring data is not kept for long — its reference value fades quickly. Still, sometimes you genuinely want to delete some metrics from Prometheus, either because they are unneeded or simply to free disk space. Time series in Prometheus can only be deleted through the administrative HTTP API, which is disabled by default (--web.enable-admin-api):

> As of Prometheus 2.0, the --web.enable-admin-api flag controls access to the administrative HTTP API which includes functionality such as deleting time series. This is disabled by default. If enabled, administrative and mutating functionality will be accessible under the /api/*/admin/ paths. The --web.enable-lifecycle flag controls HTTP reloads and shutdowns of Prometheus. This is also disabled by default. If enabled they will be accessible under the /-/reload and /-/quit paths.

To enable it, pass --web.enable-admin-api to Prometheus via the startup script or docker-compose file, depending on the installation method.

Delete time-series metrics

Delete all time series matching a label with the following syntax:

```bash
curl -X POST \
  -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__="linuxea.com"}'
```

To delete time series matching a certain job or instance, run:

```bash
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="node_exporter"}'
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="192.168.0.1:9100"}'
```

To delete all data from Prometheus, run:

```bash
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'
```

Note that the API calls above do not remove data immediately. The actual data remains on disk and is cleaned up in a future compaction; when old data is dropped is governed by the --storage.tsdb.retention option, e.g. --storage.tsdb.retention='365d' (by default, Prometheus keeps data for 15 days).

To completely remove the data deleted via delete_series, send a clean_tombstones API call:

```bash
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
```
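delete_series also accepts optional start and end parameters (Unix or RFC3339 timestamps), which is handy when you only want to drop a window of data rather than a whole series — a sketch:

```bash
# Drop node_exporter samples from one day only; data outside the window is untouched.
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="node_exporter"}&start=2021-12-01T00:00:00Z&end=2021-12-02T00:00:00Z'
```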
Deleting all historical data

We use kube-prometheus, where --web.enable-admin-api is added differently. Open prometheus-prometheus.yaml and add the enableAdminAPI: true field:

```yaml
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: true
  ruleSelector:
    matchLabels:
      prometheus: k8s
```

Then kubectl apply -f prometheus-prometheus.yaml, and once the pod is up, inspect it:

```yaml
  containers:
  - args:
    - '--web.console.templates=/etc/prometheus/consoles'
    - '--web.console.libraries=/etc/prometheus/console_libraries'
    - '--config.file=/etc/prometheus/config_out/prometheus.env.yaml'
    - '--storage.tsdb.path=/prometheus'
    - '--storage.tsdb.retention.time=15d'
    - '--web.enable-lifecycle'
    - '--storage.tsdb.no-lockfile'
    - '--web.route-prefix=/'
    - '--web.enable-admin-api'
    image: 'quay.io/prometheus/prometheus:v2.26.0'
```

Ready to delete:

```bash
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'
```

Check the size before deleting:

```
[root@linuxea.com prometheus-db]# ls
01FPZGK1S0BZ9FZMSVPYWA1F52  01FQPP5PJTEMT1BARN8ARJTBKB  01FR222TD3ZWYAZTNRJP5SCJ6M  01FR4VB8N4J0D2DY979PTHZQY1  wal
01FQ59ZR5Z9VNNGY3JSXP6F9AQ  01FQWFJAC72YCBKEN19F39PYCM  01FR46R88Q9NN7GR99MMGX13VG  01FR526TSKK8JTN493PTBQH0XC
01FQB3CCFVNSP890V5KE07YD3M  01FQYDBKTHH8KB8K5GT581BKGE  01FR4MFC9JX6NK81G4Q1W0CJNY  chunks_head
01FQGWS2M82Q23H0H2TD0JXK1Y  01FR0B54YSBJ0R54B5171P4Z7R  01FR4VB3HHRN0EE1GQCM6G0A77  queries.active
[root@linuxea.com nfs-k8s]# du -sh monitoring-prometheus-k8s-db-prometheus-k8s-0-pvc-ac911baf-f1f1-4bf3-a1f8-af2cb13c5d90/
7.1G monitoring-prometheus-k8s-db-prometheus-k8s-0-pvc-ac911baf-f1f1-4bf3-a1f8-af2cb13c5d90/
```

Find the svc:

```
[root@linuxea.com nfs-k8s]# kubectl -n monitoring get svc |grep prometheus-k8s
prometheus-k8s   NodePort   10.68.110.123   <none>   9090:30090/TCP   128d
```

Try a curl:

```
[root@linuxea.com nfs-k8s]# curl 10.68.110.123:9090
<a href="/graph">Found</a>.
```

Execute:

```bash
[root@linuxea.com manifests]# curl -X POST -g 'http://10.68.110.123:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~".+"}'
```

Then check the size again:

```
[root@linuxea.com manifests]# kubectl -n monitoring exec -it prometheus-k8s-0 -- /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ du -sh ./
1004.4M ./
/prometheus $ ls
01FR46R88Q9NN7GR99MMGX13VG  01FR4VB8N4J0D2DY979PTHZQY1  chunks_head
01FR4MFC9JX6NK81G4Q1W0CJNY  01FR526TSKK8JTN493PTBQH0XC  queries.active
01FR4VB3HHRN0EE1GQCM6G0A77  01FR592J0H8ANAWMFP2VVQ3NX9  wal
/prometheus $ du -sh ./
1006.6M ./
```

Fully remove the deleted data with a clean_tombstones API call:

```bash
[root@linuxea.com manifests]# curl -X POST -g 'http://10.68.110.123:9090/api/v1/admin/tsdb/clean_tombstones'
```

Check the size once more:

```
[root@linuxea.com manifests]# kubectl -n monitoring exec -it prometheus-k8s-0 -- /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ du -sh ./
986.6M ./
```

Once deletion is complete, turn --web.enable-admin-api back off by setting enableAdminAPI: false.

Further reading: tsdb-admin-apis; Add flag to enable prometheus web admin API and fixes #1215; kube-prometheus spec prometheus-prometheus.yaml
2021-12-16
linuxea: About CPUThrottlingHigh in kube-prometheus
The scenario we ran into: the CPUThrottlingHigh alert fires legitimately, yet the CPU of the object that triggered it is not high — it may even be idle. That made us question how necessary this alert is. In many setups it is modified or silenced, because the application is latency-insensitive and works fine even when throttled; the alert is cause-based rather than symptom-based, which is why its severity is only info. That does not make it a false positive, though, and silencing only hides the real underlying problem. The issue is still under discussion — particularly heatedly in kubernetes-mixin #108, and further in kubernetes #67577.

The expression:

```
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
  /
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
  > ( 25 / 100 )
```

The commonly suggested remedies boil down to:

1. Raise the alert threshold, or disable the alert
2. Remove or raise the CPU limits on the affected pods
3. Run kernel 4.18 or newer
4. Disable Kubernetes CFS quota entirely (kubelet --cpu-cfs-quota=false)

We chose to raise the threshold:

```bash
kubectl -n monitoring edit PrometheusRule prometheus-k8s-rules
```

and modify:

```yaml
- alert: CPUThrottlingHigh
  annotations:
    description: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/cputhrottlinghigh
    summary: Processes experience elevated CPU throttling.
  expr: |
    sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
      /
    sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
      > ( 75 / 100 )
  for: 15m
  labels:
    severity: info
```

Other related references:
- https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108
- https://github.com/prometheus-operator/prometheus-operator/issues/2063
- https://github.com/kubernetes/kubernetes/issues/67577
- https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
- https://bugzilla.kernel.org/show_bug.cgi?id=198197
- https://github.com/torvalds/linux/commit/512ac999d2755d2b7109e996a76b6fb8b888631d
- https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
- https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
- https://github.com/prometheus-operator/kube-prometheus/issues/214
- https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/453
- https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/b71dd35c6a1d509a1ee902eebe7afe943d8ee4b0/alerts/resource_alerts.libsonnet#L13
- https://www.youtube.com/watch?v=UE7QX98-kO0
- https://github.com/prometheus-operator/kube-prometheus/issues/861
- https://github.com/prometheus-operator/kube-prometheus/blob/main/jsonnet/kube-prometheus/components/alertmanager.libsonnet#L26-L42
- https://devops.stackexchange.com/questions/6494/prometheus-alert-cputhrottlinghigh-raised-but-monitoring-does-not-show-it
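Before silencing anything, it's worth running the same ratio by hand against a suspect workload to see how much throttling is actually happening. A sketch against the Prometheus API, assuming it is reachable at localhost:9090 (e.g., via port-forward) and filtering to a hypothetical namespace `default`:

```bash
# Per-container ratio of throttled CFS periods over the last 5m in one namespace.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
  'query=sum(increase(container_cpu_cfs_throttled_periods_total{namespace="default"}[5m])) by (container, pod)
         / sum(increase(container_cpu_cfs_periods_total{namespace="default"}[5m])) by (container, pod)'
```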
2019-05-25
linuxea: Monitoring a VIP and port with the Zabbix 4.2 simple check
In practice, we usually monitor certain ports — nginx, mariadb, php and so on. A port check is simple:

```
net.tcp.listen[80]
```

This fetches the state of TCP port 80; a trigger then handles alerting:

```
{HOST.NAME} nginx is not running    {nginx_80_port:net.tcp.listen[80].last()}<>1    Enabled
```

But that is not the point here. A typical environment has load balancing and HA redundancy, so VIPs — virtual IPs — are unavoidable: failover is completed by floating the VIP, moving the service's IP onto another node. And as a rule, every VIP comes with a port, as with lvs, haproxy or nginx. This article shows how to monitor VIP:PORT — and, notably, adding the template once on a single host is enough to check the VIP and its port.

Create the template

1. Configuration -> Templates -> Create template, and enter a name, e.g. Template App Telnet VIP.
2. Create an application, e.g. Telnet VIP.
3. Create the item, with type: Simple check.

Why a simple check rather than telnet? Compared with telnet, the simple check is much easier to use: in my view, no script is needed and a little configuration covers the requirement, so we use net.tcp.service. The official service_check_details and simple_checks pages are recommended reading.

```
Key: net.tcp.service[tcp,10.10.195.99,9200]
```

meaning: get the state of TCP port 9200 at IP address 10.10.195.99. Select Telnet VIP under Applications.

Create the trigger

0 means the service is down, 1 means it is running, so our trigger becomes:

```
{Template App Telnet VIP:net.tcp.service[tcp,10.10.195.99,9200].last(#3)}=0
```

If the last three fetched values are all 0, the alert fires, as in the screenshot. With that, simple monitoring of an IP and port with Zabbix is done.

Auto-discovery: see the Zabbix 4.2-based zabbix Discovery tutorial.

Further reading: service_check_details; simple_checks. More: zabbix monitoring tutorials; docker tutorials; zabbix-complete-works; linuxea:Zabbix-complete-works之Zabbix基础安装配置; linuxea:zabbix4.2新功能之TimescaleDB数据源测试
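Since simple checks are executed by the Zabbix server itself, it's worth confirming from the server that the VIP port is reachable at all before wiring up the item. A quick manual probe:

```bash
# The same reachability the simple check tests: can the server open TCP 9200 on the VIP?
nc -zv 10.10.195.99 9200
```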