Introducing Kubernetes Control Plane metrics in GKE
An essential aspect of operating any application is the ability to observe the health and performance of that application and of the underlying infrastructure to quickly resolve issues as they arise. Google Kubernetes Engine (GKE) already provides audit logs, operational logs, and metrics along with out-of-the-box dashboards and automatic error reporting to facilitate running reliable applications at scale. Using these logs and metrics, Cloud Operations provides the alerts, monitoring dashboards and a Logs Explorer to quickly detect, troubleshoot and resolve issues.
Introducing Kubernetes control plane metrics and why they matter
In addition to these existing sources of telemetry data, we are excited to announce that Kubernetes control plane metrics are now Generally Available. With GKE, Google fully manages the Kubernetes control plane; however, when troubleshooting issues it can be helpful to have access to certain metrics emitted by the Kubernetes control plane.
As part of our vision to make Kubernetes easier to use and easier to operate, these control plane metrics are directly integrated with Cloud Monitoring, so you don't need to manage any metric collection or scrape config.
For example, to understand the health of the API server, you can use metrics such as apiserver_request_total and apiserver_request_duration_seconds to track the load the API server is experiencing, the fraction of API server requests that return errors, and the response latency for requests received by the API server. In addition, apiserver_storage_objects can be very useful for monitoring the saturation of the API server, especially if you are using a custom controller. Break this metric down by the resource label to find out which resource or Kubernetes custom controller is causing the problem.
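As a rough illustration of the API server signals described above, the following PromQL sketches compute the share of requests answered with a 5xx status code and rank resources by stored object count. They assume that the code and resource labels from the upstream Kubernetes metric definitions are present on the ingested metrics, and "cluster-name" is a placeholder for your own cluster.

```promql
# Fraction of API server requests over the last 5 minutes that returned a 5xx code.
# Assumes the upstream "code" label (HTTP status) is preserved on the metric.
sum(rate(apiserver_request_total{cluster="cluster-name", code=~"5.."}[5m]))
/
sum(rate(apiserver_request_total{cluster="cluster-name"}[5m]))

# The ten resource types with the most stored objects.
# Assumes the upstream "resource" label is preserved on the metric.
topk(10, max by (resource) (apiserver_storage_objects{cluster="cluster-name"}))
```

A steadily climbing object count for a single custom resource is one hint that a custom controller may be creating objects faster than it cleans them up.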
When a pod is created it is initially placed in a "pending" state, indicating it hasn't yet been scheduled on a node. In a healthy cluster, pending pods are scheduled on a node relatively quickly, giving the workload the resources it needs to run. However, a sustained increase in the number of pending pods may indicate a problem scheduling those pods, which may be caused by insufficient resources or inappropriate configuration. Metrics like scheduler_pending_pods, scheduler_schedule_attempts_total, scheduler_preemption_attempts_total, scheduler_preemption_victims, and scheduler_scheduling_attempt_duration_seconds can alert you to potential scheduling issues, so you can act quickly to ensure sufficient resources are available for your pods. Using these metrics in combination will help you better understand the health of your cluster. For instance, if scheduler_preemption_attempts_total goes up, it means that there are higher-priority pods waiting to be scheduled and the scheduler is preempting some running pods to make room for them. However, if the value of scheduler_pending_pods is also increasing, this may indicate that you don't have enough resources to schedule the higher-priority pods.
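One way to watch the two signals side by side is sketched below in PromQL. The queue label is assumed from the upstream Kubernetes scheduler metric definition, and "cluster-name" is a placeholder.

```promql
# Pods currently waiting to be scheduled, per scheduler queue.
# Assumes the upstream "queue" label is preserved on the metric.
sum by (queue) (scheduler_pending_pods{cluster="cluster-name"})

# Preemption attempts per second, averaged over the last 5 minutes.
sum(rate(scheduler_preemption_attempts_total{cluster="cluster-name"}[5m]))
```

If both series rise together, preemption is probably not freeing enough capacity, which points toward adding resources rather than adjusting priorities.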
If the Kubernetes scheduler is still unable to find a suitable node for a pod, then the pod will eventually be marked as unschedulable. Kubernetes control plane metrics give you visibility into pod scheduling errors and unschedulable pods. A spike in either means that the Kubernetes scheduler isn't able to find an appropriate node on which to run many of your pods, which may ultimately impair the performance of your application. In many cases, a high rate of unschedulable pods will not resolve itself until you take some action to address the underlying cause. A good place to start troubleshooting is to look for recent FailedScheduling events. (If you have GKE system logs enabled, then all Kubernetes events are available in Cloud Logging.) These FailedScheduling events include a message (for instance, "0/6 nodes are available: 6 Insufficient cpu.") that describes exactly why the pod could not be scheduled on any node, giving you guidance on how to address the problem.
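A minimal PromQL sketch for spotting such spikes follows. It assumes the result label from the upstream scheduler metric definition, with the values "unschedulable" and "error", is present on the ingested metric; "cluster-name" is a placeholder.

```promql
# Scheduling attempts per second that did not place a pod, split by outcome.
# Assumes the upstream "result" label and its "unschedulable" and "error" values.
sum by (result) (
  rate(scheduler_schedule_attempts_total{cluster="cluster-name", result=~"unschedulable|error"}[5m])
)
```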
A final example: if you see that scheduling jobs is very slow, one possible cause is that a third-party webhook is introducing significant latency, causing the API server to take a long time to process the requests that create the job. Kubernetes control plane metrics such as apiserver_admission_webhook_admission_duration_seconds can expose the admission webhook latency, helping you identify the root cause of slow job scheduling and mitigate the issue.
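To see which webhook is slow, a query along these lines may help. It is a sketch that assumes the name label from the upstream metric definition identifies the webhook; "cluster-name" is a placeholder.

```promql
# 99th percentile admission latency per webhook over the last 5 minutes.
# Assumes the upstream "name" label (webhook name) is preserved on the metric.
histogram_quantile(
  0.99,
  sum by (name, le) (
    rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster="cluster-name"}[5m])
  )
)
```

Aggregating by le before calling histogram_quantile keeps the bucket boundaries intact, so the quantile is computed per webhook rather than per individual series.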
Displayed in context
Not only are we making these additional Kubernetes control plane metrics available, we're also excited to announce that all of these metrics are displayed in the Kubernetes Engine section of the Cloud Console, making it easy to identify and investigate issues in-context as you're managing your GKE clusters.
To view these control plane metrics, go to the Kubernetes clusters section of the Cloud Console, select the "Observability" tab, and select "Control plane".
Since all Kubernetes control plane metrics are ingested into Cloud Monitoring, you can create alerting policies in Cloud Alerting so you're notified as soon as something needs your attention.
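As an illustration, and assuming your alerting policy accepts a PromQL condition, an expression like the one below could serve as a starting point. The threshold of 10 is a hypothetical value chosen for the example, the queue label value is assumed from the upstream scheduler metric definition, and "cluster-name" is a placeholder.

```promql
# Hypothetical alert condition: more than 10 pods sitting in the unschedulable queue.
# The threshold is illustrative only; tune it to your cluster's normal behavior.
sum(scheduler_pending_pods{cluster="cluster-name", queue="unschedulable"}) > 10
```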
PromQL compatible
When you enable Kubernetes control plane metrics for your GKE clusters, all metrics are collected using Google Cloud Managed Service for Prometheus. This means the metrics are sent to Cloud Monitoring in the same GCP project as your Kubernetes cluster and can be queried using PromQL via the Cloud Monitoring API and Metrics explorer.
For example, you can monitor any spikes in the 99th percentile API server response latency using this PromQL query:
sum by (instance, verb) (histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket{cluster="cluster-name"}[5m])))
Third-party support
If you monitor your GKE cluster with a third-party observability tool, that tool can ingest these Kubernetes control plane metrics using the Cloud Monitoring API.
For example, if you are a Datadog customer and you have enabled Kubernetes control plane metrics for your GKE cluster, Datadog provides enhanced visualizations that include Kubernetes control plane metrics from the API server, scheduler, and controller manager.
Pricing
All Kubernetes control plane metrics are charged at the standard price for metrics ingested from Google Cloud Managed Service for Prometheus.
Source: Gimasys