Scaling to zero on Google Kubernetes Engine with KEDA

For developers and enterprises running applications on Google Kubernetes Engine (GKE), scaling deployment infrastructure to zero when idle can provide significant financial savings. GKE's Cluster Autoscaler efficiently manages the size of your node pools, but for applications that require a full shutdown and startup (scaling all the way to zero and back up), you'll need an alternative, as GKE doesn't provide scale-to-zero functionality out of the box. This is important for applications with intermittent workloads or variable traffic patterns. In this blog post, Gimasys will show you how to integrate the open-source Kubernetes Event-driven Autoscaler (KEDA) to achieve this. With KEDA, you can tie your costs directly to your needs, paying only for the resources you actually use.

Why scale to zero?

Minimizing costs is the primary driver for scaling to zero, and it applies to a wide variety of scenarios. It is particularly valuable when dealing with:

  • GPU-intensive workloads: AI/ML workloads often require powerful GPUs, which are expensive to keep running even when idle.
  • Applications with predictable downtime: Internal tools used only during business hours or on specific days of the week can be scaled down outside those windows.
  • Seasonal applications: Scale to zero during the off-season for applications with predictable periods of low activity.
  • On-demand staging environments: Replicate production environments for testing and validation, scaling them to zero after testing is complete.
  • Development, demo, and proof-of-concept environments:
    • Short-term demonstrations: Showcase applications or features to clients or stakeholders, scaling down resources after the demonstration.
    • Temporary proof-of-concept deployments: Test new ideas or technologies in a live environment, scaling to zero after evaluation.
    • Development environments: Spin up resources for testing, code reviews, or feature branches, and scale them down to zero when not needed, optimizing costs for temporary workloads.
  • Event-driven applications:
    • Microservices with sporadic traffic: Scale individual services to zero when they are idle and automatically scale them up when requests arrive, optimizing resource utilization for unpredictable traffic patterns.
    • Serverless functions: Execute code in response to events without managing servers, automatically scaling to zero when inactive.
  • Disaster recovery and business continuity: Maintain a minimal set of core resources in a standby state, ready to scale up rapidly in case of a disaster, minimizing costs while ensuring business continuity.

Introducing KEDA for GKE

KEDA is an open-source, Kubernetes-native solution that enables you to scale deployments based on a variety of metrics and events. KEDA can trigger scaling actions based on external events such as message queue depth or incoming HTTP requests. And unlike the current implementation of Horizontal Pod Autoscaler (HPA), KEDA supports scaling workloads to zero, making it a strong choice for handling intermittent jobs or applications with fluctuating demand.
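
To make that concrete, here is a minimal ScaledObject sketch covering the business-hours case from the list above, using KEDA's cron scaler. The names and schedule are illustrative, and it assumes KEDA is already installed in the cluster (for example, via its Helm chart):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: internal-tool-scaler    # hypothetical name
spec:
  scaleTargetRef:
    name: internal-tool         # hypothetical Deployment to scale
  minReplicaCount: 0            # unlike HPA, KEDA can go all the way to zero
  maxReplicaCount: 3
  triggers:
    - type: cron                # one of KEDA's many built-in scalers
      metadata:
        timezone: Asia/Ho_Chi_Minh
        start: 0 8 * * 1-5      # scale up at 08:00, Monday through Friday
        end: 0 18 * * 1-5       # scale back down at 18:00
        desiredReplicas: "3"
```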

Use cases

Let's explore two common scenarios where KEDA's ability to scale to zero shines:

  • Scaling a Pub/Sub worker
    • Scenario: A deployment processes messages from a Pub/Sub topic. When no messages are available, scaling down to zero saves resources and costs.
    • Solution: KEDA's Pub/Sub scaler monitors the message queue and triggers scaling actions accordingly. By configuring a ScaledObject resource, you can specify that the deployment scales down to zero replicas when the queue is empty (see the first sketch after this list).
  • Scaling a GPU-dependent workload, such as an Ollama deployment for LLM serving
    • Scenario: An Ollama-based large language model (LLM) performs inference tasks. To minimize GPU usage and costs, the deployment needs to scale down to zero when there are no inference requests.
    • Solution: Combining the KEDA HTTP Add-on (HTTP-KEDA, currently in beta) with Ollama enables scale-to-zero functionality. HTTP-KEDA scales the deployment based on HTTP request metrics, while Ollama serves the LLM (see the second sketch below).
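
As a rough sketch of the first scenario, a ScaledObject using KEDA's gcp-pubsub scaler might look like the following. The deployment, subscription, and target value are hypothetical, and it assumes KEDA can authenticate to Pub/Sub (for example, through a TriggerAuthentication backed by Workload Identity); field names follow recent KEDA releases:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: pubsub-worker-scaler          # hypothetical name
spec:
  scaleTargetRef:
    name: pubsub-worker               # hypothetical Deployment consuming the topic
  minReplicaCount: 0                  # scale to zero when the subscription is empty
  maxReplicaCount: 5
  pollingInterval: 15                 # seconds between queue-depth checks
  cooldownPeriod: 300                 # idle time before dropping the last replica
  triggers:
    - type: gcp-pubsub
      metadata:
        subscriptionName: worker-sub  # hypothetical subscription
        mode: SubscriptionSize        # scale on the number of undelivered messages
        value: "5"                    # target of roughly 5 messages per replica
      authenticationRef:
        name: keda-gcp-auth           # assumed TriggerAuthentication (e.g., Workload Identity)
```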
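
The second scenario uses the HTTP Add-on's own HTTPScaledObject resource instead. Another hedged sketch: the names and host are placeholders, Ollama's default port of 11434 is assumed, and because the add-on is still in beta, field names can vary between versions:

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: ollama                   # hypothetical name
spec:
  hosts:
    - ollama.example.com         # requests for this host are proxied and counted
  scaleTargetRef:
    name: ollama                 # hypothetical GPU-backed Ollama Deployment
    kind: Deployment
    apiVersion: apps/v1
    service: ollama              # Service in front of the Deployment
    port: 11434                  # Ollama's default listen port
  replicas:
    min: 0                       # no requests means no replicas and no idle GPU cost
    max: 2
  scaledownPeriod: 300           # seconds without traffic before scaling back to zero
```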

Get started with KEDA on GKE

KEDA offers a powerful and flexible solution for achieving scale-to-zero functionality on GKE. By leveraging KEDA's event-driven scaling capabilities, you can optimize resource utilization, minimize costs, and improve the efficiency of your Kubernetes deployments. Do validate each use case first, though, because scaling to zero can affect workload performance. When an application scales to zero, no instances are running, so the first incoming request must wait for a new instance to start; this cold start increases latency. State management is another consideration: when instances are terminated, any in-memory state is lost.
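
If cold-start latency is a concern, KEDA's ScaledObject exposes a few knobs for managing the tradeoff. In this hedged sketch (hypothetical names, illustrative values), idleReplicaCount lets the workload drop to zero only when the trigger reports no activity at all, minReplicaCount keeps a warm floor while it is active, and cooldownPeriod controls how long KEDA waits before scaling back down:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: latency-aware-scaler       # hypothetical name
spec:
  scaleTargetRef:
    name: my-app                   # hypothetical Deployment
  idleReplicaCount: 0              # replicas when there are no events (0 is the only supported value)
  minReplicaCount: 1               # warm floor while events are flowing, hiding cold starts
  maxReplicaCount: 10
  cooldownPeriod: 600              # wait 10 minutes of inactivity before going idle
  triggers:
    - type: gcp-pubsub
      metadata:
        subscriptionName: my-sub   # hypothetical subscription
```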

Conclusion

Auto-scaling to zero on Google Kubernetes Engine using KEDA not only optimizes resource utilization but also delivers significant economic benefits. By automatically shutting down pods when there is no load, enterprises can reduce operational costs and make better use of their infrastructure. KEDA has proven to be a useful tool for building flexible, adaptive, and cost-effective cloud-native systems.

As a senior Google partner in Vietnam, Gimasys has more than 10 years of experience consulting on and implementing digital transformation for 2,000+ domestic corporations. Typical customers include Jetstar, Dien Quan Media, Heineken, Jollibee, Vietnam Airlines, HSC, SSI, and others.

Gimasys is currently a strategic partner of many major technology companies around the world, such as Salesforce, Oracle NetSuite, Tableau, and MuleSoft.

Contact Gimasys, a Google Cloud Premier Partner, for advice on strategic solutions tailored to the specific needs of your business:

  • Email: gcp@gimasys.com
  • Hotline: 0974 417 099