Background

In cloud-native scenarios, enterprises face two core challenges:

  • Low resource utilization:

Online services (such as web services and e-commerce) have pronounced peak and off-peak hours. During off-peak hours, more than 60% of their resources can sit idle.

Offline services (such as AI training and big data analysis) have high resource requirements but low Quality of Service (QoS) requirements. There is a huge gap between resource reservation and actual usage.

  • Conflict between service isolation and temporal aggregation:

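Traditional Kubernetes clusters deploy online and offline services in separate resource pools. This keeps the services isolated, but it fragments resources: capacity that is idle in one pool cannot be used by services in the other.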

Solution

CSK Turbo builds a non-intrusive resource overselling system based on the Rubik hybrid deployment engine and dynamic overselling technology:

  1. Colocation architecture:
    • Complementary scheduling: Offline services run during the off-peak hours of online services, improving cluster CPU utilization by 30% and memory utilization by 10%.
    • QoS guarantee mechanism: The Rubik engine uses its single-node resource orchestration, real-time interference detection, and health monitoring modules to suppress the performance interference that offline services cause to online services. Pods are classified into three levels: online (high QoS), offline (low QoS), and overselling (dynamic reuse). An admission controller enforces priority isolation between these levels.
  2. Dynamic resource overselling technology:
    • Prediction-driven: Resource profiles are built from historical data and used to identify the CPU and memory resources on each node that can be oversold, which addresses the problem of temporal resource aggregation.
    • Customized scheduler: Low-priority pods are scheduled based on the amount of oversold resources, removing the limits of traditional static resource allocation.
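
The overselling flow described above can be illustrated with a minimal sketch. It is not the CSK Turbo or Rubik implementation: the formula (allocatable capacity minus a high percentile of historical usage minus a safety buffer), the 95th-percentile predictor, the 10% margin, and the names percentile, oversellable, Node, and pick_node are all assumptions chosen for illustration. Only CPU is modeled; memory would follow the same pattern.

```python
from dataclasses import dataclass
import math


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def oversellable(allocatable, usage_history, q=95, safety_margin=0.1):
    """Estimate capacity that could be lent to low-priority pods.

    Assumed formula (illustrative only):
        allocatable - predicted peak usage - safety buffer, floored at 0.
    """
    predicted_peak = percentile(usage_history, q)
    buffer = safety_margin * allocatable
    return max(0.0, allocatable - predicted_peak - buffer)


@dataclass
class Node:
    name: str
    cpu_allocatable: float     # cores
    cpu_usage_history: list    # cores used by online pods, one sample per interval
    cpu_oversold: float = 0.0  # cores already granted to overselling pods


def pick_node(nodes, cpu_request):
    """Return the fitting node with the most remaining oversellable CPU, or None."""
    best, best_free = None, -1.0
    for node in nodes:
        free = oversellable(node.cpu_allocatable, node.cpu_usage_history)
        free -= node.cpu_oversold
        if free >= cpu_request and free > best_free:
            best, best_free = node, free
    return best


if __name__ == "__main__":
    nodes = [
        Node("node-a", 16, [4, 5, 6, 5, 4, 6, 7, 5, 4, 6]),
        Node("node-b", 8, [6, 7, 7, 6, 7, 7, 6, 7, 7, 7]),
    ]
    chosen = pick_node(nodes, cpu_request=2.0)
    print(chosen.name if chosen else "no node fits")
```

In this sketch, a low-priority pod is placed only when its request fits within the predicted headroom of a node, so the decision depends on observed usage rather than on static reservations. A production predictor would also need to handle seasonality and react to sudden load changes, which is where interference detection and eviction of offline pods come in.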

Benefits

  • Higher resource utilization: CPU and memory utilization is significantly improved, reducing hardware procurement costs.
  • Service compatibility and stability: The solution supports the hybrid deployment of online web services and offline AI training, and is applicable to scenarios such as finance and AI inference. The real-time health check and automatic recovery mechanisms ensure that the QoS jitter rate of online services is less than 1%.
  • Optimized O&M efficiency: The pluggable architecture simplifies retrofitting existing Kubernetes clusters. The dynamic overselling mechanism reduces manual O&M intervention and, with it, O&M costs.
  • Security compliance: The solution uses technologies such as kernel-level CPU and memory isolation and network bandwidth suppression to meet finance-grade security standards.