China Unicom Cloud

Background

In cloud-native scenarios, enterprises face two core challenges:

Low resource utilization

Online services (such as web services and e-commerce) have pronounced peak and off-peak hours. During off-peak hours, more than 60% of their resources sit idle.

Offline services (such as AI training and big data analysis) have high resource requirements but low Quality of Service (QoS) requirements. The resources they reserve far exceed what they actually use.

Conflict between service isolation and temporal aggregation

Traditional Kubernetes clusters deploy services in different resource pools, resulting in resource fragmentation.

Solution

CSK Turbo builds a non-intrusive resource overselling system based on the Rubik hybrid deployment engine and dynamic overselling technology:

Colocation architecture:
- Complementary scheduling: Offline services run during the off-peak hours of online services, improving cluster CPU utilization by 30% and memory utilization by 10%.
- QoS guarantee mechanism: The Rubik engine uses its single-node resource orchestration, real-time interference detection, and health monitoring modules to suppress the performance interference that offline services cause to online services. Pods are classified into three levels: online (high QoS), offline (low QoS), and overselling (dynamic reuse). An admission controller enforces priority isolation between these levels.
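The three-level classification can be illustrated with a minimal Python sketch. It is not the Rubik or CSK Turbo implementation: the names `QosLevel`, `Pod`, `admit`, and `suppression_targets` are hypothetical, and the admission rule (overselling pods are charged to a reclaimed-resource pool, all other pods to the node's regular free capacity) is an assumption made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class QosLevel(Enum):
    ONLINE = "online"            # high QoS, latency-sensitive
    OFFLINE = "offline"          # low QoS, batch work
    OVERSELLING = "overselling"  # runs on dynamically reclaimed resources


@dataclass(frozen=True)
class Pod:
    name: str
    level: QosLevel
    cpu_request: float  # CPU cores


def admit(pod: Pod, free_cpu: float, oversold_cpu: float) -> bool:
    """Assumed rule: overselling pods draw from the reclaimed pool,
    every other pod from the node's regular free capacity."""
    pool = oversold_cpu if pod.level is QosLevel.OVERSELLING else free_cpu
    return pod.cpu_request <= pool


def suppression_targets(pods: List[Pod]) -> List[str]:
    """Names of pods that may be throttled when an online pod is disturbed."""
    return [p.name for p in pods if p.level is not QosLevel.ONLINE]


pods = [
    Pod("web", QosLevel.ONLINE, 2.0),
    Pod("train", QosLevel.OFFLINE, 4.0),
    Pod("batch", QosLevel.OVERSELLING, 1.0),
]
print(suppression_targets(pods))
print(admit(pods[2], free_cpu=0.0, oversold_cpu=1.5))
```

Keeping the level as an explicit field makes the isolation policy a pure function of the pod, so an admission check and an interference handler can share the same classification.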

Dynamic resource overselling technology:
- Prediction-driven: Resource profiles are built from historical usage data to identify the CPU and memory on each node that can safely be oversold, addressing the problem of temporal resource aggregation.
- Customized scheduler: Low-priority pods are scheduled against the amount of oversellable resources, removing the limits of traditional static resource allocation.
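As a rough illustration of how a usage profile could translate into oversellable capacity, the Python sketch below takes a high percentile of historical usage as the predicted peak and lends out whatever remains after a safety margin. The percentile-based estimate, the 10% margin, and the function names `percentile`, `oversellable`, and `fits` are all assumptions for this example; the source does not describe the actual prediction algorithm.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile (0 < q <= 100) of a non-empty sample list."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def oversellable(allocatable: float, usage_history: Sequence[float],
                 q: float = 95.0, safety_margin: float = 0.1) -> float:
    """Capacity that could be lent to low-priority pods:
    allocatable - predicted peak - margin * allocatable, never below zero."""
    predicted_peak = percentile(usage_history, q)
    return max(0.0, allocatable - predicted_peak - safety_margin * allocatable)


def fits(request: float, oversold_capacity: float, already_lent: float) -> bool:
    """Scheduler-side check: does a low-priority pod fit the remaining pool?"""
    return request <= oversold_capacity - already_lent


# Hypothetical node: 16 allocatable cores, ten samples of online CPU usage.
history = [4, 5, 6, 5, 4, 6, 7, 5, 4, 6]
pool = oversellable(16.0, history)
print(round(pool, 2))
print(fits(2.0, pool, already_lent=6.0))
```

In this sketch a lower percentile or a smaller margin frees more capacity but raises the risk that online services reclaim resources the low-priority pods are using, which is why such a pool would normally be paired with the QoS suppression described above.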

Benefits

Higher resource utilization: Cluster CPU utilization improves by 30% and memory utilization by 10%, reducing hardware procurement costs.
Service compatibility and stability: The solution supports the hybrid deployment of online web services and offline AI training, and is applicable to scenarios such as finance and AI inference. The real-time health check and automatic recovery mechanisms ensure that the QoS jitter rate of online services is less than 1%.
Optimized O&M efficiency: The pluggable architecture simplifies Kubernetes cluster reconstruction, and the dynamic overselling mechanism reduces manual intervention, lowering O&M costs.
Security compliance: Using technologies such as kernel-level CPU and memory isolation and network bandwidth suppression, the solution meets financial-level security standards.
