Background
In cloud-native scenarios, enterprises face two core challenges:
- Low resource utilization:
  - Online services (such as web services and e-commerce) have pronounced peak and off-peak hours. During off-peak hours, the resource idle rate exceeds 60%.
  - Offline services (such as AI training and big data analysis) have high resource requirements but low Quality of Service (QoS) requirements. There is a large gap between the resources they reserve and the resources they actually use.
- Conflict between service isolation and temporal aggregation:
  - Traditional Kubernetes clusters deploy different types of services in separate resource pools, which leads to resource fragmentation.
Solution

CSK Turbo builds a non-intrusive resource overselling system based on the Rubik hybrid deployment engine and dynamic overselling technology:
- Colocation architecture:
  - Complementary scheduling: Offline services run during the off-peak hours of online services, improving cluster CPU utilization by 30% and memory utilization by 10%.
  - QoS guarantee mechanism: The Rubik engine uses its single-node resource orchestration, real-time interference detection, and health monitoring modules to suppress the performance interference that offline services cause to online services. Pods are classified into three levels: online (high QoS), offline (low QoS), and overselling (dynamic reuse). An admission controller enforces priority isolation.
- Dynamic resource overselling technology:
  - Prediction algorithm-driven: Resource profiles are built from historical data to identify the CPU and memory resources on each node that can be oversold, which solves the problem of temporal resource aggregation.
  - Customized scheduler: Low-priority pods are scheduled based on the amount of oversellable resources, breaking the limit of traditional static resource allocation.
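
The idea behind prediction-driven overselling and the customized scheduler can be illustrated with a minimal Python sketch. It is not the actual CSK Turbo or Rubik algorithm, which this document does not describe. The sketch assumes that a node's predicted online peak is the 95th percentile of its historical CPU usage, that 10% of allocatable CPU is held back as a safety margin, and that a low-priority pod goes to the node with the most free oversellable CPU. All names (`oversellable_cpu`, `Node`, `pick_node`) and both parameters are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


def percentile(samples: List[float], q: float) -> float:
    """Nearest-rank percentile of samples, with q in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def oversellable_cpu(allocatable: float, online_usage_history: List[float],
                     q: float = 95.0, safety_margin: float = 0.1) -> float:
    """CPU cores that could be lent to low-priority pods (assumed model).

    Predicted online peak = q-th percentile of historical usage.
    A fraction of allocatable CPU is always kept in reserve.
    """
    predicted_peak = percentile(online_usage_history, q)
    reserve = allocatable * safety_margin
    return max(0.0, allocatable - predicted_peak - reserve)


@dataclass
class Node:
    name: str
    allocatable_cpu: float
    online_usage_history: List[float]  # CPU cores used by online pods over time
    oversold_cpu_in_use: float = 0.0   # cores already lent to low-priority pods


def pick_node(nodes: List[Node], cpu_request: float) -> Optional[Node]:
    """Return the node with the most free oversellable CPU that fits the request."""
    best, best_free = None, -1.0
    for node in nodes:
        free = (oversellable_cpu(node.allocatable_cpu, node.online_usage_history)
                - node.oversold_cpu_in_use)
        if free >= cpu_request and free > best_free:
            best, best_free = node, free
    return best


if __name__ == "__main__":
    history = [4, 5, 6, 8, 7, 5, 4, 6, 5, 6]
    nodes = [Node("node-a", 16, history),
             Node("node-b", 16, history, oversold_cpu_in_use=5.0)]
    chosen = pick_node(nodes, cpu_request=4.0)
    print(chosen.name if chosen else "no node fits")
```

In this sketch a pod stays unscheduled (`pick_node` returns `None`) when no node has enough oversellable CPU. A production scheduler would also need to account for memory and reclaim oversold resources when online load rises.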
Benefits
- Higher resource utilization: CPU and memory utilization is significantly improved, reducing hardware procurement costs.
- Service compatibility and stability: The solution supports the hybrid deployment of online web services and offline AI training, and is applicable to scenarios such as finance and AI inference. The real-time health check and automatic recovery mechanisms ensure that the QoS jitter rate of online services is less than 1%.
- Optimized O&M efficiency: The pluggable architecture simplifies the retrofitting of existing Kubernetes clusters. The dynamic overselling mechanism reduces manual O&M intervention and lowers O&M costs.
- Security compliance: The solution uses technologies such as kernel-level CPU and memory isolation and network bandwidth suppression to meet financial-grade security standards.