openFuyao v26.03 Release
April 3, 2026
The openFuyao community is dedicated to building an open software ecosystem for diverse computing clusters. It focuses on efficient collaboration between cloud-native and AI-native technologies so that clusters can make full use of their computing power.
Release v26.03 introduces new features and optimizes several existing ones. The additions and changes are described below.
InferNex: AI Inference Capability Fully Upgraded
InferNex, developed by SIG-ai-inference, delivers its first complete solution in v26.03. It features intelligent routing, an elastic scaling and decision system, observability, distributed KVCache management, and end-to-end one-click deployment. Average time to first token (TTFT) is reduced by 30%, and end-to-end latency (E2EL) is reduced by 10%. For detailed performance data, see Table 1:
Table 1 InferNex Performance
| Routing Strategy | Cluster Scenario | E2EL Improvement (avg) | TTFT Improvement (avg) |
|---|---|---|---|
| aggregate KVCache aware | Same-machine cluster | 9.15% | 37.35% |
| PD KVCache aware | Same-machine cluster | 22.08% | 27.73% |
| PD KVCache aware | Cross-machine cluster | 17.31% | 22.03% |
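As a rough illustration of what a "KVCache aware" strategy such as those in Table 1 can look like, the Python sketch below picks the Pod whose cache already holds the longest prefix of the incoming prompt and breaks ties by load. It is not InferNex's implementation: `PodState`, `block_hashes`, `matched_blocks`, `pick_pod`, and the block size of 4 tokens are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PodState:
    """Hypothetical per-Pod view that a router might maintain."""
    name: str
    cached_prefixes: set = field(default_factory=set)  # hashes of cached prompt prefixes
    active_requests: int = 0


def block_hashes(tokens, block_size=4):
    """Hash every block-aligned prefix of the prompt so prefixes compare cheaply."""
    hashes = []
    for end in range(block_size, len(tokens) + 1, block_size):
        hashes.append(hash(tuple(tokens[:end])))
    return hashes


def matched_blocks(pod, hashes):
    """Count how many leading blocks of the prompt the Pod already caches."""
    count = 0
    for h in hashes:
        if h not in pod.cached_prefixes:
            break
        count += 1
    return count


def pick_pod(pods, tokens, block_size=4):
    """Prefer the longest cached prefix; break ties by the lowest current load."""
    hashes = block_hashes(tokens, block_size)
    return max(pods, key=lambda p: (matched_blocks(p, hashes), -p.active_requests))
```

A Pod that can reuse cached prefix blocks skips part of the prefill work, which is why this kind of routing mainly affects TTFT.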
- Elastic Scaler: Provides resource management and decision-making for elastic scaling of distributed inference jobs. Decision algorithms are pluggable, and a tidal algorithm is built in. Scaling can be metric-driven or event-driven, and scaling from and to zero is supported. In particular, it supports scaling groups and the resources within a group based on user-defined policies. This addresses distributed inference with PD (prefill/decode) separation by scaling gracefully in whole PD groups at a fixed PD ratio.
- Hermes-router: Resolves compatibility issues with the KVCache-aware and bucketing strategies, and refines state awareness from the service level to the Pod level, improving routing strategy performance.
- Distributed KVCache: Provides pooled distributed KVCache storage and high-speed cross-instance KVCache transfer, improving cache reuse efficiency. It also adds a hot cache capability that improves inference performance under a fixed total memory usage. Related features and architecture optimizations have been merged into the upstream Mooncake community.
- Eagle-eye: Builds a systematic observability framework for AI inference scenarios. It adds static network metrics, such as host-side and card-side RDMA and host-side PCIe bandwidth for the A2/A3 generations, and device sub-health metrics, such as frequency reduction caused by overload.
- Inference Backend: Supports one-click deployment of cloud-native inference engines based on vLLM/vLLM-Ascend.
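
The group-based scaling described for Elastic Scaler can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scaler's actual algorithm or API: `PDGroupSpec`, `desired_groups`, `replicas_for`, the QPS metric, and the default limits are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PDGroupSpec:
    """Hypothetical shape of one PD group: fixed prefill and decode replica counts."""
    prefill_per_group: int
    decode_per_group: int


def desired_groups(current_qps, qps_per_group, min_groups=0, max_groups=8):
    """Metric-driven target: enough whole groups to cover the load.

    With min_groups=0, zero load scales the job down to zero groups.
    """
    if current_qps <= 0:
        return min_groups
    needed = math.ceil(current_qps / qps_per_group)
    return max(min_groups, min(max_groups, needed))


def replicas_for(groups, spec):
    """Scaling in whole groups keeps the prefill:decode ratio constant."""
    return {
        "prefill": groups * spec.prefill_per_group,
        "decode": groups * spec.decode_per_group,
    }
```

For example, `replicas_for(desired_groups(250, 100), PDGroupSpec(1, 3))` rounds the load up to three groups, so prefill and decode replicas grow together rather than independently.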
To learn more, join the SIG-ai-inference community discussion!
Installation and Deployment: Architecture Refactoring and Capability Enhancement
SIG-installation completed a major architecture upgrade in v26.03, with the following improvements:
- Kubernetes Version Compatibility: Supports installing K8s v1.28 and v1.34.
- Rich Plugin Installation Formats: Extensions now support chart-based plugin installation.
- Enhanced Developability: Lets secondary developers define pre- and post-operations for nodes, and adds health check interfaces for management and business clusters.
NPU DRA Plugin
This release completes the deep adaptation of Ascend NPU devices to the Kubernetes-native DRA (Dynamic Resource Allocation) architecture:
- Supports device filtering with CEL expressions based on NUMA node, chip model, topology group, and other metadata.
- Supports ResourceClaim/ResourceClaimTemplate resource requests.
- Injects devices into containers through CDI, enabling fine-grained resource scheduling.
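
To make the CEL-based filtering concrete, the Python sketch below builds a ResourceClaimTemplate manifest and can print it as JSON for `kubectl apply -f -`. It assumes the `resource.k8s.io/v1` API shape available in Kubernetes v1.34; older clusters use different API versions and field layouts. The driver domain, device class name, and the `model` and `numaNode` attribute names are hypothetical placeholders, not the names published by the openFuyao NPU DRA plugin.

```python
import json


def npu_claim_template(name, device_class, driver, model, numa_node):
    """Build a ResourceClaimTemplate manifest with a CEL device selector.

    The attribute names under `driver` are assumptions for illustration only.
    """
    expression = (
        f'device.attributes["{driver}"].model == "{model}" && '
        f'device.attributes["{driver}"].numaNode == {numa_node}'
    )
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {
            "spec": {
                "devices": {
                    "requests": [
                        {
                            "name": "npu",
                            "exactly": {
                                "deviceClassName": device_class,
                                "selectors": [{"cel": {"expression": expression}}],
                            },
                        }
                    ]
                }
            }
        },
    }


def to_json(manifest):
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

A Pod would then reference the template through `spec.resourceClaims` and the container's `resources.claims`, and the scheduler allocates only devices for which the CEL expression evaluates to true.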
UB Container Network Device Plugin
Enables applications to use URMA devices for communication, reducing communication latency and improving application performance.
UB Memory Pooling
- Memory Borrowing: Built on the UB memory pooling mechanism for bare-metal container scenarios. When node or NUMA memory usage reaches a preset threshold, memory borrowing is triggered and part of the memory pressure is offloaded to borrowed memory. This suits nodes that host large numbers of Pods or containers. Memory overcommitment and borrowing together improve node memory utilization and reduce hardware costs.
- Memory Sharing: Uses memory pooling to import and export memory blocks within a UBS Server cluster, enabling cross-node and cross-process memory sharing on bare metal. Directory isolation and a proxy layer ensure resource security and QoS. This suits scenarios that share large in-memory datasets across nodes, such as in-memory databases and big data analytics, where it avoids data replication and improves processing efficiency.
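
The threshold trigger described under Memory Borrowing can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not the UB memory pooling logic: `bytes_to_borrow`, the 80% trigger, and the 70% target that local usage is brought back to are hypothetical.

```python
def bytes_to_borrow(used_bytes, total_bytes, trigger_pct=80, target_pct=70):
    """Return how many bytes to offload to borrowed memory.

    Borrowing starts once usage reaches trigger_pct of local memory and aims
    to bring local usage back down to target_pct (both values are assumed).
    Integer arithmetic avoids floating-point rounding at the threshold.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if used_bytes * 100 < total_bytes * trigger_pct:
        return 0  # below the preset value: nothing to borrow
    target_bytes = total_bytes * target_pct // 100
    return max(0, used_bytes - target_bytes)
```

Using a target below the trigger is one way to avoid borrowing and returning memory repeatedly when usage hovers around the threshold. The same check could be applied per NUMA node instead of per host.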
Confidential Containers
Built on Kunpeng TEE technology with the complete K8s+containerd+Kata+QEMU+KVM+CoCo software stack, this feature enables confidential container deployment. It provides strong isolation similar to that of traditional virtual machines and prevents security issues between containers.
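In Kubernetes, a workload usually opts into a Kata-based runtime through the Pod's `runtimeClassName` field. The Python sketch below builds such a Pod manifest. It assumes a RuntimeClass has already been installed on the cluster. The RuntimeClass name is passed in by the caller because the name used by openFuyao is not stated here, and `confidential_pod` is a hypothetical helper.

```python
import json


def confidential_pod(name, image, runtime_class):
    """Build a Pod manifest that selects a confidential-container runtime.

    runtime_class must match a RuntimeClass object that exists on the cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "runtimeClassName": runtime_class,
            "containers": [{"name": name, "image": image}],
        },
    }


def to_json(manifest):
    """Serialize the manifest so it can be piped to `kubectl apply -f -`."""
    return json.dumps(manifest, indent=2)
```

Pods without a `runtimeClassName` keep using the cluster's default runtime, so confidential and regular containers can coexist on the same cluster.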
This article was first published by the openFuyao Community. Reproduction is permitted in accordance with the terms of the CC-BY-SA 4.0 License.
