Tovi: Modernizing HPC Systems Throughout Academia, Industry, and Government
Modernizing HPC infrastructure allows all services to run in a single environment, which increases efficiency for every service, not just HPC workloads, and reduces management overhead for system administrators.
High performance computing (HPC) systems are ubiquitous in academic, industrial, and government organizations and support applications ranging from basic research to operational systems. Depending on the application, an HPC system may consist of a single compute server (i.e., node) or of hundreds or even thousands of nodes networked together. HPC continues to be adopted rapidly as a game-changing technology and decision-making aid. In addition, with the ever-growing volume and velocity of data across the battlefield and the increasing use of Artificial Intelligence (AI) in theater, HPC systems are expected to be deployed at the tactical edge as well. This will drive even broader adoption as HPC systems move beyond stationary emplacements to mobile, forward-deployable platforms.
While HPC systems leverage the latest software and hardware to achieve remarkable processing capabilities, current HPC infrastructure has a fundamental problem: HPC workload management solutions (e.g., SLURM, HTCondor, IBM LSF) are not compatible with modern microservices architectures. Microservices make applications easier to build by breaking them into smaller components that work together. They have replaced monolithic architectures in scalable software, which is composed of smaller applications, typically packaged as containers, that communicate through language-independent interfaces. Indeed, containers are an increasingly popular way to make software run reliably when moved from one computing environment to another. Container use has exploded in recent years, and this trend is expected to continue for the foreseeable future. The majority of containers are orchestrated, and Kubernetes is the most popular container orchestration platform available. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Managed Kubernetes services have grown significantly in recent years and remain the standard for managing container environments in the cloud.
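As a rough illustration of what that orchestration looks like in practice, the sketch below uses the official Kubernetes Python client to declare a small containerized service and hand it to the cluster, which deploys it and keeps the requested number of copies running. The image, replica count, and namespace are placeholders chosen for the example, and it assumes a cluster reachable through a local kubeconfig.

```python
# Minimal sketch: deploying a containerized service on Kubernetes with the
# official Python client (pip install kubernetes). Assumes a cluster is
# reachable via the local kubeconfig; image and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # read credentials from ~/.kube/config

container = client.V1Container(
    name="demo-service",
    image="nginx:1.25",  # any containerized service image
    ports=[client.V1ContainerPort(container_port=80)],
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="demo-service"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # Kubernetes keeps three copies running and replaces failures
        selector=client.V1LabelSelector(match_labels={"app": "demo-service"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "demo-service"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```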
Although Kubernetes was built to orchestrate applications composed of loosely coupled, containerized services, the applications that run on Kubernetes are very different from those that run on HPC systems. HPC applications are designed to run to completion while using resources optimally, whereas Kubernetes applications usually run continuously. Microservices users typically adopt containers for speed and modularity, whereas HPC users care more about portability and the ability to encapsulate their software in containers. And while containers are extremely valuable for a wide range of HPC applications, making the switch can be difficult: HPC applications are hard to deploy on Kubernetes.

Tovi was designed to modernize HPC clusters so that HPC users can benefit from Kubernetes orchestration. It provides a robust and reliable way to run HPC workloads with Kubernetes, enabling organizations to reduce costs while maintaining maximum flexibility. Because Tovi is built as a Kubernetes application, its users gain Kubernetes container orchestration while keeping all the benefits of a reliable and robust HPC workload manager. Tovi is simple, easy to use, and designed with the demands of a wide range of HPC applications in mind. It lets users request resources instead of requesting time, and it handles the scheduling to maximize resource utilization and minimize wait times. Moreover, since Tovi is built as a Kubernetes application, adding and removing resources, updating system settings, and customizing the deployment are painless and often require little or no downtime.
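The resource-centric model can be pictured with standard Kubernetes objects. The sketch below (an illustration only, not Tovi's actual interface) expresses a run-to-completion solver as a Kubernetes Job whose container requests a fixed amount of CPU, memory, and GPU up front, leaving the scheduler to decide when and where it runs. The image, command, namespace, and resource figures are placeholders, and the GPU resource assumes the NVIDIA device plugin is installed on the cluster.

```python
# Illustrative sketch only (not Tovi's actual interface): a run-to-completion
# HPC-style job with explicit resource requests, submitted as a Kubernetes Job
# through the official Python client. Image, command, namespace, and resource
# figures are placeholders.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="cfd-solver"),
    spec=client.V1JobSpec(
        backoff_limit=0,  # run once; do not retry on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="solver",
                        image="registry.example.com/cfd-solver:latest",
                        command=["mpirun", "-np", "16", "./solver", "case.cfg"],
                        # Resources are requested up front; the scheduler decides
                        # when and where the job runs -- no wall-clock reservation.
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "16", "memory": "64Gi"},
                            limits={"cpu": "16", "memory": "64Gi",
                                    "nvidia.com/gpu": "1"},  # requires NVIDIA device plugin
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="hpc", body=job)
```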