Tovi Solution Brief: Modernizing High Performance Computing Throughout Academia, Industry, and Government

Tovi is an intelligent resource manager and job scheduler that enables HPC workloads on Kubernetes for embedded, on-premise, cloud, and hybrid infrastructures. Tovi is easy to use, framework-independent, and highly customizable, and it helps HPC systems unlock previously underutilized resources. Tovi radically improves the speed of model development and data analysis for applications with high computational requirements, including AI/ML, GEOINT, video/image processing, training, high-fidelity simulations, weapons/navigation systems, autonomous platforms, and C5ISR.

Tovi was designed to address key capability gaps in software modernization. Specifically, Tovi enables HPC software infrastructure to modernize to Kubernetes, the de facto standard for modern software architecture. Kubernetes is quickly becoming the foundation for all software in the DoD, from jets, bombers, and ships to weapons, space, and nuclear systems, and it can reside at the edge, on premise, and in the cloud. Indeed, the DoD Enterprise DevSecOps reference design mandates the use of Kubernetes clusters and other open-source technologies to achieve DoD-wide continuous Authority to Operate. However, current HPC software infrastructure is fundamentally incompatible with Kubernetes container orchestration. Therefore, we developed Tovi, an intelligent resource manager and job scheduler that enables HPC workloads on Kubernetes for embedded, on-premise, cloud, and hybrid infrastructures. Tovi radically improves the speed of model development and data analysis for applications with high computational requirements. Indeed, Tovi addresses the Army’s growing need for high performance computational capabilities to exploit large data sets, compute complex AI/ML algorithms, and enhance operational activities and training readiness. Tovi will also improve power efficiency, edge computing, and synthetic environments, and it will enable AI/ML to reduce the cognitive burden on humans and improve overall performance through human-machine teaming.

Modernizing HPC infrastructure to Kubernetes allows all services to be run under one environment, which increases efficiency for all services, not just HPC workloads, and decreases management overhead. In addition, Tovi allows platforms to retain their existing data pipelines. Tovi is exceptionally flexible and compatible with a variety of other tools, enabling customization for specific deployment needs. Rather than forcing systems to use resources a certain way, Tovi helps unlock resources that were previously underutilized. Tovi modernizes any parallel computing system, from on-premise research clusters to edge systems in the military and beyond.

There is an urgent need to modernize HPC platforms to maximize resource utilization and to simultaneously benefit from all the advantages of Kubernetes and containerization; a microservices architecture facilitates rapid prototyping and deployment while increasing security. In addition, current HPC systems are not scalable, modular, or fault tolerant. Furthermore, current HPC workload management solutions are difficult to use and even harder to maintain in production environments. By enabling HPC workloads on Kubernetes, Tovi unlocks all the advantages of Kubernetes for all military and industrial systems with high computational requirements. In addition, Tovi will improve security, enable new applications, help reduce costs, and minimize vendor lock-in. Another advantage of Tovi is the speed at which HPC application updates can be deployed: rolling updates can be implemented without downtime, making HPC applications more stable. In summary, by modernizing HPC systems to Kubernetes, Tovi enables flexibility, power, and scalability that will pay off in the long term by radically reducing costs.

HPC is used in so many different environments and applications because of the capabilities of commodity, commercial off-the-shelf (COTS) hardware (CPUs, GPUs, FPGAs). HPC in the military has been growing and continues to grow. The scale of impact is massive: Tovi can be used for any application with high computational requirements where resources are aggregated for efficient parallel processing to perform advanced mathematical calculations and data processing.

Tovi is easy to use and highly customizable. Many HPC workload solutions are designed to be “end-to-end” pipelines, locking you into a workflow and/or ecosystem with little flexibility or customization. Tovi was designed with the Unix philosophy of “do one thing, and do it well”: Tovi efficiently allocates jobs to shared resources. Other job schedulers are difficult to use, and even harder to manage in dynamic production systems. Tovi is simple, easy to use, and designed for a wide range of HPC applications. Tovi is built as a Kubernetes application, which means adding and removing resources, updating system settings, and customizing deployment is painless and often requires little or no downtime. Tovi lets users request resources instead of time and handles the scheduling to maximize resource utilization and minimize wait times. Tovi also reduces HPC administrative costs by reducing the required headcount and allowing previously distinct services to be consolidated under a single Kubernetes-based system.
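
To make the "resources, not time" model concrete, the sketch below shows what a job submission might look like from the user's point of view. This is a minimal illustration written in Python for this brief; the field names, image reference, and structure are assumptions, not Tovi's actual request schema.

```python
# Illustrative only: these field names are assumptions for this brief,
# not Tovi's actual request schema.
job_request = {
    "name": "train-detector",
    "image": "registry.example.com/ml/detector:latest",  # hypothetical container image
    "command": ["python", "train.py", "--epochs", "50"],
    "resources": {        # the user asks for resources...
        "cpus": 16,
        "gpus": 2,
        "memory_gb": 64,
    },
    # ...but not for a wall-clock reservation: the scheduler decides when and
    # where the job runs as the requested resources become available.
}
```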

At the heart of the Tovi system is an intelligent job scheduler and resource manager that allows GPU, CPU, and memory allocation across a network of devices. Tovi provides a seamless user interface to pool resources from a variety of hardware, including servers, idle workstations, and even edge devices. Tovi is a distributed application that uses a client-server architecture to connect remote users with shared resources. The Tovi Server application is deployed to a Kubernetes cluster and handles job scheduling by efficiently tracking and allocating processes as the required compute resources become available. The Tovi Server has three major components: the receiver, the manager, and the scheduler. The receiver listens for requests from the Tovi Client application, authenticates the user, and forwards the request to the manager. The manager processes the request by type, copying the submitted files to the appropriate locations. Job submissions are added to a queue, which is managed by the scheduler. The scheduler scans cluster machines at regular intervals, allocating jobs from the queue as resources become available; the scheduler is highly customizable, so this behavior can be configured as desired. The Tovi Client can be used as a command line interface (CLI) tool or via a web application. Remote users can create and manage projects, upload data to a shared directory, and submit jobs to be run on the server. Requests are defined by a special tovi.yaml file, which can encode complex experiment pipelines. The web app expands this functionality by providing a graphical user interface with which users can browse libraries of datasets and results.
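
To picture the scheduling behavior described above, the following is a minimal sketch of a "scan at intervals, allocate as resources free up" loop, assuming simplified Node and Job structures invented for this brief; it is illustrative only and not Tovi's implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpus: int
    free_gpus: int
    free_mem_gb: int

@dataclass
class Job:
    name: str
    cpus: int
    gpus: int
    mem_gb: int

def fits(job: Job, node: Node) -> bool:
    """A job fits on a node if the node has enough free CPUs, GPUs, and memory."""
    return (job.cpus <= node.free_cpus
            and job.gpus <= node.free_gpus
            and job.mem_gb <= node.free_mem_gb)

def schedule_once(queue: deque, nodes: list) -> None:
    """One scheduling pass: place each queued job on the first node with capacity,
    or leave it queued for the next pass."""
    for _ in range(len(queue)):
        job = queue.popleft()
        for node in nodes:
            if fits(job, node):
                node.free_cpus -= job.cpus
                node.free_gpus -= job.gpus
                node.free_mem_gb -= job.mem_gb
                print(f"allocated {job.name} to {node.name}")
                break
        else:
            queue.append(job)  # not enough free resources yet; retry on the next pass

# A real scheduler would repeat this pass at a configurable interval.
nodes = [Node("node-a", free_cpus=32, free_gpus=2, free_mem_gb=128)]
queue = deque([Job("train-detector", cpus=16, gpus=2, mem_gb=64)])
schedule_once(queue, nodes)  # prints: allocated train-detector to node-a
```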

Microservices have replaced monolithic architectures to support scalable software that is composed of smaller applications (containers) that communicate through language-independent interfaces. The use of containers has exploded in recent years and this trend is expected to continue for the foreseeable future. Managing complex container deployments is known as orchestration, and Kubernetes is the most popular container orchestration platform available. In fact, Kubernetes is one of the most successful open-source projects of all time. The DoD is moving to a microservices infrastructure, with many systems being designed for a microservices framework from the start.

Kubernetes in the Department of Defense (DoD)

The DoD Enterprise DevSecOps reference design mandates the use of Cloud Native Computing Foundation-compliant Kubernetes clusters and other open-source technologies to achieve DoD-wide continuous Authority to Operate (ATO).

Modern software infrastructure is built on a microservices framework, which leverages containers to run software reliably when moved from one computing environment to another. With the growth of Artificial Intelligence (AI), Machine Learning (ML), and cybersecurity, a critical need has emerged for DevSecOps in the U.S. DoD to solve the problem of long software development and delivery cycles. A primary focus of the DoD’s DevSecOps initiative is avoiding vendor lock-in. Therefore, the DoD mandated Open Container Initiative (OCI) containers with no vendor lock-in to containers or container runtimes/builders. Since containers are immutable, this allows the DoD to accredit and harden them. The DoD also mandated Cloud Native Computing Foundation (CNCF)-compliant Kubernetes clusters for container orchestration, with no vendor lock-in for orchestration options, networking, or storage APIs.

Kubernetes brings the DoD many advantages:

  • Resiliency: when a container fails or crashes, it can be automatically restarted, providing a self-healing capability.
  • Baked-in security: the DoD’s Sidecar Container Security Stack (SCSS) can be automatically injected into any Kubernetes cluster with Zero Trust.
  • Adaptability: there is no downtime when swapping out modular containers.
  • Automation: the GitOps model and Infrastructure as Code (IaC) enable automation.
  • Auto-scaling: Kubernetes automatically scales based on compute/memory needs.
  • Abstraction layer: since Kubernetes is managed by the CNCF, there is no fear of getting locked into cloud APIs or a specific platform.

The DoD is moving to cloud-native environments and microservices, with many systems now being designed for a microservices framework from the start. Kubernetes is quickly becoming the foundation for all software in the DoD, from jets to bombers to ships. Kubernetes is running across systems throughout the DoD, residing on embedded systems, at the edge, and in the cloud. In 2019, a team at Hill Air Force Base in Utah successfully demonstrated Kubernetes on an F-16 jet. Currently, teams are working on building applications on top of Kubernetes for all facets of weapons systems, from space systems to nuclear systems to jets.

Tovi: Modernizing HPC Systems Throughout Academia, Industry, and Government

Modernizing HPC infrastructure allows all services to be run under one environment, which brings both increased efficiency to all services, not just HPC workloads, and decreased management overhead for system administrators.

High performance computing (HPC) systems are ubiquitous in academic, industrial, and government organizations and are used for a wide range of applications, from basic research all the way to operational systems. Depending on the application, an HPC system can consist of a single compute server (i.e., node) or of hundreds or even thousands of nodes networked together. HPC continues to experience rapid adoption as a game-changing technology and decision-making aid. In addition, with the ever-growing volume and velocity of data across the battlefield and the increasing use of Artificial Intelligence (AI) in theater, HPC systems are expected to also be deployed at the tactical edge. This will lead to even further adoption of HPC systems as they move beyond stationary emplacements to mobile, forward-deployable systems.

While HPC systems leverage the latest software and hardware to achieve remarkable processing capabilities, there is a fundamental problem with current HPC infrastructure: HPC workload management solutions (e.g., SLURM, HTCondor, IBM LSF) are not compatible with modern microservices architecture. Microservices enable applications to be built more easily by breaking them down into smaller components (i.e., containers) that work together by communicating through language-independent interfaces, and they have largely replaced monolithic architectures for scalable software. Indeed, containers are an increasingly popular method to enable software to run reliably when moved from one computing environment to another. The use of containers has exploded in recent years, and this trend is expected to continue for the foreseeable future. The majority of containers are orchestrated, with Kubernetes being the most popular container orchestration platform available. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Managed Kubernetes services have seen significant growth in recent years and remain the standard for managing container environments in the cloud.

While Kubernetes was built to orchestrate applications of loosely coupled, containerized services, the types of applications that run on Kubernetes are very different from those that run on HPC systems. Indeed, HPC applications are designed to run to completion, leveraging resources optimally, whereas Kubernetes applications usually run continuously. Microservices users typically leverage containers for speed and modularity, whereas HPC users are more focused on portability and the ability to encapsulate software with containers. While containers are extremely valuable for a wide range of HPC applications, making the switch to containers can be very difficult; indeed, HPC applications are difficult to deploy on Kubernetes. Tovi was designed to modernize HPC clusters, enabling HPC users to benefit from Kubernetes orchestration. Tovi provides a robust and reliable solution to run HPC workloads with Kubernetes, enabling organizations to simultaneously reduce costs and maintain maximum flexibility. Because Tovi is built as a Kubernetes application, Tovi users benefit from Kubernetes container orchestration while maintaining all the benefits of a reliable and robust HPC workload manager. Tovi is simple, easy to use, and designed with the demands of a wide range of HPC applications in mind. Tovi lets users request resources instead of requesting time, and it handles the scheduling to maximize resource utilization and minimize wait times. Moreover, since Tovi is built as a Kubernetes application, adding and removing resources, updating system settings, and customizing the deployment is painless and often requires little or no downtime.

Tovi Solution Brief: Run HPC Workloads with Kubernetes

Tovi is an innovative solution that enables organizations to deploy High Performance Computing (HPC) applications on Kubernetes. Tovi makes it easy for HPC sites to modernize their software infrastructure and switch to containers.

Popularity of Containers and Kubernetes

Modern software infrastructure is built on a microservices framework. Indeed, containers are an increasingly popular method to enable software to run reliably when moved from one computing environment to another. The use of containers has exploded in recent years, and this trend is expected to continue for the foreseeable future. The majority of containers are orchestrated, with Kubernetes being the most popular container orchestration platform available. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Managed Kubernetes services have seen significant growth in recent years and remain the standard for managing container environments in the cloud.

Containers, Kubernetes, and HPC Applications

Containers are extremely valuable for a wide range of HPC applications; however, making the switch to containers can be very difficult. While Kubernetes is exceptional for orchestrating containers, it is very challenging to run HPC workloads with Kubernetes. Indeed, HPC applications are difficult to deploy on Kubernetes because Kubernetes workloads are typically long-running services, whereas HPC applications run to completion, often demand low-latency and high-throughput scheduling to execute jobs in parallel across many nodes, and often require specialized resources like GPUs or access to limited software licenses.
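
One way to see the gap: a tightly coupled parallel job needs all of its slots at the same time, whereas a typical container scheduler is content to place workloads one at a time as space appears. The sketch below, written in Python with invented data for this brief, illustrates that all-or-nothing requirement; it is conceptual only and not how Kubernetes or Tovi is implemented.

```python
def can_start_parallel_job(free_gpus_per_node: dict, nodes_needed: int, gpus_per_node: int) -> bool:
    """A tightly coupled parallel job starts only if enough nodes are free at the same time."""
    candidates = [n for n, free in free_gpus_per_node.items() if free >= gpus_per_node]
    return len(candidates) >= nodes_needed

cluster = {"node-a": 4, "node-b": 2, "node-c": 4}  # free GPUs per node (hypothetical)

# A job spanning 2 nodes with 4 GPUs each can start now:
print(can_start_parallel_job(cluster, nodes_needed=2, gpus_per_node=4))  # True

# A job spanning 3 nodes with 4 GPUs each cannot, even though 10 GPUs are free in total:
print(can_start_parallel_job(cluster, nodes_needed=3, gpus_per_node=4))  # False
```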

Running Kubernetes Orchestration and HPC Workloads

Organizations seeking to deploy HPC applications on Kubernetes can try some of the following limited solutions:

  • Support separate HPC and containerized infrastructures: This approach may have some utility for certain organizations that are already heavily invested in HPC infrastructure. However, this option increases infrastructure and management costs since it requires deploying new containerized applications on a separate cluster from the HPC cluster.
  • Use an existing HPC workload manager and run containerized workloads: This may be a viable option for organizations with simple requirements and a desire to maintain their existing HPC scheduler. However, such an approach will preclude access to native Kubernetes features and consequently may constrain flexibility in managing long-running services where Kubernetes excels.
  • Use Kubernetes native job scheduling features: This may be a viable option for organizations that have not invested much in HPC applications, but it is not practical for the majority of HPC users.

Solution: Tovi

Tovi was designed to address all the shortcomings above, providing a robust and reliable solution to run HPC workloads with Kubernetes and enabling organizations to simultaneously reduce costs and maintain maximum flexibility. Tovi is built as a Kubernetes application, which means Tovi users can benefit from Kubernetes container orchestration and at the same time maintain all the benefits of a reliable and robust HPC workload manager. Tovi is simple, easy to use, and designed with the demands of a wide range of HPC applications in mind. Tovi lets users request resources instead of requesting time; Tovi handles the scheduling to maximize resource utilization and minimize wait times. Moreover, since Tovi is built as a Kubernetes application, adding and removing resources, updating system settings, and customizing the deployment is painless and often requires little or no downtime.