Tovi Solution Brief: Modernizing High Performance Computing Throughout Academia, Industry, and Government

Tovi is an intelligent resource manager and job scheduler that enables HPC workloads on Kubernetes for embedded, on-premise, cloud, and hybrid infrastructures. Tovi is easy to use, framework-independent, and highly customizable, and it helps HPC systems unlock resources that were previously underutilized. Tovi radically improves the speed of model development and data analysis for applications with high computational requirements, including AI/ML, GEOINT, video/image processing, training, high-fidelity simulations, weapons/navigation systems, autonomous platforms, and C5ISR.
Tovi was designed to address key capability gaps in software modernization. Specifically, Tovi enables HPC software infrastructure to modernize to Kubernetes, the de facto standard for modern software architecture. Kubernetes is quickly becoming the foundation for all software in the DoD, from jets, bombers, and ships to weapons, space, and nuclear systems, and it can reside at the edge, on premise, and in the cloud. Indeed, the DoD Enterprise DevSecOps reference design mandates the use of Kubernetes clusters and other open-source technologies to achieve DoD-wide continuous Authority to Operate. However, current HPC software infrastructure is fundamentally incompatible with Kubernetes container orchestration. Therefore, we developed Tovi, an intelligent resource manager and job scheduler that enables HPC workloads on Kubernetes for embedded, on-premise, cloud, and hybrid infrastructures. Tovi radically improves the speed of model development and data analysis for applications with high computational requirements. Indeed, Tovi addresses the Army’s growing need for high-performance computational capabilities to exploit large data sets, compute complex AI/ML algorithms, and enhance operational activities and training readiness. Tovi will radically improve power efficiency, edge computing, and synthetic environments. In addition, Tovi will enable AI/ML to reduce the cognitive burden on humans and improve overall performance through human-machine teaming.

Modernizing HPC infrastructure to Kubernetes allows all services to be run under one environment, which brings increased efficiency to all services, not just HPC workloads, and also decreased management overhead. In addition, Tovi allows platforms to retain their existing data pipelines. Tovi is exceptionally flexible and compatible with a variety of other tools, enabling customization for specific deployment needs. Rather than forcing systems to use resources a certain way, Tovi helps unlock resources that were previously underutilized. Tovi modernizes any parallel computing system, from on-premise research clusters to edge systems in the military and beyond.

There is an urgent need to modernize HPC platforms to maximize resource utilization while simultaneously gaining all the advantages of Kubernetes and containerization; a microservice architecture facilitates rapid prototyping and deployment while increasing security. In addition, current HPC systems are neither scalable, modular, nor fault-tolerant, and current HPC workload management solutions are difficult to use and even harder to maintain in production environments. By enabling HPC workloads on Kubernetes, Tovi unlocks all the advantages of Kubernetes for military and industrial systems with high computational requirements. In addition, Tovi will improve security, enable new applications, help reduce costs, and minimize vendor lock-in. Another advantage of Tovi is the speed at which HPC application updates can be deployed: rolling updates can be implemented without downtime, making HPC applications more stable. In summary, by modernizing HPC systems to Kubernetes, Tovi enables flexibility, power, and scalability that will pay off in the long term by radically reducing costs.

HPC has spread to many different environments and applications thanks to the capabilities of commodity COTS hardware (CPUs, GPUs, FPGAs). HPC use in the military has been growing and continues to grow. The scale of impact is massive: Tovi can be used for any application with high computational requirements in which resources are aggregated for efficient parallel processing to solve advanced mathematical calculations and perform data processing.

Tovi is easy to use and highly customizable. Many HPC workload solutions are designed as “end-to-end” pipelines, locking you into a workflow and/or ecosystem with little flexibility or customization. Tovi was designed with the Linux philosophy of “do one thing, and do it well”: Tovi efficiently allocates jobs to shared resources. Other job schedulers are difficult to use, and even harder to manage in dynamic production systems. Tovi is simple, easy to use, and designed for a wide range of HPC applications. Tovi is built as a Kubernetes application, which means adding and removing resources, updating system settings, and customizing deployment is painless and often requires little or no downtime. Tovi lets users request resources instead of time and handles the scheduling to maximize resource utilization and minimize wait times. Also, Tovi reduces HPC administrative costs by cutting required headcount and by allowing the consolidation of previously distinct services under a single Kubernetes-based system.
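To illustrate the “request resources instead of time” model, a job request might look like the following sketch. Every field name here is hypothetical, an assumption for illustration only, not Tovi’s actual request format:

```yaml
# Hypothetical job request -- field names are illustrative, not Tovi's actual schema
job:
  name: resnet-training
  resources:            # users ask for resources, not wall-clock time
    cpus: 8
    gpus: 2
    memory: 32Gi
  command: python train.py --epochs 50
  data: /shared/datasets/imagery   # placeholder shared-directory path
```

The scheduler, rather than the user, then decides when and where the job runs, which is what lets it maximize utilization and minimize wait times.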

At the heart of the Tovi system is an intelligent job scheduler and resource manager to allow for GPU, CPU, and memory allocation across a network of devices. Tovi provides a seamless user interface to pool resources from a variety of hardware, including servers, idle workstations, and even edge devices. Tovi is a distributed application which uses a client-server architecture to connect remote users with shared resources. The Tovi Server application is deployed to a Kubernetes cluster and handles job scheduling by efficiently tracking and allocating processes as the required compute resources become available. The Tovi Server has three major components: the receiver, the manager, and the scheduler. The receiver listens for requests from the Tovi client application, authenticates the user, and forwards the request to the manager. The manager processes the request by type, copying the submitted files to the appropriate locations. Job submissions are added to a queue which is managed by the scheduler. The scheduler scans cluster machines at regular intervals, allocating jobs from the queue as resources become available. The scheduler is highly customizable, so this behavior can be set as desired. The Tovi Client can be used as a command line interface (CLI) tool, or via a web application. Remote users can create and manage projects, upload data to a shared directory, and submit jobs to be run on the server. Requests are defined by a special tovi.yaml file and can powerfully encode complex experiment pipelines. The web app expands this functionality by providing a graphical user interface with which users can browse libraries of datasets and results. 
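The scheduler’s allocation loop described above can be sketched in a few lines of Python. This is an illustrative sketch under simple assumptions (first-fit placement, one scheduling pass over the queue); the names `Job`, `Node`, and `schedule` are hypothetical and not Tovi’s actual code:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int
    gpus: int
    mem_gb: int


@dataclass
class Node:
    name: str
    free_cpus: int
    free_gpus: int
    free_mem_gb: int

    def fits(self, job: Job) -> bool:
        # A job fits only if every requested resource is available
        return (job.cpus <= self.free_cpus
                and job.gpus <= self.free_gpus
                and job.mem_gb <= self.free_mem_gb)

    def allocate(self, job: Job) -> None:
        self.free_cpus -= job.cpus
        self.free_gpus -= job.gpus
        self.free_mem_gb -= job.mem_gb


def schedule(queue: deque, nodes: list) -> list:
    """One scheduling pass: walk the queue in order and place each job
    on the first node with enough free CPU, GPU, and memory.
    Jobs that fit nowhere are re-queued for the next pass."""
    placed = []
    for _ in range(len(queue)):
        job = queue.popleft()
        for node in nodes:
            if node.fits(job):
                node.allocate(job)
                placed.append((job.name, node.name))
                break
        else:
            queue.append(job)  # resources not yet available; try again later
    return placed
```

In the architecture above, a pass like this would run at each scan interval, and, as the text notes, the placement policy itself is customizable.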

Microservices have replaced monolithic architectures: scalable software is now composed of smaller applications (containers) that communicate through language-independent interfaces. The use of containers has exploded in recent years, and this trend is expected to continue for the foreseeable future. Managing complex container deployments is known as orchestration, and Kubernetes is the most popular container orchestration platform available. In fact, Kubernetes is one of the most successful open-source projects of all time. The DoD is moving to a microservices infrastructure, with many systems being designed for a microservices framework from the start.

Kubernetes in the Department of Defense (DoD)

The DoD Enterprise DevSecOps reference design mandates the use of Cloud Native Computing Foundation-compliant Kubernetes clusters and other open-source technologies to achieve DoD-wide continuous Authority to Operate (ATO).

Modern software infrastructure is built on a microservices framework, which leverages containers to run software reliably when moved from one computing environment to another. With the growth of Artificial Intelligence (AI), Machine Learning (ML), and cybersecurity, a critical need has emerged for DevSecOps in the U.S. DoD to solve the problem of long software development and delivery cycles. A primary focus of the DoD’s DevSecOps initiative is avoiding vendor lock-in. Therefore, the DoD mandated Open Container Initiative (OCI) containers, with no lock-in to specific containers or container runtimes/builders. Because containers are immutable, the DoD can accredit and harden them. The DoD also mandated Cloud Native Computing Foundation (CNCF)-compliant Kubernetes clusters for container orchestration, with no vendor lock-in for orchestration options, networking, or storage APIs.

Kubernetes brings the DoD many advantages:

1) Resiliency: when a container fails or crashes, it can be automatically restarted, providing a self-healing capability.
2) Baked-in security: the DoD’s Sidecar Container Security Stack (SCSS) can be automatically injected into any Kubernetes cluster with Zero Trust.
3) Adaptability: there is no downtime when swapping out modular containers.
4) Automation: the GitOps model and Infrastructure as Code (IaC) enable automation.
5) Auto-scaling: Kubernetes automatically scales based on compute/memory needs.
6) Abstraction layer: since Kubernetes is managed by the CNCF, there is no fear of getting locked in to cloud APIs or a specific platform.
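The no-downtime container swapping in point 3 is configured declaratively. A minimal sketch of a Kubernetes Deployment whose rolling-update strategy never takes a replica down before its replacement is ready (the names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hpc-service            # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # keep every old replica serving until its replacement is ready
      maxSurge: 1              # bring up one new pod at a time
  selector:
    matchLabels:
      app: hpc-service
  template:
    metadata:
      labels:
        app: hpc-service
    spec:
      containers:
      - name: worker
        image: registry.example.com/hpc-worker:2.0   # placeholder image
```

Updating the image field then triggers a rollout in which new pods replace old ones gradually, with the service staying available throughout.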

The DoD is moving to cloud-native environments and microservices, with many systems now being designed for a microservices framework from the start. Kubernetes is quickly becoming the foundation for all software in the DoD, from jets to bombers to ships, and it runs on systems throughout the department, whether on embedded hardware, at the edge, or in the cloud. In 2019, a team at Hill Air Force Base in Utah successfully demonstrated Kubernetes on an F-16 jet. Currently, teams are building applications on top of Kubernetes for all facets of weapons systems, from space systems to nuclear systems to jets.

Benefits of Using Kubernetes

Kubernetes has taken the industry by storm and its popularity continues to grow. In fact, Kubernetes is one of the most successful open-source projects of all time. Experts initially believed that Kubernetes would only be used by large companies. However, it has become increasingly clear that all companies, large and small, are poised to benefit from switching to Kubernetes.

Modern software infrastructure is built on a microservices framework. Indeed, containers are an increasingly popular method to enable software to run reliably when moved from one computing environment to another. The use of containers has exploded in recent years and this trend is expected to continue for the foreseeable future. The majority of containers are orchestrated, with Kubernetes being the most popular container orchestration platform available.

Kubernetes is an extremely popular open-source orchestrator for deploying and managing containerized applications at scale for cloud, multi-cloud, on-premise, and hybrid environments. Often called the “Linux of the cloud,” Kubernetes is flexible, scalable, and open source. Kubernetes enables organizations to deploy modern applications that are scalable, modular, and fault-tolerant, thereby freeing up developers from manual tasks around infrastructure management and leading to significant productivity gains.

Kubernetes accelerates developer productivity, improves security, enables new applications, helps reduce costs, minimizes vendor lock-in, and streamlines the task of managing containers. The adoption of Kubernetes within enterprise IT environments is significant and continues to grow. Kubernetes has proven robust across a wide range of production environments and is currently used by thousands of IT teams, and more and more businesses are switching to Kubernetes to solve their container orchestration needs. One reason Kubernetes has been adopted so quickly is that it delivers clear benefits to multiple stakeholders in an organization, including both operations teams and development teams.

In addition, Kubernetes is future-proof. All major cloud vendors support Kubernetes and also provide out-of-the-box solutions for implementation. The Kubernetes ecosystem is growing rapidly as new products supporting different needs on top of the Kubernetes platform are continuously being released.

Kubernetes is portable and flexible and therefore can work with virtually any type of program that runs containers as well as any type of underlying infrastructure, such as cloud, on-premise, or hybrid infrastructures. In fact, Kubernetes can even host workloads across multiple clouds.

Another advantage of Kubernetes is the speed at which application updates can be deployed: rolling updates can be implemented without downtime, which makes applications more stable.

Kubernetes stands apart from other container orchestration options and is the clear choice for managing modern container deployments in a manner that is efficient, flexible, and business-friendly. Not all companies are using Kubernetes, but more and more organizations will continue to modernize to Kubernetes, enabling flexibility, power, and scalability that will pay off in the long term.