Tovi is an intelligent resource manager and job scheduler that enables HPC workloads on Kubernetes for embedded, on-premise, cloud, and hybrid infrastructures. Tovi is easy to use, framework-independent, and highly customizable. It lets HPC systems use shared resources much more efficiently, unlocking capacity that was previously underutilized. As a result, Tovi significantly improves the speed of model development and data analysis for applications with high computational requirements, including AI/ML, GEOINT, video/image processing, training, high-fidelity simulations, weapons/navigation systems, autonomous platforms, and C5ISR.
Tovi was designed to address key capability gaps in software modernization. Specifically, Tovi enables HPC software infrastructure to modernize to Kubernetes, the de facto standard for modern software architecture. Kubernetes is quickly becoming the foundation for software across the DoD, from jets, bombers, and ships to weapons, space, and nuclear systems, and it can reside at the edge, on premise, and in the cloud. Indeed, the DoD Enterprise DevSecOps reference design mandates the use of Kubernetes clusters and other open-source technologies to achieve DoD-wide continuous Authority to Operate. However, current HPC software infrastructure is fundamentally incompatible with Kubernetes container orchestration. We therefore developed Tovi to enable HPC workloads on Kubernetes for embedded, on-premise, cloud, and hybrid infrastructures. Tovi addresses the Army’s growing need for high-performance computational capabilities to exploit large data sets, compute complex AI/ML algorithms, and enhance operational activities and training readiness. Tovi will also improve power efficiency, edge computing, and synthetic environments, and will enable AI/ML to reduce the cognitive burden on humans and improve overall performance through human-machine teaming.
Modernizing HPC infrastructure to Kubernetes allows all services to run in one environment, which increases efficiency for every service, not just HPC workloads, and reduces management overhead. In addition, Tovi allows platforms to retain their existing data pipelines. Tovi is flexible and compatible with a variety of other tools, enabling customization for specific deployment needs. Rather than forcing systems to use resources in a prescribed way, Tovi helps unlock resources that were previously underutilized. Tovi can modernize any parallel computing system, from on-premise research clusters to military edge systems and beyond.
There is an urgent need to modernize HPC platforms to maximize resource utilization while gaining the advantages of Kubernetes and containerization: a microservice architecture facilitates rapid prototyping and deployment while increasing security. Current HPC systems are not scalable, modular, or fault tolerant. Furthermore, current HPC workload management solutions are difficult to use and even harder to maintain in production environments. By enabling HPC workloads on Kubernetes, Tovi brings the advantages of Kubernetes to military and industrial systems with high computational requirements. In addition, Tovi will improve security, enable new applications, help reduce costs, and minimize vendor lock-in. Another advantage of Tovi is the speed at which HPC application updates can be deployed: rolling updates can be applied without downtime, which makes HPC applications more stable. In summary, by modernizing HPC systems to Kubernetes, Tovi provides flexibility, power, and scalability that will pay off in the long term by reducing costs.
HPC is used in a wide range of environments and applications because of the capabilities of COTS hardware (CPUs, GPUs, FPGAs), and its use in the military continues to grow. The potential impact is broad: Tovi can be used for any application with high computational requirements in which resources are aggregated for efficient parallel processing to solve advanced mathematical calculations and perform data processing.
Tovi is easy to use and highly customizable. Many HPC workload solutions are designed as “end-to-end” pipelines, locking you into a workflow or ecosystem with little flexibility or customization. Tovi was designed with the Unix philosophy of “do one thing, and do it well”: Tovi efficiently allocates jobs to shared resources. Other job schedulers are difficult to use and even harder to manage in dynamic production systems, whereas Tovi is simple and designed for a wide range of HPC applications. Tovi is built as a Kubernetes application, which means that adding and removing resources, updating system settings, and customizing a deployment are painless and often require little or no downtime. Tovi lets users request resources instead of time and handles the scheduling to maximize resource utilization and minimize wait times. Tovi also reduces HPC administrative costs by reducing the headcount needed to run the system and by allowing previously distinct services to be consolidated under a single Kubernetes-based system.
At the heart of the Tovi system is an intelligent job scheduler and resource manager that allocates GPU, CPU, and memory across a network of devices. Tovi provides a seamless user interface for pooling resources from a variety of hardware, including servers, idle workstations, and even edge devices. Tovi is a distributed application that uses a client-server architecture to connect remote users with shared resources.
The Tovi Server application is deployed to a Kubernetes cluster and handles job scheduling by tracking and allocating processes as the required compute resources become available. The Tovi Server has three major components: the receiver, the manager, and the scheduler. The receiver listens for requests from the Tovi Client application, authenticates the user, and forwards the request to the manager. The manager processes the request by type, copying the submitted files to the appropriate locations. Job submissions are added to a queue that is managed by the scheduler. The scheduler scans cluster machines at regular intervals, allocating jobs from the queue as resources become available. The scheduler is highly customizable, so this behavior can be adjusted as needed.
The Tovi Client can be used as a command line interface (CLI) tool or via a web application. Remote users can create and manage projects, upload data to a shared directory, and submit jobs to be run on the server. Requests are defined by a special tovi.yaml file, which can encode complex experiment pipelines. The web app expands this functionality with a graphical user interface in which users can browse libraries of datasets and results.
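To make the scheduler's role more concrete, the following is a minimal sketch of a single scheduling scan: jobs are taken from the queue in order and placed on the first machine with enough free CPU, GPU, and memory, and jobs that do not fit remain queued for the next scan. This is an illustration only, not Tovi's actual implementation. The names Job, Node, and schedule_pass, their fields, and the first-fit policy are all assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    """A queued job and the resources it requests (hypothetical fields)."""
    name: str
    cpus: int
    gpus: int
    memory_gb: int


@dataclass
class Node:
    """A cluster machine and its currently free resources (hypothetical fields)."""
    name: str
    cpus: int
    gpus: int
    memory_gb: int

    def fits(self, job: Job) -> bool:
        # A job fits only if every requested resource is available.
        return (self.cpus >= job.cpus
                and self.gpus >= job.gpus
                and self.memory_gb >= job.memory_gb)

    def allocate(self, job: Job) -> None:
        # Reserve the job's resources on this machine.
        self.cpus -= job.cpus
        self.gpus -= job.gpus
        self.memory_gb -= job.memory_gb


def schedule_pass(queue: deque, nodes: list) -> list:
    """Run one scan using a first-fit policy (an assumed policy).

    Each queued job is placed on the first node that can hold it. Jobs
    that fit nowhere stay in the queue, in their original order, for the
    next scan. Returns a list of (job name, node name) placements.
    """
    placements = []
    still_waiting = deque()
    while queue:
        job = queue.popleft()
        node = next((n for n in nodes if n.fits(job)), None)
        if node is None:
            still_waiting.append(job)
        else:
            node.allocate(job)
            placements.append((job.name, node.name))
    queue.extend(still_waiting)
    return placements


# Example: users request resources, not time slots.
queue = deque([
    Job("train", cpus=8, gpus=2, memory_gb=64),
    Job("preprocess", cpus=4, gpus=0, memory_gb=16),
    Job("simulate", cpus=16, gpus=4, memory_gb=128),
])
nodes = [
    Node("workstation-1", cpus=8, gpus=0, memory_gb=32),
    Node("server-1", cpus=16, gpus=2, memory_gb=128),
]
print(schedule_pass(queue, nodes))
# [('train', 'server-1'), ('preprocess', 'workstation-1')]
print([job.name for job in queue])
# ['simulate']
```

In this sketch the "simulate" job stays queued because no machine has four free GPUs. A real scheduler would repeat the scan at a configurable interval and release resources when jobs finish; both are omitted here for brevity.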
Microservices have increasingly replaced monolithic architectures in order to support scalable software composed of smaller applications (containers) that communicate through language-independent interfaces. The use of containers has grown rapidly in recent years, and this trend is expected to continue for the foreseeable future. Managing complex container deployments is known as orchestration, and Kubernetes is the most popular container orchestration platform available, as well as one of the most successful open-source projects to date. The DoD is moving to a microservices infrastructure, with many systems being designed for a microservices framework from the start.