With the growing integration of AI/ML into applications and business processes, production-grade ML models require more scalable infrastructure and compute power for training and deployment.
Modern ML algorithms train on large volumes of data and require billions of iterations to minimize their cost functions. Vertical scaling of such workloads runs into hardware limits, such as the number of CPUs and GPUs and the amount of storage that can be provisioned on a single machine, and has proven inefficient. More efficient parallel processing approaches, such as asynchronous training and allreduce-style synchronous training, require a distributed cluster where different workers learn simultaneously in a coordinated fashion.
Scalability is also important for serving DL models in production. Processing a single API request to the model prediction endpoint may trigger a complex processing logic that can take a significant amount of time. As more users are hitting the model’s endpoints, more serving instances are required to process client requests efficiently. Being able to serve ML models in a distributed and scalable way becomes essential to ensuring the usability of ML applications.
Addressing these scalability challenges in a distributed cloud environment is hard. MLOps engineers face the challenge of configuring interactions between multiple nodes and inference services while ensuring fault tolerance, high availability, and application health.
In this blog post, I’ll discuss how Kubernetes and Kubeflow can meet these scalability requirements for TensorFlow ML models. I’ll walk you through several practical examples, describing how to scale ML models with Kubeflow on Kubernetes.
First, I’ll discuss how to use TensorFlow training jobs (TFJobs) abstraction to orchestrate distributed training of TensorFlow models on Kubernetes via Kubeflow. Then I’ll show how to implement TF distribution strategies for synchronous and asynchronous distributed training. Finally, I’ll discuss various options for scaling TF models serving in Kubernetes—including KFServing, Seldon Core, and BentoML.
By the end of the article, you’ll have a better understanding of the basic K8s and Kubeflow abstractions as well as the tools available to scale your TensorFlow models, both for training and production-grade serving.
Scaling TF Models with Kubernetes and Kubeflow
Kubeflow is an open-source ML platform for Kubernetes, originally developed at Google. It builds on top of Kubernetes resources and orchestration services to implement complex automated ML pipelines for training and serving ML models.
Kubernetes and Kubeflow can be used together to implement efficient scaling of TensorFlow models. Key resources and features that enable scalability of TF models include the following:
- Manual scaling of K8s Deployments and StatefulSets using kubectl
- Autoscaling with the Horizontal Pod Autoscaler, based on a set of compute metrics (CPU, GPU, memory), or user-defined metrics (e.g., requests per second)
- Distributed training of TF models with the TFJob and the MPI Operator
- Scaling deployed TF models with KFServing, Seldon Core, and BentoML
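To illustrate the first two options concretely: a Deployment serving a model can be scaled manually with `kubectl scale deployment tf-model --replicas=3`, or automatically with a Horizontal Pod Autoscaler. The manifest below is a hedged sketch; the Deployment name `tf-model` and the thresholds are illustrative, not from a real cluster:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tf-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-model          # hypothetical Deployment serving the model
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

With this in place, Kubernetes adds or removes replicas of the serving Deployment to keep average CPU utilization near the target.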
In what follows, I’ll provide examples illustrating how to use some of these solutions to efficiently scale your TensorFlow models on Kubernetes.
Scalable TensorFlow Training with TFJob
Scaling TensorFlow training jobs in Kubernetes can be achieved via distributed training implemented using TF distribution strategies. There are two common types of distribution strategies for ML training: synchronous and asynchronous.
In synchronous training, workers train on different slices of the input data in parallel. Each worker performs its own forward and backward pass, and the resulting gradients are aggregated across workers at each step before a single, shared update is applied to the model variables.
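The core idea can be sketched in a few lines of plain Python (this is a conceptual illustration, not the actual TensorFlow API): each worker computes a gradient on its own data shard, the gradients are averaged, and one coordinated update is applied for everyone.

```python
# Toy synchronous training: model y = weight * x, mean-squared-error loss.

def worker_gradient(shard, weight):
    # Gradient of MSE on one worker's data shard.
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def synchronous_step(shards, weight, lr=0.05):
    # Every worker computes its gradient in parallel (simulated
    # sequentially here), then the gradients are averaged and a
    # single shared update is applied.
    grads = [worker_gradient(shard, weight) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return weight - lr * avg_grad

# Two workers, each holding a shard of data generated by y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weight = 0.0
for _ in range(200):
    weight = synchronous_step(shards, weight)
print(round(weight, 3))  # converges toward 3.0
```

In real TensorFlow code, strategies such as MirroredStrategy perform this gradient aggregation for you across GPUs or nodes.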
In contrast, in asynchronous training, workers train on their portions of the data independently, without waiting for one another. A common approach uses a central entity called a parameter server, which stores the model parameters: workers compute gradients locally, push their updates to the parameter server, and pull the updated parameters back.
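Again as a plain-Python sketch (not the actual TensorFlow API), the parameter-server pattern looks like this: the server holds the parameters and applies each worker's update as soon as it arrives, with no cross-worker synchronization.

```python
# Toy asynchronous training with a parameter server: y = weight * x.

class ParameterServer:
    def __init__(self, weight=0.0, lr=0.05):
        self.weight = weight
        self.lr = lr

    def pull(self):
        # A worker fetches the current parameters.
        return self.weight

    def push(self, grad):
        # Apply a worker's gradient immediately, without waiting
        # for other workers.
        self.weight -= self.lr * grad

def worker_step(ps, shard):
    w = ps.pull()
    grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    ps.push(grad)

ps = ParameterServer()
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
for _ in range(300):
    for shard in shards:   # workers push updates at different times
        worker_step(ps, shard)
print(round(ps.weight, 2))  # converges toward 3.0
```

In a real cluster, the pulls and pushes happen over the network and can interleave arbitrarily, which is exactly why TensorFlow's ParameterServerStrategy exists to manage this coordination for you.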
Implementing such a strategy in a distributed cluster is not trivial. In particular, workers should be able to communicate data and weights across nodes and coordinate their learning efficiently while avoiding errors.
To save developers time, TensorFlow implements various distributed training strategies in the tf.distribute.Strategy module. Using this module, ML developers can distribute training across multiple nodes and GPUs with minimal changes to their code.
The module implements several synchronous strategies, including MirroredStrategy, TPUStrategy, and MultiWorkerMirroredStrategy. It also implements an asynchronous ParameterServerStrategy. You can read more about available TF distribution strategies and how to implement them in your TF code in this article.
Kubeflow ships with the TF Operator and a custom TFJob resource that make it easy to run the TF distribution strategies mentioned above on Kubernetes. TFJob works with the distribution strategy defined in the containerized TF code and manages it using a set of built-in components and control logic. Components that make it possible to implement distributed training for TF in Kubeflow include:
- Chief: Orchestrates distributed training and performs model checkpointing
- Parameter Servers: Store model parameters and apply workers' updates during asynchronous distributed training
- Workers: Perform a learning task
- Evaluators: Compute and log evaluation metrics
The above components can be configured in the TFJob, a Kubeflow CRD for TensorFlow training. Below is a basic example of a distributed training job that relies on two workers that perform training without chiefs and parameter servers. This approach is suitable for implementing TF synchronous training strategies, such as MirroredStrategy.
```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker-training
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: <your-training-image>
              volumeMounts:
                - mountPath: "/train"
                  name: "training"
          volumes:
            - name: "training"
              persistentVolumeClaim:
                claimName: "training-pvc"
```
Here you see that, along with standard Kubernetes resources and services, such as volumes, containers, and restart policies, the spec includes a tfReplicaSpecs section, inside which you define a worker. Setting the worker replica count to two, along with defining a relevant distribution strategy in the containerized TF code, is enough to implement a synchronous strategy with Kubeflow.
When a TFJob is initialized, a new TF_CONFIG environment variable is created on every worker node. It contains the cluster specification (the addresses of all participating replicas) and the current replica's task type and index, which TensorFlow uses to coordinate distributed training. The tf-operator coordinates the training process by interacting with various K8s controllers and APIs and maintaining the desired state defined in the manifest.
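TF_CONFIG is a JSON document. The snippet below shows how training code can inspect it; the host names here are made up for illustration, but the overall cluster/task shape matches what distributed TensorFlow expects:

```python
import json
import os

# Hypothetical TF_CONFIG value, similar in shape to what the
# tf-operator sets on each replica: a cluster spec plus this
# replica's task identity.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker-0.training:2222", "worker-1.training:2222"]
    },
    "task": {"type": "worker", "index": 0}
})

tf_config = json.loads(os.environ["TF_CONFIG"])
num_workers = len(tf_config["cluster"]["worker"])
task = tf_config["task"]
print(num_workers, task["type"], task["index"])  # 2 worker 0
```

Strategies like MultiWorkerMirroredStrategy read this variable automatically; you rarely need to parse it by hand, but it is useful for debugging a misbehaving distributed job.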
The tf-operator also supports the asynchronous mode of training via the ParameterServerStrategy. Below is an example of a distributed training job with an asynchronous strategy managed by the tf-operator.
```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: parameter-server-training
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: <your-training-image>
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: <your-training-image>
```
TFJob is not the only way to implement distributed training of TensorFlow models with Kubeflow. An alternative solution is provided by the MPI Operator. Under the hood, the MPI Operator uses the Message Passing Interface (MPI) that enables cross-node communication between workers in heterogeneous network environments and via different communication layers. The MPI Operator can be used to implement Allreduce-style synchronous training for TF models in Kubernetes.
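For comparison, an MPIJob manifest has a similar shape but uses a launcher/worker split. The sketch below assumes the v2beta1 MPI Operator API and a hypothetical Horovod-based training image; field names may differ across operator versions:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tf-allreduce-training
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpi-launcher
              image: <your-horovod-image>   # hypothetical image
              command: ["mpirun", "-np", "2", "python", "/app/train.py"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: mpi-worker
              image: <your-horovod-image>
```

The launcher pod runs `mpirun`, which spawns the training processes on the worker pods and coordinates the allreduce gradient exchange between them.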
Scalable Serving of TensorFlow Models on Kubernetes
Scalable serving is critical for the production deployment of ML workloads because processing client requests to inference services may be an extremely time- and resource-intensive task. In this context, deployed models should be able to scale to multiple replicas and to serve multiple concurrent requests.
Kubeflow supports several serving options for TensorFlow models. Notable options include the following:
- TFServing is Kubeflow's way of deploying TensorFlow Serving (the serving component of the TFX ecosystem) on Kubernetes. TFServing allows you to expose ML models as REST APIs and offers many useful features, including service rollouts, automated lifecycle management, traffic splitting, and versioning. However, this option does not provide autoscaling.
- Seldon Core is a third-party tool that can be used with Kubeflow abstractions and resources. It supports multiple ML frameworks, including TensorFlow, and allows turning trained TF models into REST/gRPC microservices running in Kubernetes. Seldon Core supports model autoscaling by default.
- BentoML is another third-party tool used by Kubeflow to provide advanced model serving features, including autoscaling and high-performance API model servers with micro-batching support.
In the next section, I'll show how to autoscale trained TF models using KFServing, a module that is part of the default Kubeflow installation.
Autoscaling TensorFlow Models with KFServing
KFServing is a serverless platform that makes it easy to turn trained TF models into inference services accessible from outside the K8s cluster. KFServing enables networking and ingress using Istio, health checking, canary rollouts, point-in-time snapshots, traffic routing, and flexible server configuration for your deployed TF models.
Also, KFServing supports autoscaling of trained TF models out of the box. Under the hood, KFServing relies on Knative Serving autoscaling features. Knative has two implementations of autoscaling. One is based on the Knative Pod Autoscaler (KPA) tool and the other on the Kubernetes Horizontal Pod Autoscaler (HPA).
When you deploy the InferenceService via KFServing, the KPA is enabled by default. It supports scale to zero functionality, where the served model can be scaled to have zero remaining replicas when there is no traffic. The main limitation of the KPA is that it does not support CPU-based autoscaling.
If you don't have GPUs in your cluster, you can use the HPA-based autoscaler, which supports CPU-based autoscaling. However, it's not part of the default KFServing installation and should be enabled after KFServing is installed.
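If you do switch to the HPA-based autoscaler, Knative exposes it through revision annotations. The fragment below is a hedged sketch of the relevant annotations as I understand current Knative releases; verify the exact keys against your installed version:

```yaml
# Annotations on the InferenceService's pod template switching the
# autoscaler class to HPA with a CPU utilization target (illustrative):
metadata:
  annotations:
    autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
    autoscaling.knative.dev/metric: "cpu"
    autoscaling.knative.dev/target: "80"
```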
As I’ve said, by default, KFServing uses the KPA so you get autoscaling for your InferenceService immediately after it is deployed. You can customize the KPA behavior using the InferenceService manifest.
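For example, an InferenceService manifest along these lines sets the KPA concurrency target via an annotation. The model name and storage URI follow the well-known KFServing flowers sample; adapt them to your own model:

```yaml
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: flowers-sample
  annotations:
    autoscaling.knative.dev/target: "10"   # target concurrent requests per replica
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
```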
By default, the KPA scales models based on the average number of concurrent requests per pod. KFServing sets the default target concurrency to one, which means that if the service receives three concurrent requests, the KPA scales it to three pod replicas. You can customize this behavior by changing the autoscaling.knative.dev/target annotation, as in the example above, where it is set to 10. With that setting, the KPA adds replicas only as the number of concurrent requests per replica approaches 10.
Using KFServing you can configure other autoscaling targets. For example, you can scale trained models based on the average requests per second (RPS) using the requests-per-second-target-default annotation.
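For instance, the cluster-wide default RPS target can be set in Knative's autoscaler ConfigMap; the value below is illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  requests-per-second-target-default: "200"   # default RPS target per replica
```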
As I’ve shown in this article, Kubeflow provides a lot of useful tools for scaling TF model training and serving on Kubernetes. You can use Kubeflow to implement synchronous and asynchronous training with TF distribution strategies.
The tf-operator makes it easy to define the various components you need to perform distributed training efficiently in your K8s cluster. Kubeflow also supports the MPI Operator, a great solution for implementing Allreduce-style multi-node training over MPI.
Kubeflow has a great feature set when it comes to scaling trained TF models as well. Tools like KFServing allow you to customize the scaling logic, including the RPS and request concurrency targets, depending on your needs.
You can also use Kubernetes-native tools, like the HPA, to scale models based on user-defined metrics. And you can look into other serving tools, such as Seldon Core and BentoML. Both support autoscaling and offer many useful features for automating served model versioning, canary rollouts, updates, and lifecycle management.