Scalability is a core requirement for resource-intensive ML workloads: models that consist of many layers, billions of weights, and computationally expensive error-minimization logic. Serving client requests against complex models such as recurrent neural networks, deep convolutional networks, or generative models consumes significant CPU time and compute resources, and these resources need to scale dynamically once a model gains traction among users. ML models also need to be regularly retrained on newly ingested data, so growing datasets demand ever more compute power on demand.
Kubernetes is a powerful container orchestration platform that meets the scalability requirements of modern ML workflows and applications. It provides abstractions for stateful and replicated deployments, autoscaling, high availability, and service discovery, which make it straightforward to serve and retrain ML models at scale.
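As an example of these abstractions at work, here is a minimal sketch of a HorizontalPodAutoscaler manifest that scales a model-serving Deployment up and down based on CPU utilization. The Deployment name keras-serving and the replica bounds are assumptions for illustration, not part of any specific setup:

```yaml
# Minimal sketch: autoscale a hypothetical "keras-serving" Deployment
# between 2 and 10 replicas, targeting 70% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keras-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: keras-serving   # hypothetical Deployment serving the model
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

With a manifest like this applied, Kubernetes adds serving replicas as request load drives CPU usage above the target and removes them when load subsides.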
In this blog post, I show how to scale Keras ML models on Kubernetes using the Kubeflow ML platform. First, I discuss the Keras deep learning library and its key features, then walk through deploying Keras training jobs with Kubeflow. I cover Kubeflow's training job abstractions and how to implement distributed training of Keras models, and then turn to the Kubeflow and Kubernetes tools suitable for serving Keras models. In the last part of the tutorial, I show how to use Seldon Core and other serving frameworks to efficiently scale Keras models on Kubernetes in production.