With a growing number of applications and internal business processes now relying on machine learning (ML), the need to automate ML workflows and align them with DevOps practices is more urgent than ever.
However, tools for controlled collaboration and continuous integration of ML models have yet to match the maturity of traditional software CI/CD pipelines.
Add to this the complexity of ML workflows, which typically combine multiple building blocks such as ML libraries, development environments, and runtimes.
Most significant, however, is the fact that training production-grade ML models requires expensive compute infrastructure (GPUs/CPUs).
Despite high levels of technical competence, not all ML practitioners and data scientists have the expertise required to quickly launch, reproduce, and manage ML lifecycles. This includes reproducing ML development and testing environments, versioning model artifacts, and deploying trained models. The gap between ML experimentation and research, on the one hand, and production deployment of ML algorithms, on the other, is the main reason for the growing popularity of ML PaaS offerings.
ML PaaS offers ML practitioners many useful features, including preconfigured and customizable development environments, automated model training, logging, monitoring, and scalable model hosting. By freeing teams from the DevOps work of deploying and configuring compute environments and tools, these features let them focus on the core business problem.
In this article, I discuss FloydHub—one of the most popular ML PaaS offerings in the market today. To understand how FloydHub automates the ML pipeline, I’ll focus on the key components of the platform, including projects, workspaces, datasets, and ML jobs and metrics. I’ll also cover how to serve a trained model via FloydHub and make it accessible through a REST API.
How FloydHub Automates ML Workflows
FloydHub helps streamline the process of building, training, and deploying AI/ML models by providing preconfigured development environments, ML libraries, on-demand CPU and GPU resources, built-in metrics, logging, and security features.
FloydHub allows ML practitioners to run ML jobs on the cloud infrastructure they prefer without the hassle of deploying GPU and CPU machines in the cloud. It does all the heavy lifting—including the configuration of the Python runtime, Jupyter Notebook environment, running cloud servers, and securing workloads—under the hood.
FloydHub Core Components
The FloydHub platform is a set of Docker containers running on AWS EC2 instances. These instances are provisioned with specific GPU/CPU configurations, ML libraries, and development environments. Developers can access the platform through a web portal, through a well-documented REST API, or from the command line via the Floyd CLI.
FloydHub makes it easy to organize and automate the ML workflow through a set of core components: projects, workspaces, datasets, as well as jobs and metrics. Let’s discuss them in greater detail.
Projects
FloydHub projects organize the ML workflow. Similar to GitHub repositories, they contain code and all of your project resources (configuration files, automated tests, artifacts, etc.). But unlike local ML projects, FloydHub projects version all training iterations (jobs) and keep them organized for later reuse.
As with GitHub repositories, users are able to view the full project history. Inside the project, users can also create new workspaces (see below), train models, and create REST APIs for their models.
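As a rough sketch of this workflow using the Floyd CLI (the project name below is hypothetical):

```shell
# Authenticate the Floyd CLI with your FloydHub account
floyd login

# Link the current directory to a FloydHub project
# ("mnist-classifier" is a hypothetical project name)
floyd init mnist-classifier

# Submit the local code as a job; each run becomes a
# versioned iteration inside the project
floyd run "python train.py"
```

Every `floyd run` is recorded against the project, which is what makes iterations browsable and reusable later.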
Workspaces
Developing and running an ML model locally (e.g., on a laptop) often requires manually setting up the development and testing environments. This may include installing Python and configuring its isolated virtual environments, ML libraries (e.g., TensorFlow and PyTorch), JupyterLab, and web servers. Most importantly, you’ll need access to high-performance CPUs or GPUs to train your models.
With FloydHub, you can quickly set up such an environment using workspaces. A workspace is a cloud-based IDE (based on JupyterLab) that includes everything you need to develop and train your ML model’s code without deploying your own ML environment. FloydHub workspaces offer preconfigured access to the following useful features:
- Jupyter notebooks.
- On-demand FloydHub GPUs and CPUs for training your models, including Tesla K80 GPU, Tesla V100 GPU, Intel Xeon CPU, and other high-performance CPU- and GPU-based machines.
- Floyd CLI access to the FloydHub environment for running scripts.
- A set of supported development environments including PyTorch, TensorFlow (1 and 2), Theano, Caffe, MXNet, Chainer, and Torch.
In addition, you can easily share workspaces with team members, attach and save datasets, start/stop environments whenever you need them, and switch between different CPUs and GPUs.
The biggest advantage of FloydHub over local ML development is access to powerful processors and GPUs, which are very expensive to deploy on-prem. FloydHub’s on-demand pricing model requires no upfront payment. You can buy Powerups for additional CPU/GPU hours or extra storage, depending on your needs.
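Switching between environments and machine types happens at job submission time. A sketch of what that looks like with the Floyd CLI (environment names and flags vary by FloydHub release, so check floyd run --help; the script name is hypothetical):

```shell
# Run a job on a GPU instance with a specific framework image
floyd run --gpu --env tensorflow-2.1 "python train.py"

# Re-run the same code on a cheaper CPU machine for a quick smoke test
floyd run --cpu --env tensorflow-2.1 "python train.py --epochs 1"
```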
Datasets
When developing ML models locally, ML practitioners may have multiple copies of the same dataset in different projects. In contrast, FloydHub datasets are stored separately from a project’s code. Users can upload their datasets once and then “mount” them to the job they want.
Abstracting jobs from datasets makes it easy to reuse the same datasets in different jobs, saving significant time on each run. Also, because FloydHub datasets are versioned, keeping track of all dataset transformations during the data preparation and data engineering phases is simple. FloydHub also provides access to a number of popular public datasets such as MNIST, which is quite handy.
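The upload-once, mount-anywhere pattern looks roughly like this with the Floyd CLI (the username and mount path are hypothetical):

```shell
# Create and upload a dataset once...
floyd data init mnist
floyd data upload

# ...then mount a specific version of it into any job,
# at a path of your choosing (here /mnist)
floyd run --data alice/datasets/mnist/1:/mnist "python train.py"
```

Because the mount pins a dataset version (the /1 above), a job always sees exactly the data it was run against, which helps with reproducibility.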
Jobs and Metrics
FloydHub Jobs allow your local code to be run in the FloydHub ML environment using preconfigured libraries, runtimes, and compute resources (CPUs and GPUs). As the job is running, FloydHub emits all the logs and training metrics associated with it to your terminal and web interface and saves important job information such as submission time, datasets, and libraries used. Once the job is finished, users can also download the files it generated from the web interface.
In addition, FloydHub generates metrics collected from the Python script logs used in the job. Anything a job prints to stdout is parsed by FloydHub and is converted into metrics. Currently, FloydHub can identify and parse Keras logs.
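To illustrate the idea, here is a rough sketch of the kind of stdout parsing involved. The regular expression below is my own illustration, not FloydHub’s actual parser; it pulls metric names and values out of a Keras-style progress line:

```python
import re

# Keras prints per-epoch progress lines like the one below to stdout;
# a platform can recover metrics from them with a simple pattern.
METRIC_RE = re.compile(r"(\w+): (\d+\.\d+)")

def parse_keras_line(line):
    """Return {metric_name: value} for a Keras-style log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}

line = "10/10 [====] - 1s - loss: 0.2531 - accuracy: 0.9187"
metrics = parse_keras_line(line)
print(metrics)  # {'loss': 0.2531, 'accuracy': 0.9187}
```

This is why logging metrics in a consistent, parseable format pays off: the platform can chart them without any extra instrumentation on your side.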
As I’ve already discussed, all jobs run on FloydHub are recorded along with their outputs. These outputs can contain model checkpoints, such as the weights and biases generated by the model, which may be very useful during the ML experimentation stage.
For example, let’s say you run a job to test a new idea. If it works, you can save the model outputs and use them in another job. If the second job fails, you can return to the original iterations that worked and continue from there. Such an approach allows you to run experiments in a controlled way and audit the model development process.
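The checkpoint pattern looks roughly like this (the file names and serialization format are illustrative; on FloydHub, files written under /output are captured as the job’s output, while locally you can point the directory anywhere):

```python
import json
import os
import tempfile

# On FloydHub you would set OUTPUT_DIR=/output so the files are
# saved with the job; the tempfile fallback is for running locally.
CHECKPOINT_DIR = os.environ.get("OUTPUT_DIR") or tempfile.mkdtemp()

def save_checkpoint(weights, epoch):
    """Persist model weights so a later job can resume from them."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, f"checkpoint-{epoch}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "weights": weights}, f)
    return path

def load_checkpoint(path):
    """Reload a saved checkpoint in a follow-up experiment."""
    with open(path) as f:
        return json.load(f)

# Toy example: the "weights" are just a list of floats here.
path = save_checkpoint([0.1, -0.4, 0.7], epoch=3)
state = load_checkpoint(path)
print(state["epoch"])  # 3
```

In a real project you would save framework-native checkpoints (e.g., Keras HDF5 files or PyTorch state dicts) the same way, then mount the previous job’s output into the next job to continue from it.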
Serving FloydHub Models
Deploying a trained model requires at least a minimal knowledge of web programming in order to launch and configure a web server and create REST API endpoints. FloydHub makes it easy to serve your models without digging into web server configuration.
The floyd run --mode serve command is all that’s needed to activate a web server on FloydHub and create a REST API endpoint for querying your model. The only thing FloydHub users need to do is provide an app.py file with the boilerplate application code created using their favorite Python web framework.
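FloydHub’s own examples typically use Flask for app.py; to keep this sketch dependency-free, it uses only the standard library, but the shape is the same: a small web app with a prediction endpoint wrapping the model. The model, endpoint path, and port here are all illustrative assumptions:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "model": doubles each input value. In a real app.py
# you would load your trained model from the job's output instead.
def predict(values):
    return [2 * v for v in values]

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read the JSON request body and run it through the model.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"predictions": predict(payload["inputs"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def main(port=5000):
    # In app.py you would call main() at the bottom; FloydHub's serve
    # mode runs the file and routes incoming requests to your server.
    HTTPServer(("", port), Handler).serve_forever()
```

A client would then POST something like {"inputs": [1, 2]} to /predict and get back {"predictions": [2, 4]}.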
For companies and researchers seeking to bridge the gap between ML model development and production deployment, FloydHub addresses the main pain points. The platform takes care of all the important components of the ML workflow, simplifying the process of training, testing, and deploying both simple deep learning models and complex ML systems. This saves ML developers from investing their time and energy in configuring local development environments and provisioning costly training infrastructure.