Maintaining a catalog of machine learning models in production is no trivial task. Building confidence around deploying new models requires several considerations: evaluating the quality of training data, validating the correctness of feature input data, enforcing tight API contracts between services that interact with the models, and updating the dependencies required to run the model, among others. The considerations that involve data can be handled with testing and API validation. The remainder are where we start to run into trouble, sometimes resulting in lengthy iterations between engineering and machine learning teams.
There are several reasons we would want to replace models in production. We might make a database change to support an application feature, altering how data is stored and accessed by queries. The model might gain new features or updated dependencies to support feature design or improve modeling performance. Such changes can break the API of the model, sometimes requiring us to retrain all of our models or modify the surrounding interface if they share a common deployed service in our production stack. Models will also inherently become stale, as historical data distributions may slowly drift over time, resulting in inaccuracies.
This process can be painful and time consuming, as the back and forth of model handoff and bug-fixing between application engineers and machine learning engineers often requires retraining the model each time. Depending on the dataset, the model, and the resources available, retraining can take hours or days, meaning a new set of models could take a week or more to reach production. If our operation requires models to be replaced often, this type of iteration becomes several engineers’ full-time jobs.
We can mitigate some of this internally by cross-training our machine learning engineers to implement their own models into our services, or by first implementing dummy models trained on tiny datasets to quickly wire them up, then replacing them with fully trained models prior to deploying. Ideally, though, we want some level of automation that provides up-to-date, fully trained, and tested models that our engineers can implement with relative ease.
The goals of such a system include:

- automatically retraining models on a regular schedule, downstream of data ingestion;
- tracking every trained model, along with its metadata, metrics, dependencies, and evaluation artifacts, in a central store;
- letting applications fetch and swap specific model versions programmatically, with minimal code changes;
- providing regression and integration testing of new models before they reach production.
Let’s tackle these requirements one by one using Apache Airflow and MLflow.
The primary goal is to automatically train machine learning models. We want regularly scheduled tasks that retrain models as the training dataset is updated.
On the data engineering side, we rely on Apache Airflow to regularly ingest new data. Therefore, it makes sense to use the same tool to schedule machine learning training tasks immediately downstream of data ingestion.
For each model we can create an Airflow DAG that loads the most recently ingested training data, trains the model, and logs the resulting model and its artifacts to our tracking store.
This allows us to explicitly schedule training automation downstream of data ingestion, ensuring that newly trained models always use the most recently acquired data.
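As a sketch, such a DAG might look like the following. The DAG id, task id, schedule, and training callable are all illustrative assumptions, not the actual production pipeline; the real version would run downstream of the ingestion DAG and contain the full training logic.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def train_model():
    # Pull the most recently ingested data and retrain the model
    # (training and logging details elided in this sketch).
    ...


# Hypothetical daily training DAG, scheduled after data ingestion completes.
with DAG(
    dag_id="train_churn_model",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="train_and_log", python_callable=train_model)
```

In practice the training task would be chained after an ingestion sensor or a cross-DAG dependency so the freshest data is guaranteed to be available.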
For each training run, we want to ensure that we save the serialized machine learning model to an artifact store along with relevant metadata, metrics, required dependencies, and any additional artifacts or visualizations needed to evaluate the performance, strengths, and weaknesses of the model. For this use case we leverage MLflow.
MLflow provides a webserver UI and artifact store integration (S3) out of the box. This allows us to centralize our machine learning models into a single tracking application. We can audit models, evaluate performance, store and view metadata and other reporting data related to the models themselves.
By wrapping the MLflow client in our machine learning model codebase, we can also customize and standardize all of our tracking data and artifact structure for consistency across our models and implementations. This allows for a generalized implementation in applications downstream, as all of the tracked models can be accessed using the same code.
MLflow also provides a REST API, which allows end users to programmatically fetch different versions of models and their associated artifacts. This API allows us to build configuration at the service level around machine learning model implementations.
The API allows us to reference promoted models by their corresponding model IDs and versions as they are tracked in the MLflow application. This lets developers swap models in place easily, and it lets our continuous delivery pipelines download and integration-test model implementations at build time.
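As a sketch, a service can construct the REST call that fetches metadata (including the artifact location) for one registered model version. The host and model name here are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Hypothetical internal tracking-server host.
MLFLOW_HOST = "http://mlflow.internal:5000"


def model_version_url(name: str, version: str) -> str:
    """Build the MLflow REST endpoint that returns metadata for one
    registered model version (name and version as tracked in MLflow)."""
    query = urlencode({"name": name, "version": version})
    return f"{MLFLOW_HOST}/api/2.0/mlflow/model-versions/get?{query}"


url = model_version_url("churn-classifier", "3")
```

A build pipeline can issue this GET request, resolve the returned artifact location, and download the exact model version it is about to test.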
Combined with robust data validation, this gives the engineering team more confidence and less time spent updating applications with new models. Implementing a newly trained model should consist mostly of updating dependencies and modifying a configuration file to point to a new model version.
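To illustrate the configuration-driven swap, here is a minimal sketch: the config keys and model name are hypothetical, but the pattern is that promoting a new model is a one-line version bump that resolves to an MLflow model URI at service startup.

```python
# Hypothetical service-level config (e.g. loaded from a models.yaml file)
# pinning each application model to a registered name and version.
MODEL_CONFIG = {
    "churn_model": {"name": "churn-classifier", "version": "3"},
}


def model_uri(key: str) -> str:
    """Resolve a config entry to the MLflow model URI the service loads."""
    entry = MODEL_CONFIG[key]
    return f"models:/{entry['name']}/{entry['version']}"


uri = model_uri("churn_model")
```

Swapping in a newly promoted model then means changing `"version": "3"` to `"version": "4"` and redeploying, with no model code changes in the service itself.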
Robust regression and integration testing come relatively cheaply behind these powerful tools. Each run of a machine learning model acts as a regression test for the models themselves with each new code release. If we update our machine learning models, we will know within our next training schedule interval whether any newly pushed code has introduced bugs. We can also add application integration testing of these new models at the end of the Airflow DAGs for additional assurances upstream of publishing and deploying new application builds.
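A final DAG task of that kind can be as small as a smoke test. This sketch assumes a numpy-compatible model with a scikit-learn-style `predict`; the function name and checks are illustrative.

```python
import numpy as np


def smoke_test_model(model, n_features: int) -> None:
    """Hypothetical final task in the training DAG: verify the freshly
    trained model accepts production-shaped input and returns one
    prediction per input row before the run is considered a success."""
    sample = np.zeros((3, n_features))
    preds = model.predict(sample)
    assert len(preds) == 3, "expected one prediction per input row"
```

If the check raises, the Airflow task fails and the broken model never gets promoted, surfacing the problem before any application build picks it up.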