Advances in machine learning have more and more enterprises turning to the insights it provides. Data scientists are busy creating and fine-tuning machine learning models for tasks ranging from recommending music to detecting fraud. However, as is always the case with new technology, machine learning comes with its own set of challenges:
- Concept Drift - Model accuracy degrades over time as production data diverges from the data the model was trained on
- Locality - Pre-trained models’ accuracy levels change with changing demography/geography/customer
- Data Quality - Changes in data quality affect accuracy levels
- Scalability - Data scientists, while good at creating models, don’t necessarily have the skills to operationalize models at enterprise scale
- Process & Collaboration - A lot of models are developed and remain confined to sandboxes or within silos in an organization with no clearly defined process for the model lifecycle
- Model Governance - Most ML projects do not clearly define model governance: who can create models, who can deploy them, and what datasets were used for training
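One way to quantify the training vs production disparity described under Concept Drift is the Population Stability Index (PSI), which compares how a feature's values are distributed in two samples. The sketch below is a minimal, standard-library illustration and not part of any platform mentioned in this article; the function name `psi`, the bin count, and the sample data are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a training sample and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the feature is constant

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp values outside the training range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up feature values for illustration only.
training = [i / 100 for i in range(1000)]                 # seen at training time
production_same = [i / 100 for i in range(1000)]          # unchanged distribution
production_shifted = [5 + i / 200 for i in range(1000)]   # concentrated in the upper half

print(psi(training, production_same))
print(psi(training, production_shifted))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the feature and the business context, so it should be treated as a starting point rather than a fixed rule.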
A typical ML model lifecycle
Here’s what a machine learning model lifecycle looks like:
What is MLOps
According to Wikipedia, “MLOps (‘Machine Learning’ + ‘Operations’) is a practice for collaboration and communication between data scientists and operations professionals to help manage the production ML lifecycle. MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements. MLOps applies to the entire lifecycle - from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics.”
How is MLOps different from DevOps
So, is MLOps just another fancy name for DevOps? Since a machine learning system is also a software system, most DevOps practices apply to MLOps too. However, there are some important differences:
- Team skills: a machine learning team usually has data scientists and/or ML researchers who may be excellent with different modeling techniques and algorithms but lack the software engineering skills needed to build enterprise-grade production systems.
- Continuous Integration (CI) is not only about testing and validating code and components, but also testing and validating data, data schemas, and models.
- Continuous Deployment (CD) goes beyond deploying a package or service. It requires deploying an ML training pipeline that should automatically deploy another service (model prediction service).
- Continuous Training (CT) is unique to ML systems; it is concerned with automatically retraining and serving the models.
- Monitoring - ML uses non-intuitive mathematical functions. It requires constant monitoring to ensure it is operating within regulation and that the models are making accurate predictions.
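To make the CI point above concrete, here is a minimal sketch of a data schema check that could run alongside unit tests before a training pipeline is allowed to proceed. It uses only the Python standard library; the schema, the column names, and the `validate_rows` helper are hypothetical and chosen purely for illustration.

```python
from typing import Any, Dict, List

# Hypothetical schema for a training dataset: column name -> expected Python type.
EXPECTED_SCHEMA: Dict[str, type] = {
    "claim_id": str,
    "part_cost": float,
    "days_in_service": int,
}


def validate_rows(rows: List[Dict[str, Any]], schema: Dict[str, type]) -> List[str]:
    """Return a list of schema violations; an empty list means the data passed."""
    errors: List[str] = []
    for i, row in enumerate(rows):
        # Report missing and unexpected columns first.
        for col in sorted(schema.keys() - row.keys()):
            errors.append(f"row {i}: missing column '{col}'")
        for col in sorted(row.keys() - schema.keys()):
            errors.append(f"row {i}: unexpected column '{col}'")
        # Then check the type of every column that is present.
        for col, expected_type in schema.items():
            if col in row and not isinstance(row[col], expected_type):
                errors.append(
                    f"row {i}: '{col}' should be {expected_type.__name__}, "
                    f"got {type(row[col]).__name__}"
                )
    return errors


good = [{"claim_id": "C-1", "part_cost": 120.5, "days_in_service": 42}]
bad = [{"claim_id": "C-2", "part_cost": "120.5"}]  # wrong type and a missing column

assert validate_rows(good, EXPECTED_SCHEMA) == []
for problem in validate_rows(bad, EXPECTED_SCHEMA):
    print(problem)
```

In a CI job, a non-empty result would fail the build in the same way a failing unit test does, so that malformed data never reaches training.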
Implementing MLOps with Azure Machine Learning
Tavant’s Manufacturing Analytics Platform (TMAP) is an analytics and machine learning-based platform that provides important business insights to our customers in the manufacturing domain, especially Warranty. It is built on Azure, and we use Azure Machine Learning’s MLOps features to manage our models’ lifecycle. Here’s a list of features that Azure provides for MLOps:
- Workspace - An Azure Machine Learning workspace is the foundational resource that is used to experiment, train and deploy machine learning models
- Development Environment - Azure ML provides multiple pre-configured ML-specific VMs and compute instances. These come with most of the machine learning and deep learning libraries pre-installed and configured. One can also choose to create a local development environment if required
- Dataset - This step involves connecting to different data sources like Azure Blob Storage, Azure Data Lake, etc. and creating a Machine Learning Dataset. The Dataset can be used to access the data and its metadata when we create a Run
- Experiment & Runs - An experiment is a logical grouping of all the trials or runs. For each ‘Run’, you can log metrics, images, and data. All of these are attached to the corresponding ‘Run’ under the Experiment.
- Compute Target - A compute target is where your machine learning training runs. It can be your local machine or remote Azure GPU/CPU VMs
- Model Training - Azure ML already comes with multiple Estimators for scikit-learn, PyTorch, TensorFlow and Keras. These Estimators help you organize the ML training. Azure ML also lets you create Custom Estimators of your choice. All training logs, versions, and details are logged under the ‘Run’ of your experiment.
- Model Registry - Once the ML training is complete with different ‘Runs’ and you get the ‘best model’, the next step is to register the model in the Azure Model Registry. Model Registry maintains the model versions, descriptions of the model, model metadata, etc.
- Model Profiler - Before you deploy the model for real-time inference, it is important to profile the model's infrastructure requirements. Profiling gives you a better understanding of the minimum memory and CPU the model needs to deliver low latency and high throughput.
- Model Deployment - A model can be deployed to Azure Container Instances or Azure Kubernetes Service. This step involves specifying which model and version to deploy, along with the model's configuration and the deployment configuration.
- Data Collection - Captures real-time inputs and predictions from a deployed model so that the model's performance can be analyzed.
- Data Drift - Helps you understand changes in features, data quality issues, natural data shift, changes in the relationship between features, etc.
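The relationship between Experiments, Runs, and the Model Registry in the list above can be sketched in a few lines of plain Python. This is a conceptual illustration only: it does not use the Azure Machine Learning SDK, and the class names, method names, experiment name, and metric values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One training trial; collects the metrics logged during it."""
    run_id: int
    metrics: Dict[str, float] = field(default_factory=dict)

    def log(self, name: str, value: float) -> None:
        self.metrics[name] = value


@dataclass
class Experiment:
    """Logical grouping of runs."""
    name: str
    runs: List[Run] = field(default_factory=list)

    def start_run(self) -> Run:
        run = Run(run_id=len(self.runs) + 1)
        self.runs.append(run)
        return run

    def best_run(self, metric: str) -> Run:
        # Pick the run with the highest value for the given metric.
        return max(self.runs, key=lambda r: r.metrics[metric])


@dataclass
class ModelRegistry:
    """Keeps every registered version of a model together with its metadata."""
    versions: Dict[str, List[dict]] = field(default_factory=dict)

    def register(self, model_name: str, run: Run, description: str = "") -> int:
        entries = self.versions.setdefault(model_name, [])
        version = len(entries) + 1  # versions start at 1 and increase per registration
        entries.append({"version": version, "run_id": run.run_id,
                        "metrics": dict(run.metrics), "description": description})
        return version


experiment = Experiment("warranty-fraud")   # hypothetical experiment name
for accuracy in (0.81, 0.88, 0.85):         # made-up metric values for three trials
    run = experiment.start_run()
    run.log("accuracy", accuracy)

registry = ModelRegistry()
best = experiment.best_run("accuracy")
version = registry.register("fraud-model", best, description="best of three runs")
print(best.run_id, version)
```

Storing the originating run and its metrics with each version is what lets a team later answer governance questions such as which trial produced the model that is currently deployed.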
MLOps is a must for enterprises using machine learning at scale. It allows for managing the complete model lifecycle, including model governance, and should be made mandatory for all machine learning projects. Azure Machine Learning has a strong feature set for implementing MLOps. It does lack some advanced features such as model lineage, but one can always use dedicated MLOps platforms like MLflow or DotScience on Azure to bring in any missing features.