In machine learning, a model's performance stays largely stable as long as the environment and the data the model consumes stay fixed. When the environment or the data changes, the performance of the model changes as well. Where these variables are not stable, we need a model that can cope with the changes. Such a change in the performance of a model is known as model drift. In this article, we will discuss model drift and how detecting and handling it can be automated so that the model remains available. The major points to be covered in this article are listed below.
Table of Contents
Challenges in Machine Learning
Machine learning models differ from traditional models in how they deal with data and in how their performance behaves. Most machine learning models work on the basis of knowledge gained from past transactions; more formally, the performance of a model depends on the data it has learned from and on the data it receives as input. As the input varies, the performance of the model also varies. It follows that, after deployment, machine learning models need to be monitored in real time so that their performance can be kept within a certain range.
We can monitor the performance of a machine learning model with metrics such as precision, AUC, and recall. However, these metrics require labels for the predictions the model makes in real time. Labels are normally present in the training data, but in production they may not be available. In the absence of labels, these performance metrics cannot be computed directly, and visualizations of the inputs and predictions can instead serve as indicators of a performance issue.
What is Model Drift?
We can define model drift as a change in the performance of a model caused by changes in the data or by changes in the relationship between the input and output variables. In other words, a model has drifted when its predictions diverge from the results expected under the parameters set while training the model.
For machine learning models in production, model drift is a mismatch between the data on which the model was trained and the data the model receives in real time, which causes the quality of the predictions to change. The production data can change because the environment of the model changes or because the way the data is produced changes. We can say that there are four main types of model drift.
Concept drift, then, is a dissimilarity between the real decision boundary and the one learned by the model. When it occurs, the model needs to be trained again on current data so that its accuracy and error rates stay within range. Model drift can be an indicator that real ground-truth labels are unavailable, and it is also an indicator of changes in the environment. However, if we can measure the cause of model drift, we can build decision boundaries with some tolerance, so that the model predicts accurately even when some kind of drift is present. There can be many causes of drift; some of them are listed below.
Causes of Model Drift
Drift can occur in machine learning models for many reasons.
Externalities can cause many changes in the data distributions. In such cases, we need to repeat the modelling procedure with the updated data set. For example, the email categories may change because of a change in the business type.
When this type of issue occurs, we need to investigate the data. For example, faulty data engineering can alter the data even if the correct data was entered at the source, or the wrong data may have been entered at the source.
How to Detect Model Drift?
There are various methods for detecting model drift; some of them are listed below.
This can be considered an accurate way of detecting model drift: we compare the values predicted by the model with the actual values. We can observe drift if the predicted values deviate substantially from the actual values.
Various metrics can be used to measure accuracy. One well-known metric is the F1 score, which combines the precision and the recall of the machine learning model.
The image is a representation of the precision and recall of predictive modelling. Whenever a metric falls outside a chosen threshold range, we can assume that there is model drift.
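The threshold check described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the baseline F1 of 0.85, the tolerance of 0.05, and the small labelled window are all assumptions made for the example, not values from the article.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_drift_alert(y_true, y_pred, baseline_f1, tolerance=0.05):
    """Flag drift when F1 falls more than `tolerance` below the baseline."""
    _, _, f1 = precision_recall_f1(y_true, y_pred)
    return f1 < baseline_f1 - tolerance, f1


# Hypothetical labelled production window (assumed data, for illustration)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
drifted, f1 = f1_drift_alert(y_true, y_pred, baseline_f1=0.85)
print(f"F1 = {f1:.3f}, drift suspected: {drifted}")
```

Note that this check needs ground-truth labels for the production window, so it only applies once the actual outcomes have become available.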
This is a nonparametric test that compares the cumulative distributions of two data sets. To measure model drift, we can use it to compare the training data with post-training data. The null hypothesis of the test is that the two data sets being compared have identical distributions; in our case, rejecting the null hypothesis is an indication of model drift.
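The description above matches the two-sample Kolmogorov-Smirnov test, and the sketch below assumes that this is the test meant. It computes the largest gap between the two empirical cumulative distributions and compares it with the usual large-sample critical value. The function names, the significance level of 0.05, and the simulated data are assumptions for illustration; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step past every observation equal to x in both samples
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_drift(sample_a, sample_b, alpha=0.05):
    """Reject 'same distribution' using the large-sample approximation."""
    n, m = len(sample_a), len(sample_b)
    d = ks_statistic(sample_a, sample_b)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return d > critical, d, critical


# Simulated feature values (assumed data): production is shifted by +1
random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(500)]
prod = [random.gauss(1.0, 1.0) for _ in range(500)]
drifted, d, critical = ks_drift(train, prod)
print(f"D = {d:.3f}, critical value = {critical:.3f}, drift: {drifted}")
```

The critical value used here is an approximation that is reasonable for samples of this size; for small samples an exact p-value is preferable.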
The population stability index (PSI) measures changes in the distribution of a variable over time. It is a well-known metric for measuring changes in a population’s characteristics, and it can help us measure model drift.
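As a rough sketch, PSI can be computed by binning a feature on the training data and summing (actual share - expected share) * ln(actual share / expected share) over the bins. The equal-width binning, the number of bins, the small `eps` used to avoid taking the logarithm of zero, and the example data are all assumptions made here. The 0.1 and 0.25 cut-offs in the comment are a commonly quoted rule of thumb, not something stated in this article.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the training range into the edge bins
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Assumed data: the production feature is the training feature shifted by 30
train = [float(v) for v in range(100)]
prod = [v + 30.0 for v in train]
score = psi(train, prod)
# Rule of thumb (assumption): < 0.1 stable, 0.1 to 0.25 moderate, > 0.25 large shift
print(f"PSI = {score:.2f}")
```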
Using the z-score, we can compare the distribution of a feature between two data sets; in our case, between the training data and the production data of the model. If several production data points of a given variable have a z-score of +/- 3 or beyond, there is a shift in the distribution.
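A minimal sketch of this check follows, assuming that the z-score of each production value is computed against the mean and standard deviation of the training data and that "+/- 3" means an absolute z-score of at least 3. The function name and the sample values are hypothetical.

```python
import statistics


def z_score_outlier_fraction(train_values, prod_values, threshold=3.0):
    """Share of production points whose |z| against the training data >= threshold."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        raise ValueError("training feature has zero variance")
    flagged = [v for v in prod_values if abs((v - mean) / std) >= threshold]
    return len(flagged) / len(prod_values)


# Assumed values of one feature in training and in production
train = [10, 12, 11, 9, 10, 11, 10, 9, 12, 11]
prod = [10, 11, 15, 16, 9, 17]
fraction = z_score_outlier_fraction(train, prod)
print(f"{fraction:.0%} of production points lie 3 or more standard deviations out")
```

How large a flagged share counts as "several" is a judgment call that depends on the feature and the volume of data.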
Ways to Automate Model Drift Handling
As we have seen, model drift shows up as increased losses produced by the machine learning model, and it can be detected with the methods given above. In production, however, we also need a remedy for model drift, and dealing with it manually is costly and time-consuming. We therefore need to handle model drift automatically, so that the modelling pipeline can detect drift by itself and make the required changes to the model or to the data.
There are various ways to automate the handling of model drift. Some of them are listed below.
Online machine learning is a prominent way to deal with model drift because it allows us to update learners in real time, processing one sample at a time. In conventional batch learning, by contrast, the learner takes a whole batch of samples collected over time and optimizes over it in one go to find the relationship between the independent and dependent variables. Because such models are fitted with fixed parameters on a snapshot of the data stream, they have to be retrained to pick up new patterns in the data.
Because online models can learn from large data streams, they can be applied in domains where the data changes frequently, such as time series forecasting, movie or e-commerce recommender systems, spam filtering, and many more.
The above image is a representation of the basic online learning procedure: a model predicts the dependent variable for an incoming instance, and it is updated every time it is used to make a prediction, so that it is ready for the next instance. One way to perform online learning in Python is the creme library. You can find a tutorial at this link.
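The predict-then-update loop can be illustrated without any library. The sketch below does not use creme; it is a plain Python logistic regression trained by stochastic gradient descent, and the class name, method names, learning rate, and the synthetic stream (whose labelling rule flips halfway through) are all assumptions made for the example.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one sample at a time (illustrative only)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(min(z, 35.0), -35.0)  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient step on the log loss for a single (x, y) pair
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


def label(x, step):
    # Hypothetical concept change: the labelling rule flips at step 200
    return int(x > 0) if step < 200 else int(x < 0)


model = OnlineLogisticRegression(n_features=1)
inputs = [-2.0, -1.0, 1.0, 2.0]
for step in range(400):
    x = [inputs[step % 4]]
    y = label(x[0], step)
    predicted = int(model.predict_proba(x) >= 0.5)  # predict first
    model.learn_one(x, y)                           # then update with the label
print(f"weight after the stream: {model.weights[0]:.2f}")
```

Because every sample nudges the weights, the model follows the flipped rule after the change without a separate retraining job, which is the property that makes online learning attractive under drift.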
Using Azure ML, we can automatically identify model drift, especially data drift. The basic requirement is to integrate the models into the Azure ML workspace. We can then select the features on which to monitor drift. Azure ML uses several methods, mainly statistical ones, and different time windows to identify the drift.
The above image is a representation of how a dataset differs from the target dataset over the specified time period once the model is integrated with Azure ML. Since data drift is managed with Python code, this is an easy route to drift detection. A tutorial for the procedure is provided at this link. Using it, we can perform the following monitoring:
In this article, we built a basic understanding of model drift and looked at some of its types. There is always a need to reduce a model's error, and measuring drift is one way to do so; in that context we saw different techniques for measuring model drift. Finally, we discussed how drift can be monitored automatically through different approaches.
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.
Copyright Analytics India Magazine Pvt Ltd