- Involving testers in machine learning (ML) based projects is not common, but has proven extremely beneficial when done.
- If testers understand the components and structure of ML systems, they can test them better.
- Collaborating with developers to explore and execute new test ideas helps ensure the quality of ML-based systems.
- Deep product knowledge, attention to detail, and out-of-the-box testing ideas can play an important role in testing ML systems.
- Testers can add value in testing both the training and the deployment of ML models.
Testing is considered a vital aspect of the software development life cycle (SDLC), so testers are involved to assure an application's quality. This holds true for conventional software systems such as web applications, mobile applications, and web services.
But have you ever wondered about systems based on machine learning? In those cases, testing takes a backseat. And even if some testing is done, it's done mostly by the developers themselves. A tester's role is not clearly defined in such projects.
Testers usually struggle to understand ML-based systems and to explore what contributions they could make in such projects. So in this article, I'll share my journey of assuring the quality of ML-based systems as a tester, highlighting my challenges, learnings, and successes.
My transition from testing conventional systems to an ML-based system
Like most testers, I've been part of testing conventional systems involving web apps, native apps, backends, microservices, etc. In such systems, input is taken and logic is written by humans (mostly developers) to produce a deterministic output. As testers, our focus is to verify the output against the specified or implied requirements.
Interestingly, some years ago I got an opportunity to test an ML-based recommendation system. It was quite different from conventional systems, so I was excited and anxious at the same time.
In ML systems, a large amount of data containing patterns is given as input. It is fed to a model, which learns logic to discover these patterns and predict future events.
To ensure quality, it's important to test the learned logic. So I asked myself: how do I test this learning process and the logic the model has learned? Is it even possible to test this logic? Is a model a complete black box?
Having a lot of such questions in my mind, I was curious to explore and learn. I was all set for a roller-coaster ride 🙂
My first steps and learnings
In my excitement and curiosity to contribute fast, I did what most of us would do: Google for testing ideas! Most resources I found pointed to model evaluation metrics like precision, recall, the confusion matrix, etc. They felt like jargon to me. Honestly, I didn't understand them. But I took my half-baked learnings back to my developers, and they told me they were already taking these metrics into account.
I realized that the developers were well-versed with their domain. To create an impact as a tester, I needed to bring in the perspectives that they missed.
We, as testers, are blessed with a great skill set: asking the right questions, understanding the big picture, thinking out of the box, applying deep product knowledge, challenging the status quo, etc. If these skills are applied to testing ML systems, many issues can be prevented.
I also realized I was trying to test something without even understanding how the system works and what the building blocks are. But, to test any system better, this should be the first step.
I discussed this approach with the developers so we could collaborate to understand the system deeply and apply my testing skills at every phase.
The developers, who were initially skeptical of having a tester onboard, got excited after hearing this plan and were looking forward to the value I could provide.
Understanding the basics of ML systems
Together with my team, I understood that in machine learning, a huge set of data with certain patterns is collected, filtered and fed to a model. The model identifies these patterns and predicts the probability of future events.
For example, our project was to provide article recommendations for our users. So, a huge amount of data, like article interactions, user characteristics, and user behavior on the platform, was collected and fed to the model. The model then learned patterns in the data and formed rules to predict the future probability of users interacting with articles.
The process is split into two phases:
- Phase A – Learning, where the model learns and identifies patterns in the data and comes up with logic.
- Phase B – Deployment, where new unseen data is supplied to the model. The model applies the learned logic to the new data and makes predictions.
Phase A – Learning process of machine learning systems
The learning phase is where the model learns to identify the patterns from data and comes up with logic to make future predictions.
There are two types of learning processes: supervised learning and unsupervised learning. Our project had supervised learning, where the desired output samples are available in the training data itself.
The detailed learning process was as follows:
Step 1 – Data collection: The training of ML systems depends heavily on input data. A common rule of thumb is: "The more the data, the better the training". But generally there's a tendency to not pay enough attention to the quality of that data.
I would rather say, "The higher the quality of the data, the better the training".
So, as testers, we can help developers check and fix all quality dimensions, like:
- Missing values
- Incorrect values/data types
- Security issues
- Business rules
- Outlier data
- and so on
Let’s take a simple example. In our data, a column named ‘article_age’ had negative values in certain cases. This column reflects how many hours ago an article was created, and surely this shouldn’t be negative. We found that these negative values were due to a bug in timezone conversion, so we fixed it to correct the data.
All such checks should be added and integrated in the pipeline to validate data quality.
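As an illustration, such pipeline checks can be sketched in a few lines. The field names (`article_age`, `user_age`) and the rows below are invented for the example, not the project's real schema:

```python
# Hypothetical raw training rows; "article_age" is hours since publication.
rows = [
    {"article_age": 5.0, "user_age": 34},
    {"article_age": -3.0, "user_age": 27},    # negative age: e.g. a timezone bug
    {"article_age": 12.0, "user_age": None},  # missing value
]

def data_quality_issues(rows):
    """Return human-readable data-quality issues found in the raw rows."""
    issues = []
    for i, row in enumerate(rows):
        for field, value in row.items():
            if value is None:
                issues.append(f"row {i}: missing {field}")
        age = row.get("article_age")
        if age is not None and age < 0:
            issues.append(f"row {i}: article_age must be >= 0, got {age}")
    return issues

print(data_quality_issues(rows))
# ['row 1: article_age must be >= 0, got -3.0', 'row 2: missing user_age']
```

A real pipeline would fail (or quarantine the data) when this list is non-empty, instead of just printing it.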
Step 2 – Building the model features: After the data is collected and refined, the next step is to design the model features. A table with all the data is formed to feed to the model. Each column is a model feature, except one: the target value. All columns have numeric values.
For example, for our article recommender, there were features related to user specifics, like age, gender, and subscription duration; article-specific features, like time since publication, total views/impressions on the article, and total clicks on the article; features about a user's interaction with articles, like how many times a user viewed or clicked the article; and many more.
The target value reflected if the user interacted with the article or not.
To train a model properly, it’s important to select correct features.
As a tester, I helped review all the features, standardize the rules applied to them, and apply product knowledge to find bugs and enhance features. Some examples:
- While calculating the feature "views on article", unique user views were counted, meaning that if a certain user saw an article multiple times, it was only counted as one view. So I ensured that the same rule of unique counting was also applied to other features like "clicks on article", "interactions on article", etc.
- There was data available about the job industry of the user which was not a feature in our model. I challenged the status quo and got this feature added as it could be valuable for model learning (users from the same industry may tend to read the same articles).
An important role I also played was documenting the features with their descriptions, rules, etc.
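To make the feature table concrete, here is a hypothetical sketch of a single row. Every feature name and value is invented for illustration; the project's real features were, of course, different:

```python
# Hypothetical single row of the feature table; names and values are invented.
feature_row = {
    "user_age": 34,               # user-specific feature
    "subscription_days": 120,     # user-specific feature
    "article_age_hours": 6.5,     # article-specific feature
    "article_total_clicks": 48,   # article-specific feature
    "user_views_on_article": 1,   # interaction feature (unique count per user)
}
target = 1  # target value: 1 = user interacted with the article, 0 = did not

# Every feature must be numeric before it can be fed to the model.
assert all(isinstance(v, (int, float)) for v in feature_row.values())
```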
Step 3 – Training and validation of the model: After the data is collected and the features are decided, it's time to feed this data into the model and train it.
This is where the model comes up with rules to predict future target values when new data is provided to it.
Not all the data should be used to train the model. Instead, the data should be split into training data and testing data. The model is trained with the training data, and the logic the model learned is then verified against the testing data using model evaluation metrics like precision, recall, PR AUC, etc.
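Those metrics that felt like jargon at first boil down to simple ratios. A minimal sketch on made-up labels (1 = the user interacted with the article, 0 = they did not):

```python
# Made-up true labels vs. model predictions on the held-out testing data.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of the items we recommended, how many were relevant
recall = tp / (tp + fn)     # of the relevant items, how many did we recommend

print(precision, recall)  # 0.75 0.75
```

In practice, a library such as scikit-learn computes these for you; the point is that the metrics themselves are not magic.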
Some evaluation could also be done by analyzing the importance of each model feature (for e.g. using SHAP values). Testers could help evaluate the feature importance graph and ask the right questions.
For example, in our article-recommendation model, I noticed that the feature for the age of the article was ranked as the least important by the model. I was surprised and raised this as an issue: if the model learned that age is the least important attribute in the data, it could recommend old and new articles with the same probability. This would be problematic for time-relevant articles, where newly published articles are much more important than older ones. The developers found that an error during training had caused this bug, and the model was retrained.
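This kind of sanity check on feature importance can even be automated. A minimal sketch, assuming importance scores (e.g. mean absolute SHAP values) have already been computed; the feature names and numbers below are invented:

```python
# Hypothetical feature-importance scores (e.g. mean absolute SHAP values).
importance = {
    "user_age": 0.12,
    "article_total_clicks": 0.31,
    "article_age_hours": 0.02,  # suspiciously low for time-relevant content
}

# Find the feature the model considers least important.
least_important = min(importance, key=importance.get)

# A tester's sanity check: article age should matter for a news-like product.
if least_important == "article_age_hours":
    print("Warning: article age ranked least important - investigate the training run")
```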
Phase B – Deployment of (a new version of) the model
Now that the model is trained with the past data, the next step is to provide new unseen data to predict the probability of future events, which basically means deploying the new model to production.
To test the deployment of a new model, it’s important to understand its setup in production.
For example, in our case with article recommendations, when a user requested recommendations, an API request was made to get predictions from the model. On top of the model's results, some filters and re-rankers were applied. Filters were business rules that removed some of the items predicted by the model. Re-rankers were business rules that reordered some of the items predicted by the model.
In this setup, whenever we planned to deploy a newly trained model to replace the existing one, certain tests were performed beforehand to confirm that the replacement was worthwhile.
We performed a comparison analysis between the final results users would see with the old and the new version. We collected the top X items from the output of both versions and compared them on different metrics: how many items changed or stayed the same, the effect on the average article age, the effect on item diversity, etc.
Carefully analyzing these results helped us decide whether to deploy the new model or not. For example, we once noticed that with the new model, the average age of the top 10 articles shown to users increased by a huge amount, and this was a blocker for deployment because we didn't want to show older content to users when new articles were available.
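A comparison like this can be sketched in a few lines; the article IDs and ages (in hours) below are invented for illustration:

```python
# Top-5 results of the old vs. new model as (article_id, age_in_hours) pairs.
old_top = [("a1", 3), ("a2", 5), ("a3", 8), ("a4", 2), ("a5", 6)]
new_top = [("a1", 3), ("a6", 40), ("a7", 55), ("a4", 2), ("a8", 70)]

# How many items survived the model change?
overlap = len({aid for aid, _ in old_top} & {aid for aid, _ in new_top})

# Effect on the average article age.
avg_age_old = sum(age for _, age in old_top) / len(old_top)
avg_age_new = sum(age for _, age in new_top) / len(new_top)

print(overlap)                   # 2 items in common
print(avg_age_old, avg_age_new)  # 4.8 vs 34.0: a red flag before deploying
```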
Performing such tests always gave us confidence in the new model’s quality.
In addition, whenever there was a change in the rules of a filter or re-ranker, I tested the relevant change and sometimes even found crucial bugs. For instance, we wanted to add a filter that would keep only two articles by the same author and remove the rest of that author's articles. But I noticed that the filter was allowing three articles from the same author instead of two. The developers realized this bug had been introduced due to an unclear requirement and fixed it. This small fix changed the results for our users to a great extent.
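A filter like the one described can be sketched as below; the article/author data is invented, and `limit_per_author` is a hypothetical helper name, not the project's real code:

```python
def limit_per_author(articles, max_per_author=2):
    """Keep at most max_per_author articles per author, preserving rank order."""
    seen = {}
    kept = []
    for article_id, author in articles:
        seen[author] = seen.get(author, 0) + 1
        if seen[author] <= max_per_author:
            kept.append((article_id, author))
    return kept

# Invented ranked output of the model as (article_id, author) pairs.
ranked = [("a1", "alice"), ("a2", "bob"), ("a3", "alice"),
          ("a4", "alice"), ("a5", "bob")]

print(limit_per_author(ranked))
# [('a1', 'alice'), ('a2', 'bob'), ('a3', 'alice'), ('a5', 'bob')]
```

The bug described above would correspond to an off-by-one in the `<=` comparison; a couple of unit tests on such a helper catch it immediately.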
So, it’s definitely crucial to test before deploying whenever any changes are made in the setup.
Having said that, ultimately our users are our final testers. So, instead of just releasing the new model, we always ran an A/B test where the users were split into two groups: users in group A received the old version and users in group B received the new version. After letting the test run for a certain duration (a few weeks), we collected data from both groups and analyzed whether the new model, re-ranker, or filter actually performed better, based on metrics like click-through rate, scroll distance, engagement with items, etc. Based on the results, we decided if the new version was worth rolling out to everyone. This data was then also collected and used to retrain the model.
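The A/B comparison boils down to computing the same metric per group. A minimal sketch using click-through rate (CTR), with invented impression and click counts:

```python
# Invented impression/click counts for the two A/B groups.
groups = {
    "A_old_model": {"impressions": 10000, "clicks": 420},
    "B_new_model": {"impressions": 10000, "clicks": 510},
}

# Click-through rate (CTR) = clicks / impressions, per group.
ctr = {name: g["clicks"] / g["impressions"] for name, g in groups.items()}
print(ctr)  # {'A_old_model': 0.042, 'B_new_model': 0.051}
```

In a real A/B test, you would also run a statistical significance test on the difference before declaring a winner, rather than comparing raw ratios.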
My learnings from my journey
I was very skeptical about my role in an ML system project in the beginning.
But reflecting on my journey, I've learnt how big an impact testers can make. I've learnt what blunders can happen if testing is not done. I've learnt how testers can closely collaborate with developers and find problems even in ML systems.
In short, I've learnt that it's high time to change the perception that "testers cannot contribute to ML systems".
Thanks to Rahul Verma, Prabhu & Ben Linders for motivating me to write this article and thoroughly reviewing it to help bring it to its current shape.