"80% of Machine Learning is simply data cleaning" is a commonly used phrase in the AI/ML community. Data quality plays a crucial role in Machine Learning (ML) models’ performance. It can be the difference between the success and failure of a AI/ML project.
While ML models may perform well in the training environment, most under-perform massively in the real world. To mitigate this problem, datasets need to be carefully curated and labeled so that they cover a variety of scenarios and the labels are consistent and accurate.
But how do we accurately measure the quality of data? In fact, how do we even define a proper quality measure in this scenario?
Accuracy measures how close the data labels are to ground truth, or how well the labeled features in the data match real-world conditions. While it's important that your data is accurate, it's also important that the labels are consistent throughout the dataset. Inconsistent labeling is one of the most common reasons behind under-performing ML models.
In order to avoid inconsistency among labels and to make sure all stakeholders are on the same page on the actual requirements of a project, it is important to properly write an "Annotation Guide". This document precisely defines the labels involved in the project and provides instructions on how to deal with confusing scenarios.
At Mindkosh, we always make sure that the annotation guide is properly detailed by going through a mock labeling session on a sample of data. The idea is to catch these issues early in the project's lifecycle, which can save a lot of time later on. We also make sure there are open lines of communication between all stakeholders, so that any doubts/problems can be immediately resolved.
Accuracy Measures for training data
There are a few different accuracy measures that can be used to keep track of the quality of data. From our experience here at Mindkosh, the following work the best for a variety of use-cases.
Consensus

Consensus measures the percentage of agreement between multiple annotators. The idea is to find, for each item, the label that most annotators agree on. An auditor typically arbitrates any disagreement among the overlapping judgments. Consensus can be measured by assigning a fixed number of reviewers per data point, or by assigning a portion of the whole dataset to more than one labeler.
To calculate a consensus score, divide the number of agreeing labels by the total number of labels for each data point.
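As a minimal sketch of that calculation (the function name and example labels are hypothetical, not from any particular annotation tool):

```python
from collections import Counter

def consensus_score(labels):
    """Fraction of annotators that agree with the majority label for one data point."""
    # most_common(1) returns the majority label and how many annotators chose it
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

# Three of four annotators agree on "car", so the consensus score is 0.75
print(consensus_score(["car", "car", "truck", "car"]))  # 0.75
```

Averaging this score across all data points gives an overall consensus measure for the batch.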
Honeypot

If you have some portion of the data already labeled, and you believe it conforms to your data quality standards, a honeypot can be a good way to measure the average quality of the dataset.
To set up a honeypot, include the already-labeled data with the data you send for annotation, without alerting the labelers to which data points are already labeled. You then match the new labels against the ones you already have to arrive at an approximate measure of the data quality. Of all the measures, this is the easiest to set up, and can be done completely on your end rather than the labelers' end.
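A simple sketch of that matching step, assuming labels are stored as dicts keyed by item id (the ids and labels below are made up for illustration):

```python
def honeypot_accuracy(gold, returned):
    """Score returned labels against trusted 'gold' labels on planted honeypot items.

    gold and returned map item id -> label; only ids present in gold
    (the honeypot items) are scored.
    """
    matches = sum(1 for item_id, label in gold.items()
                  if returned.get(item_id) == label)
    return matches / len(gold)

gold = {"img_01": "cat", "img_07": "dog", "img_12": "cat"}
returned = {"img_01": "cat", "img_03": "dog", "img_07": "cat", "img_12": "cat"}

# 2 of the 3 planted items were labeled correctly
print(round(honeypot_accuracy(gold, returned), 2))  # 0.67
```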
Expert review

This method is based on a review of label accuracy by a domain expert. Usually the review is done by randomly checking a small portion of the total data for mislabels.
At Mindkosh, we check 5% of each batch of data. If the accuracy does not meet an agreed-upon threshold, the entire batch is re-labeled free of cost.
Cronbach's alpha

Cronbach's alpha is a measure of internal consistency, that is, how closely related a set of items is as a group. It can be written as a function of the number of test items and the average inter-correlation among the items.
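Concretely, the standard formula is alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. A minimal sketch (the helper and example scores are hypothetical):

```python
def cronbach_alpha(item_scores):
    """item_scores: list of k lists, each holding one item's scores
    across the same n data points."""
    k = len(item_scores)
    n = len(item_scores[0])

    def variance(xs):
        # Sample variance (divides by n - 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # Total score per data point, summed across items
    totals = [sum(item[j] for item in item_scores) for j in range(n)]

    item_var_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

scores = [[1, 2, 3, 4],
          [2, 3, 4, 5],
          [1, 3, 3, 5]]
print(round(cronbach_alpha(scores), 3))  # 0.981
```

Values closer to 1 indicate the items move together, i.e. higher internal consistency.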
In order to keep track of the quality of your data in a systematic way, it is important that proper QA processes are put in place with your data annotation vendor. You should always inquire about how quality is measured and what happens if the measured quality does not meet an agreed-upon threshold.
At Mindkosh, a typical workflow involves each data point going through at least two human annotators - first for labeling, and then for review. In addition, a small portion (usually 5%) of the data is subjected to a quality check by a domain expert who establishes the quality of the batch of data.