An article recently published in Nature proposes a new way to evaluate data quality for artificial intelligence used in healthcare.
Several documentation efforts and frameworks already exist to evaluate AI models, such as FactSheets, Model Cards and Dataset Nutrition Labels. However, the authors write that none comprehensively assess the content of data sets and their suitability for use in ML.
The German researchers Daniel Schwabe, Katinka Becker, Martin Seyferth, Andreas Klaß and Tobias Schaeffter sought to identify which characteristics should be used to evaluate data quality for trustworthy AI in medicine. The factors can also help explain why a model behaves a certain way.
The authors developed the METRIC framework, a specialized data quality framework for medical training data. It has five categories and 15 sub-dimensions through which researchers and healthcare entities can evaluate whether their data are fit for the task at hand.
The five categories are measurement process, timeliness, representativeness, informativeness and consistency. Together they assess how appropriate a data set is for a specific use case.
The researchers note that developers should familiarize themselves with the aspects of the framework and begin to use them to evaluate their data. They say more work needs to be done to establish quantitative and qualitative measures for each dimension.
The researchers performed a literature review using Web of Science, PubMed and ACM Digital Library and found 120 papers that met their criteria. Within those papers, they found over 450 data quality measures for healthcare data.
The authors distilled the terms down to 15 considerations, or “dimensions,” that they say healthcare entities should use to determine the quality of their data.
“Data quality plays a decisive role in the creation of trustworthy AI and assessing the quality of a data set is of utmost importance to AI developers, as well as regulators and notified bodies,” the study says.
The first of the five categories, measurement process, assesses uncertainty during data collection. It takes account of missing data, device error, human error and noisy labeling. The measurement process category considers the accuracy and precision of the data and the level of noise in the training data compared to the expected noise in the data after AI deployment.
“The entire cluster measurement process is crucial for data quality evaluation in the medical field since errors may propagate through the ML model and lead to false diagnosis or treatment of patients,” the study says.
Timeliness relates to when the data were collected and updated, and whether they align with current standards such as indications for diagnosis and current medical coding practices.
The third category is representativeness, or the extent to which the data represent the targeted population.
Representativeness includes the demographic mix of the data, the variety of the data sources, the depth of the data and the target class balance. The paper says target class balance is an especially important factor in machine learning so the algorithm can learn patterns for specific classes from the training data. For example, model developers may have to overrepresent rare diseases to maintain balance in the class ratio.
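As a rough illustration of what checking target class balance could look like in practice, the sketch below counts labels and reports each class's share along with the ratio between the largest and smallest class. It is not taken from the paper; the function name `class_balance` and the example labels are hypothetical.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the labels and the max/min class ratio.

    Hypothetical helper for illustration; not part of the METRIC framework.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}
    # Ratio of the most common class to the rarest one (1.0 means balanced).
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio


# Made-up example: 95 healthy cases and 5 cases of a rare disease.
labels = ["healthy"] * 95 + ["rare_disease"] * 5
shares, ratio = class_balance(labels)
print(shares)
print(ratio)
```

A high ratio like the one in this made-up example is the kind of signal that might prompt a developer to oversample the rare class before training.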
The informativeness of the data set considers whether the data provide clear information. Factors that influence informativeness include the understandability of the data, the reduction of duplicate or redundant records, and whether the patterns of missing values provide additional information.
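Two of these factors lend themselves to simple checks. The sketch below, which is an illustration rather than anything specified in the paper, measures the fraction of exact duplicate records and compares how often a field is missing for each outcome label. The field names (`lactate`, `sepsis`) and the records are invented for the example.

```python
def duplicate_fraction(records):
    """Fraction of records that exactly repeat an earlier record."""
    seen = set()
    duplicates = 0
    for rec in records:
        # Sort items so that key order in the dict does not matter.
        key = tuple(sorted(rec.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records) if records else 0.0


def missing_rate_by_label(records, field, label_field):
    """Share of records with `field` missing (None), split by label value."""
    totals, missing = {}, {}
    for rec in records:
        label = rec[label_field]
        totals[label] = totals.get(label, 0) + 1
        if rec.get(field) is None:
            missing[label] = missing.get(label, 0) + 1
    return {label: missing.get(label, 0) / totals[label] for label in totals}


# Invented records; None marks a value that was never measured.
records = [
    {"age": 54, "lactate": 2.1, "sepsis": True},
    {"age": 54, "lactate": 2.1, "sepsis": True},   # exact duplicate
    {"age": 61, "lactate": None, "sepsis": False},
    {"age": 47, "lactate": None, "sepsis": False},
    {"age": 70, "lactate": 3.4, "sepsis": True},
    {"age": 38, "lactate": 1.0, "sepsis": False},
]
print(duplicate_fraction(records))
print(missing_rate_by_label(records, "lactate", "sepsis"))
```

If a value is missing far more often for one outcome than another, as in this toy data, the missingness pattern itself carries information that a model could pick up.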
The last category in the METRIC framework is consistency. The framework considers the rule-based consistency, logical consistency and distribution consistency of the data.
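To make the idea of rule-based and logical consistency concrete, here is a minimal sketch that runs a record through a few named rules and lists the ones it violates. The rules, thresholds and field names are assumptions chosen for illustration; the paper does not prescribe them.

```python
from datetime import date

# Hypothetical rules; each returns True when the record is consistent.
RULES = {
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
    "systolic_above_diastolic": lambda r: r["systolic"] > r["diastolic"],
    "discharge_not_before_admission": lambda r: r["discharged"] >= r["admitted"],
}


def violations(record):
    """Return the names of all rules the record fails."""
    return [name for name, rule in RULES.items() if not rule(record)]


# Invented record with swapped blood pressure values.
record = {
    "age": 67,
    "systolic": 80,
    "diastolic": 120,
    "admitted": date(2024, 3, 1),
    "discharged": date(2024, 3, 4),
}
print(violations(record))
```

Counting how many records fail each rule gives a simple per-rule error rate. Distribution consistency would need a different kind of check, such as comparing value distributions across sites or time periods.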
“With training data being the basis for almost all medical AI applications, the assessment of its quality gains more and more attention,” the study says. “However, we note that providing a division of the term data quality into data quality dimensions is only the first step on the way to overall data quality assessment.”