Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. In this talk, we will present our work on "active label cleaning", a data-driven approach to prioritising samples for re-annotation. We propose to rank instances according to their estimated label correctness and labelling difficulty, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a new medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed active label cleaning enables correcting labels up to four times more effectively than typical random selection in realistic conditions, making better use of experts' valuable time for improving dataset quality.
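As an illustration only, the ranking idea can be sketched as follows. The scoring used in the actual work combines estimated label correctness and labelling difficulty; the heuristic below is a simplified assumption, scoring each sample by how improbable a trained model finds its observed label.

```python
import numpy as np

def relabelling_priority(posteriors, noisy_labels):
    """Rank samples for re-annotation: higher score = more likely mislabelled.

    posteriors: (N, C) predicted class probabilities from a trained model.
    noisy_labels: (N,) observed, possibly noisy, integer labels.

    Heuristic (an assumption for illustration, not the paper's exact scoring):
    the negative log-probability the model assigns to the observed label.
    """
    n = len(noisy_labels)
    # Probability the model assigns to each sample's observed label.
    p_observed = posteriors[np.arange(n), noisy_labels]
    # Low probability on the observed label => likely noisy => high priority.
    return -np.log(p_observed + 1e-12)

# Toy example: the model strongly disagrees with sample 1's observed label.
posteriors = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
noisy_labels = np.array([0, 0])
scores = relabelling_priority(posteriors, noisy_labels)
order = np.argsort(-scores)  # re-annotate the highest-scoring samples first
```

Under a fixed re-annotation budget, experts would then relabel samples in `order` until the budget is exhausted, instead of picking samples uniformly at random.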
Melanie Bernhardt is an Applied Researcher in the InnerEye team at Microsoft Research Cambridge (UK). Project InnerEye develops ML techniques that augment clinicians, helping them cope with the growing demand on healthcare, deliver precision medicine for better patient outcomes, and understand how medical imaging can be combined with other types of data to change the way medicine is practised today. In this talk, she will present one of the team's latest works, "Active label cleaning: Improving dataset quality under resource constraints".
Before joining MSR Cambridge two years ago, she graduated from ETH Zurich with an MSc in Data Science focused on ML for healthcare. Prior to this, she obtained a BSc in Mathematics and a postgraduate diploma in Statistics from the University of Strasbourg (France).