AIMI Journal Club: Navigating the Pitfalls of Applying Machine Learning in Genomics - Katherine Pollard, PhD
The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.
Dr. Katherine S. Pollard is Director of the Gladstone Institute of Data Science & Biotechnology, Investigator at the Chan Zuckerberg Biohub, and Professor in the Department of Epidemiology & Biostatistics and Bioinformatics Graduate Program at UCSF. Her lab develops statistical models and open-source bioinformatics software for the analysis of massive genomic datasets. Previously, Dr. Pollard was an assistant professor in the Genome Center and Department of Statistics at the University of California, Davis. She earned her PhD in Biostatistics from the University of California, Berkeley and was a comparative genomics postdoctoral fellow at the University of California, Santa Cruz. She was awarded the Thomas J. Watson Fellowship and the Sloan Research Fellowship, and was named Alumna of the Year by UC Berkeley. She is a Fellow of the International Society for Computational Biology, the American Institute for Medical and Biological Engineering, and the California Academy of Sciences.