Algorithmic Adventures in Medicine: My Summer 2023 AIMI Internship Experience
By: Shreeya Chand
Hi! My name is Shreeya Chand and I’m a high school senior from Maryland, interested in using data science and machine learning for social good. I’m also a singer, GeoGuessr player, and kombucha enthusiast. This past summer, I interned at the Stanford Center for Artificial Intelligence in Medicine and Imaging and had the incredible opportunity to learn from trailblazers in AI and healthcare, work on a hands-on research project, and meet other amazing students, especially my team of fellow high schoolers: Maryam Solaiman, Yui Leung, Isabella Zarar, Sahasra Kotarkonda, as well as our team lead, Kameron Gano, a student at UC San Diego.
A typical day in the internship consisted of a lecture from our mentors about a subject in AI, a team breakout session to collaborate and apply the lecture’s content to the project, and a “Lunch & Learn” with a speaker using AI in the healthcare sector, where we’d have the opportunity to ask them questions about their work and new developments in the field.
Our task over the course of the two-week experience was to train convolutional neural networks (CNNs) to determine the probability that an endotracheal tube was positioned correctly (classification) and determine the distance between the carina and the tube (regression) as accurately as possible, given a chest X-ray. Quickly determining the placement of the tube is crucial because if the tube is too close to or too far from the carina, this can lead to serious complications for the patient.
Our project would also involve natural language processing (NLP), since the distances were not explicitly given and we had to extract them from the associated radiology reports.
Sounds complicated, right? Luckily, what initially seemed a daunting task became more approachable when I met my team. As it turned out, we were working from four different time zones! Kameron, Isabella, and Maryam were in California, Sahasra was in Chicago, Yui was in Hong Kong, and I was in Maryland. I would sometimes wake up and see messages delivered at 3 am, sending me into a brief state of shock before I remembered this. Tasked with creating a team name, we decided to call ourselves the “pAIn killers,” blending the internship’s themes of AI and medicine. It gave everyone a good laugh but also reflected our hope to use AI to make life easier for people.
On day two, Dr. Eduardo Reis, one of our mentors, introduced us to chest X-rays, including how they are taken and how to locate the carina and endotracheal tube on the image. Even though many of us had little previous exposure to radiology, it was incredibly interesting to hear about the real-life applications of electrochemistry and electron spectroscopy, which we had learned about in chemistry classes (and never thought would be used again)! Learning how X-rays are taken and interpreted also provided important context about the models we’d be training. For example, we learned that chest X-rays can be taken anterior-posterior (AP) or posterior-anterior (PA), depending on which direction the X-rays pass through the body. PA is a more reliable view, but certain factors, like a patient’s inability to get out of bed, can necessitate taking an AP X-ray. We realized that this could introduce problems in training the models, because they could “overfit,” or become too specific to the PA-dominated training images, and make poor predictions about AP images, much like a dog that frames all of its interactions with trash cans around a single data point.
We used the MD.ai platform to practice labeling the carina and endotracheal tube on chest X-ray images. It was really challenging! While that was partly because we are not radiologists, it also speaks to the “superhuman” abilities of AI to pick out patterns that humans can’t easily see. Researchers from Emory University recently even created a model for determining a patient’s race from their chest X-ray!
Thankfully, we didn’t have to label each chest X-ray image from the MIMIC dataset, which includes thousands of chest X-ray images and radiology reports written by clinicians during routine clinical care (though some teams actually did some of their own labeling to train their models, for which I applaud them). Instead, we’d be using the radiology reports associated with each X-ray.
Unfortunately, not all of the reports were formatted the same way, so we couldn’t do basic parsing to extract the distance from each report.
At a Lunch & Learn, we heard from Katie Link, the lead of machine learning for healthcare and biomedical applications at HuggingFace, who discussed the machine learning platform’s role in democratizing ML by hosting many open source models and datasets. We were able to search HuggingFace for models we could use to extract the distances from the radiology reports.
Large language models split their input and output text into “tokens” which are generally words or significant parts of words. You can see for yourself how text is tokenized in ChatGPT using OpenAI's tokenizer. Tokens are often assigned labels through token classification models.
We found d4data/biomedical-ner-all, a token classification model that performs named entity recognition (that is, it categorizes “entities”) specifically for biomedical text, and we used the entities it labeled as “Distance.” Radiology reports encapsulate a lot of information, so it can be challenging for an algorithm to decipher and interpret them correctly. Although our approach was mostly effective, it did not work when other measurements were given in the radiology report: we had no way to differentiate between them, because we needed the surrounding context of each to determine which to return. Katie Link had stressed to us the importance of failure, and how it drives success. We truly saw that here when our named entity recognition approach to the NLP component fell short. We benefited from the trial and error, as it gave us more perspective on our project and pointed us in the direction of a more successful approach. During his Lunch & Learn, Dr. Matt Lungren also emphasized that obstacles in our educational journey are not insurmountable.
For our next approach to NLP, we used question-answering models from HuggingFace that we could specifically ask to find the measurement we needed. We could pass a question like, “What is the distance between the endotracheal tube and the carina?” along with the report text to the model. This approach did not result in 100% accuracy, but incorrect labels were infrequent. On top of this question-answering approach, we also did more processing with Python and regex to standardize distances in different units. For the classification model, we further categorized each distance as normal or abnormal. Ultimately, our NLP step resulted in a loss of 600-700 data points (due to there being no distance in the report or the model being unable to find it). With the original token classification model, however, we would have lost double that amount. NLP was very time-consuming, but with Kameron’s code that allowed us to save our file every ~1000 data points and his advice to run our code locally, we saved valuable time for model training.
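The unit standardization and normal/abnormal labeling described above could look roughly like the sketch below. It is only an illustration: the regular expression, the function names, and especially the 3 to 7 cm "normal" range are assumptions made for this example, not the actual code or cutoffs used in the project.

```python
import re

# Matches a number followed by a unit, e.g. "4.5 cm", "35mm", "3 centimeters".
# (Illustrative pattern; real reports may need more cases.)
DISTANCE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(cm|centimeters?|mm|millimeters?)\b",
    re.IGNORECASE,
)

# Hypothetical cutoffs, chosen only to make the example concrete.
NORMAL_RANGE_CM = (3.0, 7.0)


def to_centimeters(answer):
    """Return the first distance found in `answer`, in cm, or None if absent."""
    match = DISTANCE_PATTERN.search(answer)
    if match is None:
        return None  # this data point would be dropped
    value = float(match.group(1))
    unit = match.group(2).lower()
    if unit.startswith("m"):  # "mm" or "millimeter(s)"
        value /= 10.0
    return value


def label_distance(distance_cm, normal_range=NORMAL_RANGE_CM):
    """Categorize a tube-to-carina distance as 'normal' or 'abnormal'."""
    low, high = normal_range
    return "normal" if low <= distance_cm <= high else "abnormal"
```

For example, `to_centimeters("35mm")` and `to_centimeters("3.5 cm above the carina")` both return `3.5`, so answers given in different units end up on the same scale before being labeled.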
Next, we created Datasets and DataLoaders in PyTorch. We randomly split the available data into a training subset (80%) and a validation subset (20%). After each epoch, or one complete pass of the training data through the training loop, the models were evaluated using the validation data. After training, each team submitted their models’ predictions on the testing data to the AIMI summer program leaderboard, where we could see the evaluation and ranking of our models in real time.
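To make the 80/20 split concrete, here is a minimal sketch in plain Python. The project itself used PyTorch Datasets and DataLoaders; the function name `train_val_split` and the fixed seed are assumptions for this illustration only.

```python
import random


def train_val_split(items, val_fraction=0.2, seed=0):
    """Shuffle `items` and split them into (training, validation) lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Example: 100 hypothetical image IDs become 80 training and 20 validation IDs.
train_ids, val_ids = train_val_split(range(100))
```

Because the split is random, every example has the same chance of landing in the validation subset, which keeps the validation data representative of the training data.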
We loaded in pretrained model architectures that we researched and found to be suitable for our task. This way, we would be using tried and true models for imaging tasks and wouldn’t have to train them as much to specialize for the task at hand, saving valuable time and computing resources. This method is known as “transfer learning”. We also “froze” the early convolutional layers since they were already trained to pick out just the basic features like edges, whereas later layers would be trained to pick out more specific features. Model training did not change the weights of these “frozen” layers, saving even more training time.
We used the ResNet-50 model for classification and the AlexNet model for regression. In both tasks, we wanted a single output, so we modified the last layer of the model to have just one output node. We also modified the output activation function for each model: a sigmoid function to ensure probabilities between 0 and 1 for classification, and ReLU to ensure nonnegative distances for regression.
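The two output activations are simple functions, and a small plain-Python sketch shows what each one does to the raw number coming out of the final node. This is an illustration of the math only, not the PyTorch layers used in the project.

```python
import math


def sigmoid(x):
    """Squash any real number into a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x):
    """Pass positive values through unchanged and clamp negatives to zero."""
    return max(0.0, x)
```

A raw classification output of 0 maps to a probability of 0.5, while a raw regression output of -2.3 is clamped to 0.0, since a negative distance would not make sense.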
For classification, we used a method called “ensembling,” which is essentially averaging the predictions of multiple models to get more accurate ones. The idea is that the different models may have been trained to recognize different features so combined, they “put their heads together” to make more informed predictions.
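As a rough sketch of the averaging step, assuming each model has already produced one probability per image (the function name and the numbers below are made up for illustration):

```python
def ensemble_average(predictions_per_model):
    """Average the predictions of several models, image by image.

    `predictions_per_model` holds one list per model, and every list
    contains that model's predicted probability for each image.
    """
    n_models = len(predictions_per_model)
    # zip(*...) groups together the predictions made for the same image.
    return [sum(per_image) / n_models for per_image in zip(*predictions_per_model)]


# Two hypothetical models scoring the same two X-rays.
averaged = ensemble_average([[1.0, 0.0], [0.5, 0.5]])
```

Here the first image ends up with an averaged probability of 0.75 and the second with 0.25, so a model that is unsure about an image is pulled toward the more confident one.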
Our first ensemble model had a decent accuracy rate on both the validation and testing data, but we decided we could do even better. For the next ensemble model, we trained one ResNet-50 model that was pretrained on a chest X-ray dataset called ChestX-ray14. While the original model wasn’t trained to do our specific task, its weights were still immensely useful to us because it had likely been trained to pick out key features on chest X-rays.
The classification model submissions were ranked using the AUROC (area under the ROC curve) metric. The ROC curve plots the true positive rate against the false positive rate across all decision thresholds, so the area under it lets the model be evaluated irrespective of an arbitrarily chosen decision threshold, the choice of which often depends on the clinical context and balance of positive and negative classes in the data. Our second ensemble model performed even better than the first, and got us to second place out of the five teams.
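One way to build intuition for AUROC is that it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; it is an illustration with made-up labels and scores, not the leaderboard's actual scoring code.

```python
def auroc(labels, scores):
    """Compute AUROC by comparing every positive score with every negative score."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ranked correctly.
example = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

A score of 1.0 means every positive outranks every negative, and 0.5 is what random guessing would achieve. No decision threshold appears anywhere in the calculation.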
To evaluate the performance of the regression model during training, we used L1 loss (Mean Absolute Error) since that was the metric our predictions on the testing data would be evaluated on and it was an intuitive way to gauge how good our predictions were (i.e. the MAE is how many cm the predictions are off by on average). For our first submission, our MAE was 1.36, meaning our predictions were only off by 1.36 cm on average!
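The MAE calculation itself is short enough to write out. The sketch below uses invented distances purely to show the arithmetic:

```python
def mean_absolute_error(predicted, actual):
    """Average size of the prediction errors, in the same units as the inputs."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical predictions (cm) against hypothetical true distances (cm).
# The errors are 1, 1, and 2 cm, so the MAE is 4/3, or about 1.33 cm.
mae = mean_absolute_error([4.0, 6.0, 3.0], [5.0, 5.0, 5.0])
```

Because each error is taken as an absolute value, overestimates and underestimates cannot cancel each other out.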
To understand the model’s training progress, we graphed the loss epoch by epoch and analyzed the trends in the validation and training losses. We noticed that they started to diverge after a while, with validation staying about the same and training descending rapidly. This indicated that the model was overfitting slightly to the training data, with a widening gap between the validation and training performance.
To improve model performance, we used saved model weights from an epoch that had a low validation loss and would be better at generalizing for the testing data. The new submission to the leaderboard was only slightly better with an MAE of 1.33, but hey, improvement is improvement! We ended up getting second place for regression as well.
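Choosing which saved weights to use can be as simple as finding the epoch with the lowest validation loss. The sketch below assumes the validation loss was recorded after every epoch; the loss values are invented for illustration and are not the project's actual numbers.

```python
def best_epoch(val_losses):
    """Return the index of the epoch with the lowest validation loss."""
    if not val_losses:
        raise ValueError("need at least one recorded validation loss")
    return min(range(len(val_losses)), key=lambda epoch: val_losses[epoch])


# Hypothetical losses: validation improves, then drifts upward as overfitting sets in.
recorded = [1.90, 1.50, 1.40, 1.45, 1.60]
chosen = best_epoch(recorded)  # epoch index 2
```

The weights saved at that epoch would then be loaded for making predictions on the testing data, instead of the weights from the final epoch.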
Lunch & Learn speaker Dr. Mirabela Rusu, Assistant Professor of Radiology and Urology at Stanford University, discussed the importance of looking at the bigger picture of our projects and always keeping the next step or project in view. That led us to think about how the next step of such a project may be to get healthcare institutions to trust and adopt our model to make their imaging analyses more efficient. Throughout the internship, we discussed how AI is often seen as a “black box”: we feed in our inputs, get outputs, and what is in between is simply magic that we can’t understand. In the Lunch & Learn with Dr. Curt Langlotz, director of the AIMI Center, we learned about a technique known as Grad-CAM that “highlights” the “important” parts of the image. Grad-CAM uses gradients, which measure how the output changes with respect to a convolutional layer’s feature maps. These gradients, combined with the associated feature maps, can be visualized to determine which features impact the output most.
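The way Grad-CAM combines gradients with feature maps can be sketched in plain Python. In practice the feature maps and gradients come from a deep learning framework. Here they are tiny hand-written grids, and the function name and all values are assumptions for illustration.

```python
def grad_cam(feature_maps, gradients):
    """Combine feature maps and their gradients into one heatmap.

    Both arguments are lists with one H x W grid (nested lists) per channel.
    """
    if not feature_maps or len(feature_maps) != len(gradients):
        raise ValueError("need one gradient grid per feature map")
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for fmap, grad in zip(feature_maps, gradients):
        # Channel weight: the gradient averaged over every position in the grid.
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU keeps only the regions that push the output upward.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Two toy 2x2 channels: the first raises the output, the second lowers it.
maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
grads = [[[2.0, 2.0], [2.0, 2.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heat = grad_cam(maps, grads)  # [[2.0, 0.0], [0.0, 0.0]]
```

Only the top-left cell lights up, because it belongs to the channel whose gradient is positive. The resulting grid is what gets resized and overlaid on the X-ray as a heatmap.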
We used the last two days of the internship to focus on leveraging this technique and see if we could get any insights about our models’ inner workings. After a LOT of debugging, we got some viable visualizations! They weren’t perfect, since they only gave insight into the specific layer(s) they were run on, but it was still cool to see how the model had some degree of logic in making its predictions, yet also how there was an element of randomness. During the internship, we often discussed barriers to the adoption of AI technologies, which requires human trust. The explainability that Grad-CAM provides seems to me to be a good initial step in building this trust.
Another important element of this internship was effective scientific communication, stressed by Dr. Rusu, who discussed the importance of making results more accessible to non-experts. We had all worked on the same project, but how well could we communicate our process? What visualizations and explanations were needed to make our work understandable by not only our experienced mentors but also our families watching, who did not know nearly as much about our project? These were the questions we grappled with as we got ready to present our work at the final ceremony, where we had just 12-15 minutes to discuss our project work, and 5 minutes to answer questions from mentors.
Our hard work paid off, and we won first place for our presentation!
Overall, the internship was an incredible experience. It amplified our interest in artificial intelligence for medicine by allowing us to experience how rewarding it is to work at the intersection of the two fields. From Dr. Rusu, we learned that it is okay to “pivot” often in your career, inspiring us to pursue more of our academic interests without feeling compelled to go down a singular path. From Katie Link, we gained an understanding of the difference between research and engineering - and how we can pursue our interest in both by being a research engineer. Hearing from AI pioneers like Drs. Daniel Yang, Curt Langlotz, and Matt Lungren, who are both physicians and AI experts, through the Lunch & Learns showed us a new side to being an engineer, one where we can interact directly with the patients whom our work benefits. We learned about AI holistically - not just about how a model is trained and evaluated, but also the human side to it, where we think consciously about the broader implications of our code for people around the world.
Finally, it was wonderful to meet so many amazing students and professionals similarly passionate about advancing healthcare using artificial intelligence. Even though the experience was virtual, we were a community working toward the same goals.
All of our growth and success would not have been possible without our amazing student lead, Kameron, who gave us invaluable guidance and motivated us throughout, and my fellow teammates, Isabella, Maryam, Sahasra, and Yui, who each brought their unique experiences and abilities to our team and made us immensely successful!
A huge thank-you also goes to Alaa Youssef, Johanna Kim, Michelle Phung, and Jacqueline Thomas for coordinating the internship as well as Eduardo Pontes Reis, Jean-Benoit Delbrouck, Louis Blankemeier, and Maya Varma for supporting our educational journey throughout the experience. This has been an incredible start to my journey in AI for medicine and imaging and I can’t wait to continue to explore this field!