
Image Labeling & NLP


Below is a list of active and ongoing projects from our lab group members. To learn more, click on the project links or reach out to us via email.


Cross-Modal Weak Supervision: Leveraging Text Data at Training Time to Train Image Classifiers More Efficiently

Arguably the largest development bottleneck in machine learning today is obtaining labeled training data. One promising direction is the use of weaker supervision that is noisier and lower-quality, but can be provided more efficiently and at a higher level by domain experts and then denoised automatically. In one current project, Snorkel, users write labeling functions that express heuristics for generating noisy labels. These labeling functions are often easy to write over text, but less so over images. However, in many important cases we have both images and text available at training time: for example, in radiology applications we want to train an image classifier, but unstructured text reports are also available during training. In this project, we are exploring how this text data can be used to more easily provide weak supervision for the end image model.

Link to Snorkel project page
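
As a rough illustration of the idea (not the project's actual code), the sketch below shows hypothetical labeling functions written over paired radiology report text, whose combined votes produce noisy labels for training an image classifier. The function names, keywords, and the simple majority-vote combination are illustrative assumptions; Snorkel instead learns the accuracies of the labeling functions to denoise their votes.

    # A minimal sketch of cross-modal weak supervision, assuming a hypothetical
    # dataset of (image, report_text) pairs. Plain Python is used here rather
    # than the Snorkel API; keywords and thresholds are illustrative only.
    ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1

    def lf_mentions_fracture(report: str) -> int:
        # Vote ABNORMAL if the report text mentions a fracture.
        return ABNORMAL if "fracture" in report.lower() else ABSTAIN

    def lf_no_acute_findings(report: str) -> int:
        # Vote NORMAL if the report uses a common "no acute findings" phrase.
        return NORMAL if "no acute" in report.lower() else ABSTAIN

    LFS = [lf_mentions_fracture, lf_no_acute_findings]

    def weak_label(report: str) -> int:
        # Combine votes by simple majority; Snorkel instead models the
        # accuracies and correlations of the labeling functions to denoise them.
        votes = [v for v in (lf(report) for lf in LFS) if v != ABSTAIN]
        return max(set(votes), key=votes.count) if votes else ABSTAIN

    # The resulting noisy labels supervise the image model, even though no
    # labeling function ever looks at pixels:
    # training_set = [(image, weak_label(text)) for image, text in pairs
    #                 if weak_label(text) != ABSTAIN]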


Coral: Inferring Generative Model Structure with Static Analysis

In our work, we focus on the problem of gathering enough labeled training data for machine learning models, especially deep learning models. We build on the Snorkel model, in which users write labeling functions to noisily label training data. We explore how we can use weak supervision for non-text domains, like video and images. To facilitate writing heuristics for such data, we introduce the idea of domain-specific primitives: interpretable characteristics of the data that are easy to write heuristics over. For natural images, these include bounding-box and segmentation attributes, like location and size. In the medical domain, they can include characteristics like the area, intensity, or more complex shape- and edge-based features of a region of interest (ROI).
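
To make the notion of primitives concrete, here is a minimal, hypothetical sketch: a region of interest is reduced to a few interpretable numbers, and heuristic functions vote over those numbers rather than raw pixels. The primitive names, definitions, and thresholds are assumptions for illustration, not the features used in the paper.

    import numpy as np

    ABSTAIN, NONAGGRESSIVE, AGGRESSIVE = -1, 0, 1

    def extract_primitives(roi_mask: np.ndarray, image: np.ndarray) -> dict:
        # Reduce an ROI (boolean mask over a grayscale image) to a few
        # interpretable numbers; these particular primitives are hypothetical.
        area = float(roi_mask.sum())
        ys, xs = np.nonzero(roi_mask)
        bbox_area = float((ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1))
        return {
            "area": area,
            "mean_intensity": float(image[roi_mask].mean()),
            "extent": area / bbox_area,  # fraction of the bounding box the ROI fills
        }

    def hf_large_lesion(p: dict) -> int:
        # Illustrative heuristic: very large lesions vote "aggressive".
        return AGGRESSIVE if p["area"] > 5000 else ABSTAIN

    def hf_compact_lesion(p: dict) -> int:
        # Illustrative heuristic: compact, well-filled ROIs vote "nonaggressive".
        return NONAGGRESSIVE if p["extent"] > 0.8 else ABSTAIN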

A side effect of writing heuristics over this small set of primitives is that the primitives tend to be shared extensively across the heuristics. Because these heuristics and primitives are developed programmatically, we can find correlations among them using a combination of static and statistical analysis. In the work described in the paper below, we applied this combination of domain-specific primitives and labeling functions to bone tumor X-rays, labeling large amounts of unlabeled data as showing an aggressive or nonaggressive tumor. Using a large amount of unlabeled data and modeling correlations among the heuristics allows us to outperform a model trained on a small dataset with ground-truth labels. In ongoing work, we hope to make the process of developing these labeling functions even easier by closing the semantic gap between the quantitative nature of the primitives and the terms domain experts use to describe their data.
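
The static-analysis step can be pictured as inspecting each heuristic's source code to see which primitives it reads; heuristics that read a common primitive are flagged as potentially correlated, so the label model can account for the dependency. The sketch below applies Python's ast and inspect modules to the hypothetical heuristics above and is not Coral's actual implementation.

    import ast
    import inspect
    from itertools import combinations

    def primitives_used(heuristic) -> set:
        # Walk the function's AST and collect subscripts like p["area"] to find
        # which primitives it reads (a rough stand-in for Coral's static analysis).
        tree = ast.parse(inspect.getsource(heuristic))
        return {
            node.slice.value
            for node in ast.walk(tree)
            if isinstance(node, ast.Subscript) and isinstance(node.slice, ast.Constant)
        }

    def correlated_pairs(heuristics) -> list:
        # Heuristics sharing a primitive are treated as potentially correlated.
        deps = {h.__name__: primitives_used(h) for h in heuristics}
        return [(a, b) for a, b in combinations(deps, 2) if deps[a] & deps[b]]

    # correlated_pairs([hf_large_lesion, hf_compact_lesion]) returns [] here,
    # since they read disjoint primitives ("area" vs. "extent").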

Read Paper (NIPS 2017)


Other Projects

  • Predicting the outcome of imaging examinations with deep learning analysis of structured and unstructured EMR data
  • Convolutional neural networks in radiology text analysis - comparing CNNs, LSTMs, and newer models to rule-based systems
  • Investigating scalable automated image examination label extraction from radiology reports with deep learning techniques
  • Using deep learning to label pathology images using the medical record
  • Large scale EMR phenotyping using deep learning