The late Stephen Hawking said that each equation in a popular science book halves its sales. Today we’re looking at Context-Guided Diffusion for Out-of-Distribution Molecular and Protein Design with the additional challenge of expressing mathematical concepts in plain English.
This paper addresses an important problem in the application of AI in science: the scarcity of experimental data and the unreliable predictions of models outside their training distribution. It also discusses a key challenge in molecular design: scientists are eager to design molecules—i.e., generate samples from a generative model—that maximise a specific property for which some data exists, though typically only within the "already explored" regions of space.
Empirically, the authors observe that when generated samples are far from the training distribution, the guidance models that predict a sample's properties have a propensity to suggest high (promising) values with undeserved confidence. To quote: “[for poorly understood regions of the input domain] overconfident guidance signals risk steering the generative process toward false-positive regions of chemical or protein sequence space.”
Unconditional Diffusion Model Explanation
The diffusion-based approach aims to learn how to generate a sample that resembles one from a training dataset, or more correctly, one that came from the same statistical distribution as the training data. Let's use images as an understandable example. Assume pixel values range from 0.0 to 1.0, and we accept T = ~1000 steps as a reasonable heuristic:
Create T images where the image at step 0 is the original, and the image at step T looks as if it were drawn from pure Gaussian noise.
At each step t in [0, T], every pixel value of the next iteration of the image is pulled away from its original value towards the mean (0.5, akin to the grey of a November London sky).
A tiny bit of noise is added to every pixel at each of the T steps.
Once the training data is prepared, starting with N training images, we now have N*T image pairs. We "simply" train a network to predict image pixel values one time step (1/1000) backwards in time, i.e. from image[n][t] to image[n][t-1], moving from pure noise towards a structured image.
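To make the forward (noising) process concrete, here is a minimal PyTorch sketch of the recipe above. The step size and noise scale are illustrative assumptions chosen for readability, not the exact variance schedule used in diffusion papers.

```python
# A minimal sketch of the forward noising process described above (assumed schedule).
import torch

T = 1000           # number of diffusion steps
shrink = 0.995     # per-step pull towards the mean (assumed value)
noise_std = 0.01   # per-step noise scale (assumed value)

def forward_trajectory(image: torch.Tensor) -> list[torch.Tensor]:
    """Return [x_0, x_1, ..., x_T] for one training image with pixels in [0, 1]."""
    xs = [image]
    x = image
    for _ in range(T):
        # Pull every pixel towards the grey mean of 0.5, then add a tiny bit of noise.
        x = 0.5 + shrink * (x - 0.5) + noise_std * torch.randn_like(x)
        xs.append(x)
    return xs

# Training pairs for the denoiser: predict x_{t-1} from (x_t, t).
# traj = forward_trajectory(torch.rand(1, 32, 32))
# pairs = [(traj[t], traj[t - 1], t) for t in range(1, T + 1)]
```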

Guided (Conditional) Diffusion Model Explanation
The reasoning above can be extended by breaking the dataset into subclasses, so that, apart from the input sample, a class label is provided. Both pieces of information can be supplied jointly to the diffusion model. Further, one can imagine not just a class label but a free-text description encoded into a fixed-length pre-trained embedding vector accompanying the image. However, it's crucial to underline a significant difference with regression-based modelling, where the label is a continuous property value and the goal is to maximise or minimise that value for a previously unseen, generated sample.
An important distinction is that in image generation, the text-description embedder is typically trained jointly with the image generator so that conditioning information, and consequently descriptive user prompts, is well integrated. In regression models for tasks like molecular optimisation, the actual value of the target label is used as one of the objectives when training the guidance model: the model is explicitly trained to predict specific target values, and its predictions then guide the generation process.
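The mechanics of conditioning are simple to sketch: the conditioning information is passed to the denoiser alongside the noisy sample. The toy architecture below (a plain MLP over flattened inputs) and the dimensions are illustrative assumptions, not the paper's architecture.

```python
# A hedged sketch of supplying conditioning information jointly with the noisy sample.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, sample_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        # +1 input feature for the normalised time step t / T
        self.net = nn.Sequential(
            nn.Linear(sample_dim + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, sample_dim),
        )

    def forward(self, x_t: torch.Tensor, cond: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: (batch, sample_dim) noisy sample; cond: (batch, cond_dim) class label
        # or pre-trained text embedding; t: (batch, 1) normalised time step.
        # The denoiser predicts the slightly less noisy sample x_{t-1}.
        return self.net(torch.cat([x_t, cond, t], dim=-1))
```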
Standard Guided Model Training
A typical guidance model consists of three parts:
An embedding function H that converts the sample during its formation (at any point between 0 and T) into a fixed-dimensional embedding.
A mean-predicting function that operates on the embedding to predict the mean of the target value (or a vector of k means if the target is k-dimensional).
A variance-predicting function of matching dimensionality for the predicted value.
The authors propose that the mean is well predicted via an inner product of the embedding with a matrix of learned parameters, and that exponentiating the inner product with a second set of learned parameters (which guarantees a positive value) works well for variance prediction. Ideally, H should also be learnable, and experimentation with both warm-started pre-trained embeddings and training H from scratch is recommended. During training, the learned parameters are regularised by adding their L2 norm to the overall loss.
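Putting those three parts together, here is a sketch of such a guidance model, assuming a generic encoder for H. The Gaussian negative log-likelihood used as the fit term and the weight-decay coefficient are assumptions made for illustration; only the structure (linear mean head, exponentiated variance head, L2 penalty on parameters) follows the description above.

```python
# A minimal sketch of the guidance model: embedding H, mean head, variance head, L2 penalty.
import torch
import torch.nn as nn

class GuidanceModel(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, target_dim: int = 1):
        super().__init__()
        self.encoder = encoder                   # the embedding function H
        self.W_mu = nn.Linear(embed_dim, target_dim, bias=False)
        self.W_var = nn.Linear(embed_dim, target_dim, bias=False)

    def forward(self, x_t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        h = self.encoder(x_t)                    # fixed-dimensional embedding
        mean = self.W_mu(h)                      # inner product with learned parameters
        var = torch.exp(self.W_var(h))           # exponentiated inner product, always positive
        return mean, var

def training_loss(model: GuidanceModel, x_t, y, weight_decay: float = 1e-4):
    mean, var = model(x_t)
    # Gaussian negative log-likelihood fits both the mean and the variance to the labels
    # (an assumed but common choice for heteroscedastic regression).
    nll = 0.5 * (torch.log(var) + (y - mean) ** 2 / var).mean()
    # L2 norm of the learnable parameters, added to the overall loss.
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return nll + weight_decay * l2
```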
Context-Aware Guided Model Training
Now the authors introduce an additional regularisation term that increases the prediction uncertainty in areas with missing labels. This is considered desirable because it increases the likelihood of discovering more true positives. The authors specifically reference molecular design, highlighting how finding that "needle in a haystack" molecule is an attractive prospect: more positives (both true and false) and fewer false negatives (missed opportunities) are worth investigating. To quote: “(this) enables the conditional denoising process to focus on regions in data space that are near the training data and have the highest likelihood of containing molecules with improved properties.”
The authors work with batches of size M and analyse how the embeddings of the M batch elements covary. They construct a square M x M matrix holding the similarities between all pairs in the batch and use it to regularise the model, enhancing the smoothness of its predictions over the input embedding space. This helps the model avoid predicting extreme values where target values are unknown.
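One way to realise this idea, reusing the hypothetical GuidanceModel sketched above, is shown below. The RBF kernel, the jitter term, and the prior-variance target are illustrative assumptions rather than the paper's exact formulation; the point is that predictions on unlabelled "context" points are pulled towards a smooth, high-uncertainty prior defined through the M x M similarity matrix.

```python
# A hedged sketch of a context-aware regulariser over a batch of M unlabelled context points.
import torch

def context_regulariser(model, x_context, prior_var: float = 1.0, lengthscale: float = 1.0):
    h = model.encoder(x_context)                         # (M, d) embeddings
    # M x M similarity matrix (RBF kernel) between all pairs in the batch,
    # with a small jitter on the diagonal for numerical stability.
    dists = torch.cdist(h, h) ** 2
    K = torch.exp(-dists / (2 * lengthscale ** 2)) + 1e-4 * torch.eye(len(h), device=h.device)
    K_inv = torch.linalg.inv(K)

    mean = model.W_mu(h)
    var = torch.exp(model.W_var(h))

    # Quadratic form m^T K^{-1} m: the (unnormalised) negative log-density of the predicted
    # means under a zero-mean prior with covariance K. It penalises large, non-smooth
    # predictions on unlabelled context points.
    m = mean.squeeze(-1)
    mean_term = m @ K_inv @ m / len(h)
    # Keep the predicted variance close to a large prior variance, i.e. stay uncertain
    # where no labels exist.
    var_term = ((var - prior_var) ** 2).mean()
    return mean_term + var_term
```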
Combining these two objectives, the standard conditional objective of matching predicted values to labels and the enhanced smoothness of the embedding space in regions without labels, leads to improved predictions, which the authors demonstrate across a suite of benchmarks.
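In code, the combination is a weighted sum of the two terms in each training step, reusing the hypothetical training_loss and context_regulariser from the sketches above; lambda_reg is an assumed weighting hyperparameter.

```python
# A short sketch of combining the supervised objective with the context regulariser.
def combined_loss(model, x_train, y_train, x_context, lambda_reg: float = 0.1):
    # Fit the labels where we have them...
    supervised = training_loss(model, x_train, y_train)
    # ...and stay smooth and uncertain where we do not.
    regulariser = context_regulariser(model, x_context)
    return supervised + lambda_reg * regulariser
```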
Summary
It’s great to see such a formally rigorous approach tackling a real problem in biological sciences and, particularly, in molecular design, given the vast unexplored chemical space.
Acknowledgments
Shout out to the paper’s author, Leo Klarner, for helpful clarifications around the motivations for particular choices of targets for unlabelled data and the effects of smoothness constraints.
#embeddings