9 minute read

Are you a deep learning (DL) practitioner who has always been using SGD/Adam? If so, our new optimizer, IVON, might be a better option for your next DL projects.

(And no, I promise IVON isn’t yet another obscure optimizer that performs 1% better while being 100% slower :wink:)

IVON is proposed in our paper “Variational Learning is Effective for Large Deep Networks”, accepted at ICML 2024 as a spotlight paper. This work results from a long collaboration dating back to 2021, and I had the pleasure to implement and revise this optimizer many times with my own hands, and witness how a great idea can be realized and gradually improved.

The goal of this post is to offer an easy introduction to IVON for the general audience, and explain why and how it can help you train neural networks more effectively.

For researchers who wish to understand the technical details, I would recommend this excellent post written by none other than Emti himself, the “mastermind” behind this project and an expert in the field. And of course, please also feel free to read our paper.

So, what is “eye-von” exactly?

Glad you pronounce it right! :grin: IVON is an optimizer derived from a Bayesian modeling of network parameters, and it uses natural gradient mean-field variational inference.

IVON updates. See our paper for notations and further details.

Don’t worry if you don’t understand these technical terms. IVON is, just like SGD or Adam, a general-purpose neural network optimizer, and it can be used as a drop-in replacement in your deep learning projects.

What sets IVON apart is that it does not look for a specific set of parameters like SGD or Adam, but instead accounts for the uncertainty of model choice and tries to fit a Gaussian distribution over the most probable parameter configurations.

Instead of finding a specific set of parameters, IVON fits a Gaussian distribution over the most probable parameter configurations.

This can be interesting especially for two reasons:

  1. It locates flat minima in the loss landscape, which is hypothesized to generalize better (i.e. better test-time performance).
  2. The learned Gaussian encodes additional variance information which, as we will see shortly, opens many new possibilities.

We have provided a Google Colab notebook for the illustrated 2D example, feel free to try it out!

Why use IVON?

There are already so many advanced DL optimizers out there, why would IVON deserve a spotlight? For me, there are three main reasons which make IVON special.

IVON is fast and scalable

It is easy to outperform the rather basic SGD and Adam optimizers, but harder to do so without significant performance overhead.

Luckily, IVON is both fast and scalable. Our experiments show that (see figure below), for image classification tasks on CIFAR-10/ImageNet, IVON exhibits nearly no overhead; For GPT-2 training, IVON has less than 20% runtime overhead compared to Adam.1

Oh, did I mention that IVON is the first variational learning method to successfully train GPT-2? We are able to scale IVON to train a GPT-2 model with 773 million parameters on OpenWebText from scratch, and we achieve better results compared to AdamW2.

IVON improves test results and uncertainty estimation

Yes, rest assured that IVON can consistently outperform SGD and Adam in our experiments, for both CV and NLP tasks.

IVON results on ImageNet using ResNet50 (left) and OpenWebText with GPT-2 models (right), outperforming the respective standard baselines without significant overhead. This shows that IVON is effective for large-scale CV and NLP tasks.

Being a Bayesian deep learning (BDL) method, IVON also enables high-quality uncertainty estimation. In fact, an earlier version of IVON won first place in the NeurIPS BDL competition in 2021.

Probabilistic predictions of a 1D regression task with AdamW (left) and IVON (right). We see that IVON produces sensible uncertainty estimation whereas AdamW fails to do so.

You can try out this example with this Google Colab notebook.

IVON redefines the NN training

I know, I know, this is a bold claim. But this is actually the part I am the most excited about. Allow me to explain.

On top of the mean, the variance from the Gaussian distribution learned by IVON provides a new set of information. Our paper shows that this additional information can be used to achieve many cool things:

  • Sensitivity analysis: we can identify the most influential samples, and less important data can also be spotted and potentially pruned.
  • Model merging: we can easily merge IVON-trained models that are fine-tuned on different datasets to obtain an improved or edited model without retraining.
  • Data-free validation: we can directly estimate leave-one-out cross-validation loss, so early stopping can be done without a separate validation set and all data can be used for training.
  • Uncertainty-awareness: the model can produce sensible uncertainty, and produce low confidence for unknown out-of-distribution samples, saying “I don’t know” instead of wrong prediction with high confidence.

We have discussed all these points in detail in our paper.

Most and least sensitive shark images during an ImageNet training with IVON.

Note that all these are possible at any time of the training process, since the Gaussian distribution is constantly tracked by IVON!

These new possibilities lead me to rethink how neural networks can be trained, and to realize that simple optimizers like SGD and Adam are probably not the end of the story.

How to use IVON in your projects?

Wanna try out the IVON optimizer? We make it easy for you! :wink:

Installation

We provide an official PyTorch implementation of IVON as a PyPI package. You can install it with the command

pip install ivon-opt

Setting up the optimizer

Our ivon-opt package simply contains an IVON class which is a PyTorch optimizer (inherits from torch.optim.Optimizer). We can import and instantiate it as follows:

import ivon

# define your PyTorch model here
model = ...

# define the IVON optimizer like this
optimizer = ivon.IVON(
    model.parameters(),
    lr=...,
    ess=...,
    hess_init = ...,
    beta1 = ...,
    beta2 = ...,
    weight_decay = ...,
    ...
)

Similar to Adam, IVON also has hyperparameters such as learning rate (lr), weight decay, and two momentums (beta1, beta2). Additionally, it asks for the effective sample size (ess) to have a rough idea of how many training samples are out there, as well as an optional hess_init argument to adjust the initial variance of the Gaussian it is tracking.

Some quick tips for setting up the arguments of IVON:

  • lr: should typically be higher than SGD;
  • ess: setting to the sample size of the training set is a good start;
  • hess_init: typically 0.01 ~ 1 works well. Too small values lead to too noisy weight samples and unstable training, whereas too large ones lead to near deterministic behavior;
  • beta1: typical gradient momentum, 0.9 is a good start;
  • beta2: should be rather close to 1 (e.g. 0.99999), otherwise instabilities might occur;
  • weight_decay: similar to the values used for SGD/Adam.

We have provided a detailed guide in the appendix of our paper for setting up these hyperparameters of IVON, feel free to check it out!

Training with IVON

Training with IVON is largely the same as the usual routine with SGD/Adam, except for an additional weight sampling step. for IVON to do Bayesian learning, it needs to collect sample gradients from weights drawn from the current estimated Gaussian weight distribution.

This can be achieved by wrapping the feed-forward/backward step with a Python context method from IVON, .sampled_params(train=True), which temporarily injects a weight sample into the model. A typical training step with IVON looks like the following:

with optimizer.sampled_params(train=True):
    optimizer.zero_grad()
    train_output_sample = model(train_input)
    loss_sample = lossfn(train_output_sample, target)
    loss_sample.backward()
optimizer.step()

Note that the argument train=True is necessary, as it tells IVON to collect the sample gradient when exiting the context.

Additionally, as we have shown in our paper, if you are willing to spend extra computational budget, you can even draw multiple samples to further improve the training. This can be easily achieved with a for loop:

for _ in range(train_mc_samples):
    with optimizer.sampled_params(train=True):
        optimizer.zero_grad()
        train_output_sample = model(train_input)
        loss_sample = lossfn(train_output_sample, target)
        loss_sample.backward()
optimizer.step()

Running evaluations

There are two ways to do inference with models trained by IVON.

1. Bayesian prediction

This is the inference method that gets the full benefit out of the Bayesian modeling of IVON. In this case, we average the probabilistic predictions from multiple weight samples drawn from the Gaussian distribution learned by IVON.

Our paper shows that increasing sample size leads to improved performance, and typically around 16~64 weight samples are sufficient to obtain good results.

To implement Bayesian prediction, we again use the sample context .sampled_params from the ivon optimizer:

output_samples = []
for _ in range(eval_mc_samples):
    with optimizer.sampled_params():
        eval_output_sample = model(eval_input)
    output_samples.append(eval_output_sample)

You can then average the prediction samples to get the final probabilistic prediction, which for classification tasks can be implemented as follows3:

# log probability of BMA prediction + log(num_classes)
bma_logits = torch.logsumexp(
    # stacked sample log probabilities
    torch.log_softmax(
        torch.stack(output_samples, axis=0), 
        axis=-1,
    ),  
    axis=0,
) 

2. Fast estimate at the mean

If you need fast inference speed. You can also directly use the mean of the Gaussian weight distribution. This approach performs the so-called MAP estimation. In this case, as the mean is stored in your model, you can do the inference the usual way without drawing any weight sample:

eval_output = model(eval_input)

This point estimate at the mean also produces good test-time predictions. However, it tends to have sub-optimal uncertainty estimation similar to SGD/Adam.

Examples using IVON

Wanna quickly try out IVON? On top of the two Google Colab notebooks on the 2D loss landscape and 1D regression examples, we have also provided another Colab notebook on MNIST image classification. This MNIST example should give you a concrete idea of how IVON is used in typical deep-learning setups.

Also, we will release the source code that reproduces the experiments in our paper, stay tuned!

How about JAX …

Great choice :wink: I got your back! I have also implemented IVON with JAX, checkout this github repository.

Conclusion

I hope that this post has piqued your interest in our IVON optimizer. Please feel free to try it in your projects and let us know what you think!

Until next time :wink:

  1. This overhead can be reduced further, since the current implementation is not fully optimized for best performance. We are working on it now. 

  2. An improved variant of Adam that we use for Adam baselines. 

  3. Computations are done in log domain for better numerical precision.