In this post, I present an example of estimating the Kullback-Leibler (KL) divergence between two continuous distributions using the Monte Carlo technique. Whoa! Just stating the problem has a massive amount of information.

The KL divergence is the key part of the ELBO (evidence lower bound) loss function, which in turn is the key to the creation of Variational Autoencoder (VAE) neural systems that generate synthetic data.

You can tell right away that there are dozens of interrelated topics here -- individually they are moderately complicated, but when you combine all of them, things get fantastically complicated. This is the general challenge in all machine learning techniques.

A good way to think about what a continuous distribution is, for the purposes of this post, is to focus on one specific example: the Normal (aka Gaussian) distribution, which is characterized completely by a mean (mu, the average value) and a standard deviation (sd, sigma, a measure of spread). If you graph the Normal distribution you get a bell-shaped curve with X running from -infinity to +infinity. Any X value is possible (hence "continuous"), but if you pick an X randomly it will almost always land within 4 * sigma of the mean. The curve is called the probability density function (PDF). The height of the PDF curve at a value x is not the probability of x, but it is a measure of how likely x is.
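For reference, the PDF of a Normal distribution with mean mu and standard deviation sigma is:

  pdf(x) = 1 / (sigma * sqrt(2 * pi)) * exp( -(x - mu)^2 / (2 * sigma^2) )

For the standard Normal (mu = 0, sigma = 1), the peak of the bell curve at x = 0 has height 1 / sqrt(2 * pi), or about 0.3989 -- a density, not a probability.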

The Kullback-Leibler divergence is a number that describes how different two distributions (P and Q) are. If KL(P, Q) = 0, the two distributions are the same. The larger the value of KL(P, Q), the greater the difference between P and Q. Note that because of the way KL is defined mathematically, KL(P, Q) is not necessarily equal to KL(Q, P) -- KL is not symmetric. KL can be defined for continuous distributions (the topic of this post) or for discrete distributions (a different topic).
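In symbols, the divergence between two continuous distributions is an integral over their PDFs. For the direction used later in this post, where the sample values are drawn from Q, it is:

  KL = integral of q(x) * ln( q(x) / p(x) ) dx
     = the average of [ ln q(x) - ln p(x) ] over x values drawn from Q

Swapping the roles of p and q gives the other direction, which is why the two orderings generally give different numbers.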

The Monte Carlo technique loosely means "an estimate computed by simulation". Suppose you wanted to estimate the probability of getting Heads if you flip a specific coin. You could flip the coin 100 times. If you got 53 Heads (and 47 Tails) your estimate of pr(Heads) = 53/100 = 0.53.
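Here is a minimal sketch of that coin-flip estimate in PyTorch (the 0.65 bias value and the seed are made up just for illustration):

  import torch as T
  T.manual_seed(1)
  flips = T.bernoulli(T.full((100,), 0.65))  # 100 simulated flips; 1 = Heads
  pr_heads = flips.mean().item()             # Monte Carlo estimate of pr(Heads)
  print("Estimated pr(Heads) = %0.2f " % pr_heads)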

The demo has three parts. The first part sets up P and Q and illustrates key concepts. The second part computes a KL(P, Q) divergence using the Monte Carlo technique. The third part computes KL for the same P and Q using a magic math equation.

The first part sets up P as a Normal distribution with mean = 0.0 and standard deviation = 1.0 -- this is the Standard Normal distribution. A second distribution Q with mean = 3.0 and standard deviation = 1.4 is created. Next a value z = 2.5 is set:

  p_mu = T.tensor([0.0], dtype=T.float32)
  p_sd = T.tensor([1.0], dtype=T.float32)

  q_mu = T.tensor([3.0], dtype=T.float32)
  q_sd = T.tensor([1.4], dtype=T.float32)

  P = T.distributions.Normal(p_mu, p_sd)
  Q = T.distributions.Normal(q_mu, q_sd)

  print("\nSetting z = 2.5 ")
  z = T.tensor([2.5], dtype=T.float32)

I use PyTorch tensors but the exact same ideas work with NumPy arrays too.
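For example, a rough NumPy/SciPy sketch of the same setup (assuming scipy.stats is available) looks like this; the frozen norm objects play the role of the PyTorch Normal distribution objects:

  import numpy as np
  from scipy.stats import norm

  P = norm(loc=0.0, scale=1.0)   # standard Normal N(0, 1)
  Q = norm(loc=3.0, scale=1.4)   # N(3, 1.4)
  z = 2.5
  print("%0.4f" % np.exp(P.logpdf(z)))  # logpdf() is the analog of log_prob()
  print("%0.4f" % np.exp(Q.logpdf(z)))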

Next, the demo computes two PDF values:

  log_p_df_z = P.log_prob(z)
  p_df_z = T.exp(log_p_df_z).item()

  log_q_df_z = Q.log_prob(z)
  q_df_z = T.exp(log_q_df_z).item()

  print("Semi-likelihood z comes from P = %0.4f " % p_df_z)
  print("Semi-likelihood z comes from Q = %0.4f " % q_df_z)

The P.log_prob(z) function gives the log of the P = N(0,1) distribution's PDF value at z = 2.5. The log() is used because 1.) raw PDF values are often extremely small, and 2.) log() values are much easier to work with. The minor downside is that the log() of a value less than 1 is negative. I use T.exp() to convert the values from log form back to raw form. I write "semi-likelihood" to emphasize that the values aren't true probabilities.

The likelihood value that z = 2.5 came from P = N(0,1) is 0.0175 and the likelihood z came from Q = N(3, 1.4) is 0.2674. In other words, it's more likely z came from Q than from P. This makes sense because z is much closer to the mean of Q than it is to the mean of P.
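As a quick sanity check, the 0.0175 value can also be computed by hand from the Normal PDF formula given earlier:

  import math
  mu = 0.0; sd = 1.0; z = 2.5
  pdf = math.exp(-(z - mu)**2 / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))
  print("%0.4f" % pdf)  # 0.0175 -- matches T.exp(P.log_prob(z))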

The second part of the demo uses the concepts to estimate KL(P,Q) using the Monte Carlo technique:

  n_samples = 100

  print("\nEstimating KL(P,Q) using %d samples " % n_samples)
  sum = 0.0
  for i in range(n_samples):
    z = Q.rsample()
    kl = Q.log_prob(z) - P.log_prob(z)  # magic math
    sum += kl.item()

  KL = sum / n_samples
  print("Estimated Monte Carlo KL = %0.4f " % KL)

The code iterates 100 times. In each iteration a random value z is drawn from the Q distribution using the rsample() method. The log-PDF likelihood values that z comes from P and from Q are computed. The magic math is that each difference between the two log-PDF values is a single-sample estimate of the KL divergence. The 100 per-sample values are summed and then averaged. The result is 4.3975, which indicates the two distributions aren't very close.
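The explicit loop keeps the logic clear. One possible vectorized variant, using the same P, Q, and n_samples as above and drawing all the samples at once (different random draws, so a slightly different estimate), is:

  zs = Q.rsample((n_samples,))  # shape (n_samples, 1)
  KL = T.mean(Q.log_prob(zs) - P.log_prob(zs)).item()
  print("Estimated Monte Carlo KL = %0.4f " % KL)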

The last part of the demo uses some fancy math to compute KL(P,Q) directly, without any sampling. It turns out that there's a special case when P is the standard Normal (mean = 0 and sd = 1). In that special case, KL(P,Q) can be computed exactly like so:

  print("\nEstimating KL(P,Q) using special N(0,1) math ")  q_logvar = T.log(q_sd.pow(2))  KL = -0.5 * T.sum(1 + q_logvar - q_mu.pow(2) - q_logvar.exp())  print("Special case computed KL = %0.4f " % KL)  

This is not at all obvious and is one of the many "magic math" equations in machine learning. Because of this very nice property, systems that need to compute KL for two distributions often assume one or both are N(0,1) even when they might not be -- the assumption makes calculations much, much easier.
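As a sanity check, recent versions of PyTorch also provide a built-in analytic KL for a pair of Normal distribution objects. Note the argument order: kl_divergence(Q, P) averages the log-PDF differences under Q, which is the same quantity the demo loop estimates:

  kl_exact = T.distributions.kl_divergence(Q, P)
  print("Built-in analytic KL = %0.4f " % kl_exact.item())  # roughly 4.64, the same value the special-case equation gives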

In machine learning, computing KL divergence between two distributions is most often seen in variational autoencoders, but the technique is used in several other problem scenarios too, such as Bayesian neural networks where each network weight is a Normal distribution instead of a single fixed value.



I've always thought that an effective travel poster is one that finds the right distance between reality and a happy fantasy ideal. Here are three old Hawaii travel posters like the ones that contributed to a huge increase in Hawaii tourism starting in the late 1950s. Left: By artist Bern Hill. Center: By artist Paul Lawler. Right: By artist Joseph Feher.


Complete demo code:

  # kl_estimate_demo.py

  # import numpy as np
  import torch as T

  print("\nBegin KL continuous Monte Carlo demo ")
  T.manual_seed(0)

  print("\nCreating P = (N, 0, 1) and Q = (N, 3, 1.4) ")
  p_mu = T.tensor([0.0], dtype=T.float32)
  p_sd = T.tensor([1.0], dtype=T.float32)

  q_mu = T.tensor([3.0], dtype=T.float32)
  q_sd = T.tensor([1.4], dtype=T.float32)

  P = T.distributions.Normal(p_mu, p_sd)
  Q = T.distributions.Normal(q_mu, q_sd)

  print("\nSetting z = 2.5 ")
  z = T.tensor([2.5], dtype=T.float32)

  log_p_df_z = P.log_prob(z)
  p_df_z = T.exp(log_p_df_z).item()

  log_q_df_z = Q.log_prob(z)
  q_df_z = T.exp(log_q_df_z).item()

  print("Semi-likelihood z comes from P = %0.4f " % p_df_z)  # 0.0175
  print("Semi-likelihood z comes from Q = %0.4f " % q_df_z)  # 0.2674

  n_samples = 100

  print("\nEstimating KL(P,Q) using %d samples " % n_samples)
  sum = 0.0
  for i in range(n_samples):
    z = Q.rsample()
    kl = Q.log_prob(z) - P.log_prob(z)  # magic math
    sum += kl.item()

  KL = sum / n_samples
  print("Estimated Monte Carlo KL = %0.4f " % KL)

  print("\nEstimating KL(P,Q) using special N(0,1) math ")
  # KL = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
  q_logvar = T.log(q_sd.pow(2))
  KL = -0.5 * T.sum(1 + q_logvar - q_mu.pow(2) - q_logvar.exp())
  print("Special case computed KL = %0.4f " % KL)

  print("\nEnd demo ")