A fairly common sub-problem in machine learning and data science scenarios is computing the distance(or difference or similarity) between two datasets. At first thought, this sounds easy but in fact the problem is extemely difficult. If you try to compare individual lines between datasets, you run into the combinatorial explosion problem — there are just too many comparisons. There are also the problems of dealing with different dataset sizes, and dealing with non-numeric data. The bottom line is that there is no simple way to calculate the distance between two datasets.

Several months ago, I came up with an interesting technique for the distance between datasets P and Q. First, transform each data line of P and Q into an embedded vector representation using a deep neural autoencoder. For example, if you set the embedding dim to 3, a line of data like (male, 29, developer, $82,000.00, Anaheim, 0.8827) might be transformed into a vector like [0.7216, 0.8911, 0.2336].

Next, each dataset representation determines a frequency distribution for each latent variable. For example:

P 1: [0.10, 0.00, 0.05, 0.05, 0.20, 0.10, 0.00, 0.30, 0.10, 0.10]
Q 1: [0.05, 0.10, 0.00, 0.15, 0.20, 0.10, 0.20, 0.10, 0.00, 0.10]

This means that for latent varaible 1 (of the three), in dataset P, 10% of the data items are between 0.00 and 0.10, 0% are between 0.10 and 0.20, 5% are between 0.20 and 0.30, and so on. If there were 3 latent variables then there'd also be a P2, Q2 pair and a P3 Q3 pair.

In my original thought, I figured to use the Kullback-Leibler divergence to compute the difference between the P and Q frequency distributions. Then later I experimented with using Jensen-Shannon distance. Finally, based on the weaknesses of KL and JS, I experimented with using a highly simplified version of the Wasserstein distance. So the idea is to compute the three distances between the three different P and Q distributions using Wasserstein. And last, the average of the three Wasserstein distances gives the final distance between P and Q.

To test this idea, I coded it up using PyTorch. Then I created a reference dataset P that is 100 lines of the UCI Digits dataset. Each line represents a crude 8×8 handwritten digit and has 64 pixel values between 0 and 16 followed by the class label between 0 and 9. Then I made 10 different Q datasets where each is based on P but with a different percentage of lines where the values have been assigned randomly — 0%, 10%, 20% . . 100%. The idea is that if P and Q are the same, the distance should be 0, and as the percentage of randomization increases, the distance should increase.

My experiment worked very nicely and I was pleased.

Dataset distance can be used in several ways. For example, if you are taking a sample of data from a large dataset, you can use dataset distance to measure how close the sample is to the parent population.



The alien project (see alien-project.org) is an artificial life simulation that allows users to create digital organisms that evolve in artificial ecosystems. Artificial life is one of the topics that fostered my love for computer science. Ultimately, artificial life has more value as art than as science because the distance between artificial and real is too large, but that's OK.