I was reviewing some research papers that had been submitted to an internal conference at the tech company I work for. I was the Area Chair for the Unsupervised and Semi-Supervised Learning track of the conference. One of the research papers used an old technique for anomaly detection -- principal component analysis (PCA) reconstruction error.

PCA is a classical statistics technique. PCA accepts numeric data and decomposes it into matrices and vectors that form a reduced representation of the source data. If you decompose the source data and then reconstruct it using only part of that internal representation, you get reconstructed data that is close to, but not exactly the same as, the original. If you then compare the reconstructed items with the original source items, the items that differ the most (the largest reconstruction error) are the ones most likely to be anomalies.
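As a quick illustration of the whole pipeline, here is a minimal sketch using the scikit-learn PCA class (an assumption on my part -- the demo later in this post implements PCA from scratch):

  import numpy as np
  from sklearn.decomposition import PCA

  def pca_recon_errors(X, n_comp):
    # project onto the first n_comp components, reconstruct,
    # and return the squared reconstruction error per item
    pca = PCA(n_components=n_comp)
    scores = pca.fit_transform(X)       # reduced representation
    XX = pca.inverse_transform(scores)  # approximate reconstruction
    return np.sum((X - XX)**2, axis=1)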

Anyway, it had been quite some time since I implemented a system for anomaly detection using PCA, so I decided to refresh my memory by coding up a small demo.

I set up six source items that were selected from the well-known Iris Dataset -- two "setosa", two "versicolor", and two "virginica" items:

  [[ 5.1  3.5  1.4  0.2]
   [ 5.4  3.9  1.7  0.4]
   [ 6.4  3.2  4.5  1.5]
   [ 5.7  2.8  4.5  1.3]
   [ 7.2  3.6  6.1  2.5]
   [ 6.9  3.2  5.7  2.3]]

The Iris Dataset is convenient because the data is already somewhat normalized. When using PCA it's important to normalize the source data so that items with large values don't swamp items with small values.
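For data that isn't already on a similar scale, one common choice is z-score normalization. A minimal sketch (my addition, not part of the demo):

  import numpy as np

  def zscore_normalize(X):
    # scale each column to mean 0, standard deviation 1
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)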

When using PCA for reconstruction, if you use all of the computed "principal components", you recreate the source data exactly, which isn't very useful. The Iris data has four columns, so PCA computes four principal components. For my demo, the first principal component accounted for 96% of the total variance, so I used just that first component to reconstruct the data. (A short sketch of this selection-by-variance rule follows the reconstructed data below.) The PCA reconstructed data is:

  [[ 5.08  3.56  1.43  0.16]
   [ 5.22  3.54  1.78  0.33]
   [ 6.32  3.33  4.49  1.61]
   [ 6.22  3.35  4.24  1.49]
   [ 7.01  3.20  6.19  2.41]
   [ 6.84  3.23  5.77  2.21]]
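Regarding the selection rule: a common approach is to keep the smallest number of components whose cumulative variance explained meets a threshold. A sketch, using hypothetical variance-explained values that roughly match the demo:

  import numpy as np

  ve = np.array([0.96, 0.02, 0.015, 0.005])  # hypothetical, sorted high to low
  cum_ve = np.cumsum(ve)                     # [0.96, 0.98, 0.995, 1.0]
  n_comp = int(np.searchsorted(cum_ve, 0.95)) + 1  # first index reaching 0.95
  print(n_comp)  # 1 -- the first component alone is enough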

The squared errors between the six source items and their reconstructions are:

  [ 0.01  0.17  0.03  0.67  0.22  0.02]  

The largest error is 0.67, for the fourth item, so it is the most anomalous.
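Programmatically, flagging the most anomalous item is just an argmax over the per-item errors. A minimal sketch (assuming X is the source data and XX the reconstruction, as above):

  import numpy as np

  def most_anomalous(X, XX):
    # squared reconstruction error per row, then index of the largest
    errs = np.sum((X - XX)**2, axis=1)
    return int(np.argmax(errs)), errs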

PCA anomaly detection can be useful, but the technique is relatively crude compared to techniques such as deep neural autoencoder reconstruction error.
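For a rough sense of the autoencoder alternative, here is a minimal sketch using PyTorch (an assumption -- any deep learning library works; the architecture and hyperparameters are illustrative only):

  import torch

  class AutoEncoder(torch.nn.Module):
    # 4 inputs -> 2 latent -> 4 outputs, sized for Iris-like data
    def __init__(self):
      super().__init__()
      self.enc = torch.nn.Linear(4, 2)
      self.dec = torch.nn.Linear(2, 4)
    def forward(self, x):
      return self.dec(torch.tanh(self.enc(x)))

  def autoencoder_recon_errors(X, epochs=1000):
    # X: float32 tensor of shape (n, 4); returns squared error per item
    net = AutoEncoder()
    opt = torch.optim.Adam(net.parameters(), lr=0.01)
    for _ in range(epochs):
      opt.zero_grad()
      loss = torch.nn.functional.mse_loss(net(X), X)
      loss.backward()
      opt.step()
    with torch.no_grad():
      return torch.sum((net(X) - X)**2, dim=1)

The idea is the same as with PCA: the network learns a compressed representation, and items that reconstruct poorly are flagged as anomalies.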



Tiki restaurants and bars are a reconstruction of an idealized Polynesian environment. Left: The Tonga Room restaurant, located inside the Fairmont Hotel in San Francisco. Right: The Golden Tiki bar, in Las Vegas.


Code:

  # iris_pca_recon.py
  # Iris data PCA reconstruction error

  import numpy as np

  def my_pca(X):
    # returns transformed X, prin components, var explained
    dim = len(X[0])  # n_cols
    means = np.mean(X, axis=0)
    z = X - means  # avoid changing X
    square_m = np.dot(z.T, z)
    (evals, evecs) = np.linalg.eig(square_m)  # 'right-hand'
    trans_x = np.dot(z, evecs[:,0:dim])
    prin_comp = evecs.T
    v = np.var(trans_x, axis=0, ddof=1)  # sample var
    sv = np.sum(v)
    ve = v / sv
    # order everything based on variance explained
    ordering = np.argsort(ve)[::-1]  # sort order high to low
    trans_x = trans_x[:,ordering]
    prin_comp = prin_comp[ordering,:]
    ve = ve[ordering]
    return (trans_x, prin_comp, ve)

  def reconstructed(X, n_comp, trans_x, p_comp):
    # rebuild X from the first n_comp principal components
    means = np.mean(X, axis=0)
    result = np.dot(trans_x[:,0:n_comp], p_comp[0:n_comp,:])
    result += means
    return result

  def recon_error(X, XX):
    # squared error between each source item and its reconstruction
    diff = X - XX
    diff_sq = diff * diff
    errs = np.sum(diff_sq, axis=1)
    return errs

  def main():
    print("\nBegin Iris PCA reconstruction error ")
    np.set_printoptions(precision=1, suppress=True, sign=" ")

    X = np.array([
      [5.1, 3.5, 1.4, 0.2],
      [5.4, 3.9, 1.7, 0.4],
      [6.4, 3.2, 4.5, 1.5],
      [5.7, 2.8, 4.5, 1.3],
      [7.2, 3.6, 6.1, 2.5],
      [6.9, 3.2, 5.7, 2.3]])

    print("\nSource X: ")
    print(X)

    print("\nPerforming PCA computations ")
    (trans_x, p_comp, ve) = my_pca(X)
    print("Done ")

    print("\nTransformed X: ")
    np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
    print(trans_x)

    print("\nPrincipal components: ")
    np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
    print(p_comp)

    print("\nVariance explained: ")
    np.set_printoptions(formatter={'float': '{: 0.5f}'.format})
    print(ve)

    XX = reconstructed(X, 4, trans_x, p_comp)
    print("\nReconstructed X using all components: ")
    np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
    print(XX)

    XX = reconstructed(X, 1, trans_x, p_comp)
    print("\nReconstructed X using one component: ")
    np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
    print(XX)

    re = recon_error(X, XX)
    print("\nReconstruction errors using one component: ")
    np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
    print(re)

    print("\nEnd PCA reconstruction error ")

  if __name__ == "__main__":
    main()