A few days ago I coded up a demo of anomaly detection using principal component analysis (PCA) reconstruction error. I implemented the PCA functionality -- computation of the transformed data, the principal components, and the variance explained by each component -- from semi-scratch, meaning I used the NumPy linalg (linear algebra) library eig() function to compute eigenvalues and eigenvectors.

And it was good.

But in the back of my mind, I was thinking that I should have verified my semi-from-scratch implementation of PCA because PCA is very, very complex and I could have made a mistake.


The from-scratch version (left) and the scikit version (right) are identical except that some of the transformed vectors and principal components differ by a factor of -1. This doesn't affect the results.

So I took my original from-scratch PCA anomaly detection program and swapped out my PCA implementation for the one in the scikit sklearn.decomposition library. And as expected, the results of the scikit-based PCA program were identical to the results of the from-scratch PCA program. Almost.

My from-scratch code looks like:

  import numpy as np

  def my_pca(X):
    # returns transformed X, prin components, var explained
    dim = len(X[0])  # n_cols
    means = np.mean(X, axis=0)
    z = X - means  # avoid changing X
    square_m = np.dot(z.T, z)
    (evals, evecs) = np.linalg.eig(square_m)
    trans_x = np.dot(z, evecs[:,0:dim])
    prin_comp = evecs.T
    v = np.var(trans_x, axis=0, ddof=1)
    sv = np.sum(v)
    ve = v / sv
    # order everything based on variance explained
    ordering = np.argsort(ve)[::-1]  # sort order high to low
    trans_x = trans_x[:,ordering]
    prin_comp = prin_comp[ordering,:]
    ve = ve[ordering]
    return (trans_x, prin_comp, ve)

  X = (load data from somewhere)
  (trans_x, p_comp, ve) = my_pca(X)
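
One subtle point in the from-scratch version: np.linalg.eig() makes no guarantee about the order of the eigenvalues it returns, which is why everything is explicitly sorted by variance explained afterward. Because the scatter matrix z.T * z is symmetric, np.linalg.eigh() is a possible alternative; it guarantees real-valued results and returns eigenvalues in ascending order. A minimal sketch of that variant:

  # alternative using eigh(), which is intended for symmetric matrices;
  # eigenvalues come back sorted low to high, so reverse to put the
  # largest-variance components first
  (evals, evecs) = np.linalg.eigh(square_m)
  evals = evals[::-1]    # descending order
  evecs = evecs[:,::-1]  # reorder columns to match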

The scikit-based code looks like:

  import numpy as np
  import sklearn.decomposition

  X = (load data from somewhere)
  pca = sklearn.decomposition.PCA().fit(X)
  trans_x = pca.transform(X)
  p_comp = pca.components_
  ve = pca.explained_variance_ratio_
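
As a side note, scikit also allows the fit and transform steps to be combined, and the number of retained components can be set up front. A minimal sketch, with n_components chosen just for illustration:

  # equivalent shortcut: fit and transform in one call, keeping
  # only the first two principal components
  pca = sklearn.decomposition.PCA(n_components=2)
  trans_x = pca.fit_transform(X)  # shape (n_rows, 2)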

All the results were identical except that the internal transformed X values and the principal components sometimes differed by a factor of -1. As it turns out, this is OK because PCA is based on variances, and multiplying values by -1 doesn't change their variance.
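
A quick way to verify that two PCA implementations agree up to sign is to flip each column so its largest-magnitude entry is positive, then compare. A minimal sketch, where trans_x_scratch and trans_x_scikit are hypothetical names for the transformed data from the two versions above:

  # align signs column by column, then compare; the names
  # trans_x_scratch and trans_x_scikit are placeholders
  def align_signs(m):
    idx = np.argmax(np.abs(m), axis=0)  # row of largest |value| per col
    signs = np.sign(m[idx, np.arange(m.shape[1])])
    return m * signs  # flips columns whose largest entry is negative

  same = np.allclose(align_signs(trans_x_scratch),
    align_signs(trans_x_scikit), atol=1.0e-6)
  print(same)  # True if identical up to sign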

The advantage of using scikit PCA is simplicity. The advantages of using PCA from scratch are that 1.) you get fine-tuned control, 2.) you remove an external dependency, and 3.) you aren't relying on a mysterious black box.

PCA is interesting and sometimes useful, but for tasks like dimensionality reduction and reconstruction, deep neural techniques have largely replaced PCA.
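
For example, a tiny autoencoder can play the same role as projecting onto one principal component and then reconstructing. This is just a rough sketch, assuming PyTorch is installed; the layer sizes and training settings are illustrative, not tuned:

  import torch as T

  class AutoEncoder(T.nn.Module):
    # 4 features -> 1 latent value -> 4 features: the neural analog of
    # projecting onto one principal component and reconstructing
    def __init__(self):
      super().__init__()
      self.enc = T.nn.Linear(4, 1)
      self.dec = T.nn.Linear(1, 4)
    def forward(self, x):
      return self.dec(self.enc(x))

  X = (load data from somewhere)  # same 4-column data as above
  x = T.tensor(X, dtype=T.float32)
  net = AutoEncoder()
  opt = T.optim.Adam(net.parameters(), lr=0.01)
  for epoch in range(1000):
    opt.zero_grad()
    loss = T.nn.functional.mse_loss(net(x), x)  # reconstruction loss
    loss.backward()
    opt.step()
  recon_errs = T.sum((net(x) - x)**2, dim=1)  # per-item anomaly scores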



PCA was developed in 1901 by the famous statistician Karl Pearson. I wonder if statisticians of that era imagined today's deep neural technologies. Three images from the movie "Things to Come" (1936), based on the 1933 novel "The Shape of Things to Come" by H. G. Wells.


Demo code:

  # pca_recon_skikit.py
  # exactly replicates iris_pca_recon.py scratch version

  import numpy as np
  import sklearn.decomposition

  def reconstructed(X, n_comp, trans_x, p_comp):
    means = np.mean(X, axis=0)
    result = np.dot(trans_x[:,0:n_comp], p_comp[0:n_comp,:])
    result += means
    return result

  def recon_error(X, XX):
    diff = X - XX
    diff_sq = diff * diff
    errs = np.sum(diff_sq, axis=1)
    return errs

  def main():
    print("\nBegin Iris PCA reconstruction using scikit ")
    np.set_printoptions(formatter={'float': '{: 0.1f}'.format})

    X = np.array([
      [5.1, 3.5, 1.4, 0.2],
      [5.4, 3.9, 1.7, 0.4],
      [6.4, 3.2, 4.5, 1.5],
      [5.7, 2.8, 4.5, 1.3],
      [7.2, 3.6, 6.1, 2.5],
      [6.9, 3.2, 5.7, 2.3]])

    print("\nSource X: ")
    print(X)

    print("\nPerforming PCA computations ")
    pca = sklearn.decomposition.PCA().fit(X)
    trans_x = pca.transform(X)
    p_comp = pca.components_
    ve = pca.explained_variance_ratio_
    print("Done ")

    print("\nTransformed X: ")
    np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
    print(trans_x)

    print("\nPrincipal components: ")
    np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
    print(p_comp)

    print("\nVariance explained: ")
    np.set_printoptions(formatter={'float': '{: 0.5f}'.format})
    print(ve)

    XX = reconstructed(X, 4, trans_x, p_comp)
    print("\nReconstructed X using all components: ")
    np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
    print(XX)

    XX = reconstructed(X, 1, trans_x, p_comp)
    print("\nReconstructed X using one component: ")
    np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
    print(XX)

    re = recon_error(X, XX)
    print("\nReconstruction errors using one component: ")
    np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
    print(re)

    print("\nEnd PCA scikit ")

  if __name__ == "__main__":
    main()
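
To close the loop on the original anomaly detection idea: the items with the largest reconstruction errors are the most anomalous. A minimal sketch of a follow-on step, where re is the errors array returned by recon_error() above:

  # hypothetical follow-on: rank items from most to least anomalous
  ordering = np.argsort(re)[::-1]  # indices sorted by error, high to low
  for i in ordering:
    print("item " + str(i) + " error = " + str(re[i]))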