A few days ago I coded up a demo of anomaly detection using principal component analysis (PCA) reconstruction error. I implemented the PCA functionality -- computation of the transformed data, the principal components, and the variance explained by each component -- from semi-scratch, meaning I used the NumPy linalg (linear algebra) library eig() function to compute eigenvalues and eigenvectors.
And it was good.
But in the back of my mind, I was thinking that I should have verified my semi-from-scratch implementation of PCA because PCA is very, very complex and I could have made a mistake.
The from-scratch version (left) and the scikit version (right) are identical except that some of the transformed vectors and principal components differ by a factor of -1. This doesn't affect anything.
So I took my original from-scratch PCA anomaly detection program and swapped in the PCA implementation from the scikit-learn sklearn.decomposition module. As expected, the results of the scikit-based PCA program were identical to the results of the from-scratch PCA program. Almost.
My from-scratch code looks like:
import numpy as np

def my_pca(X):
    # returns transformed X, prin components, var explained
    dim = len(X[0])  # n_cols
    means = np.mean(X, axis=0)
    z = X - means  # avoid changing X
    square_m = np.dot(z.T, z)
    (evals, evecs) = np.linalg.eig(square_m)
    trans_x = np.dot(z, evecs[:,0:dim])
    prin_comp = evecs.T
    v = np.var(trans_x, axis=0, ddof=1)
    sv = np.sum(v)
    ve = v / sv
    # order everything based on variance explained
    ordering = np.argsort(ve)[::-1]  # sort order high to low
    trans_x = trans_x[:,ordering]
    prin_comp = prin_comp[ordering,:]
    ve = ve[ordering]
    return (trans_x, prin_comp, ve)

X = (load data from somewhere)
(trans_x, p_comp, ve) = my_pca(X)
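Because z.T times z is a symmetric matrix, one possible variation is to use np.linalg.eigh(), which is intended for symmetric matrices and returns real eigenvalues in ascending order. The sketch below is not the original program: the function name my_pca_eigh is made up for illustration, and it computes variance explained directly from the eigenvalues rather than from the transformed data. It also runs a few sanity checks that don't need scikit.

```python
import numpy as np

def my_pca_eigh(X):
    # hypothetical variant of my_pca() that uses eigh() instead of eig()
    means = np.mean(X, axis=0)
    z = X - means                      # centered copy, X is not changed
    square_m = np.dot(z.T, z)          # symmetric (n_cols x n_cols)
    evals, evecs = np.linalg.eigh(square_m)  # eigenvalues come back ascending
    order = np.argsort(evals)[::-1]    # reorder high to low
    evals = evals[order]
    evecs = evecs[:, order]
    trans_x = np.dot(z, evecs)         # project data onto the components
    prin_comp = evecs.T                # one component per row
    ve = evals / np.sum(evals)         # variance explained ratios
    return trans_x, prin_comp, ve

# the six Iris items used in the demo program
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
    [6.4, 3.2, 4.5, 1.5],
    [5.7, 2.8, 4.5, 1.3],
    [7.2, 3.6, 6.1, 2.5],
    [6.9, 3.2, 5.7, 2.3]])

trans_x, p_comp, ve = my_pca_eigh(X)

# sanity checks: ratios sum to 1, components are orthonormal,
# and the ratios match the variances of the transformed columns
v = np.var(trans_x, axis=0, ddof=1)
print(np.isclose(np.sum(ve), 1.0))
print(np.allclose(np.dot(p_comp, p_comp.T), np.eye(X.shape[1])))
print(np.allclose(v / np.sum(v), ve))
```

The last check works because the variance of each transformed column equals the matching eigenvalue divided by (n - 1), so the two ways of computing the ratios agree.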
The scikit-based code looks like:
import numpy as np
import sklearn.decomposition

X = (load data from somewhere)
pca = sklearn.decomposition.PCA().fit(X)
trans_x = pca.transform(X)
p_comp = pca.components_
ve = pca.explained_variance_ratio_
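All the results were identical except that the transformed X values and the principal components sometimes differed by a factor of -1. As it turns out, this is OK: if v is an eigenvector then so is -v, so the sign of each principal component is arbitrary. Flipping the sign of a component also flips the sign of the matching column of transformed X, which leaves the variance explained and the reconstructed data unchanged.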
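One way to check "identical except for sign" in code, rather than by eyeballing two printouts, is sketched below. The helper names sign_flips and same_up_to_sign are made up for illustration. To keep the example NumPy-only, the second set of results is just the first set with two components deliberately flipped, standing in for the output of a different PCA implementation, and the first set is computed with np.linalg.svd() instead of either program above.

```python
import numpy as np

def sign_flips(comps_a, comps_b):
    # +1 or -1 per component, so comps_a[i] * s[i] points the same way as comps_b[i]
    dots = np.sum(comps_a * comps_b, axis=1)  # row-wise dot products
    return np.where(dots < 0, -1.0, 1.0)

def same_up_to_sign(trans_a, comps_a, trans_b, comps_b, tol=1e-8):
    # True if both result sets agree after per-component sign alignment
    s = sign_flips(comps_a, comps_b)
    comps_ok = np.allclose(comps_a * s[:, None], comps_b, atol=tol)
    trans_ok = np.allclose(trans_a * s[None, :], trans_b, atol=tol)
    return bool(comps_ok and trans_ok)

X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
    [6.4, 3.2, 4.5, 1.5],
    [5.7, 2.8, 4.5, 1.3],
    [7.2, 3.6, 6.1, 2.5],
    [6.9, 3.2, 5.7, 2.3]])
z = X - np.mean(X, axis=0)

# result set A: components from an SVD of the centered data
_, _, vt = np.linalg.svd(z, full_matrices=False)
comps_a = vt                      # one component per row
trans_a = np.dot(z, vt.T)

# result set B: same thing with components 1 and 3 negated
flip = np.array([1.0, -1.0, 1.0, -1.0])
comps_b = comps_a * flip[:, None]
trans_b = trans_a * flip[None, :]

agree = same_up_to_sign(trans_a, comps_a, trans_b, comps_b)
recon_equal = bool(np.allclose(np.dot(trans_a, comps_a),
                               np.dot(trans_b, comps_b)))
var_equal = bool(np.allclose(np.var(trans_a, axis=0, ddof=1),
                             np.var(trans_b, axis=0, ddof=1)))
print(agree, recon_equal, var_equal)
```

With real output from two programs, the same same_up_to_sign() call could be pointed at the from-scratch results and the scikit results, assuming both sets are ordered by variance explained.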
The advantage of using scikit PCA is simplicity. The advantages of using PCA from scratch are 1.) you get fine-tuned control, 2.) you remove an external dependency, 3.) you aren't using a mysterious black box.
PCA is interesting and sometimes useful, but for tasks like dimensionality reduction and reconstruction, deep neural techniques have largely replaced PCA.
PCA was developed in 1901 by famous statistician Karl Pearson. I wonder if statisticians of that era imagined today's deep neural technologies.

[Images: three stills from the movie "Things to Come" (1936), based on the novel "The Shape of Things to Come" (1933) by author H. G. Wells.]
Demo code:
# pca_recon_skikit.py
# exactly replicates iris_pca_recon.py scratch version

import numpy as np
import sklearn.decomposition

def reconstructed(X, n_comp, trans_x, p_comp):
    means = np.mean(X, axis=0)
    result = np.dot(trans_x[:,0:n_comp], p_comp[0:n_comp,:])
    result += means
    return result

def recon_error(X, XX):
    diff = X - XX
    diff_sq = diff * diff
    errs = np.sum(diff_sq, axis=1)
    return errs

def main():
    print("\nBegin Iris PCA reconstruction using scikit ")
    np.set_printoptions(formatter={'float': '{: 0.1f}'.format})

    X = np.array([
        [5.1, 3.5, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [6.4, 3.2, 4.5, 1.5],
        [5.7, 2.8, 4.5, 1.3],
        [7.2, 3.6, 6.1, 2.5],
        [6.9, 3.2, 5.7, 2.3]])

    print("\nSource X: ")
    print(X)

    print("\nPerforming PCA computations ")
    pca = sklearn.decomposition.PCA().fit(X)
    trans_x = pca.transform(X)
    p_comp = pca.components_
    ve = pca.explained_variance_ratio_
    print("Done ")

    print("\nTransformed X: ")
    np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
    print(trans_x)

    print("\nPrincipal components: ")
    np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
    print(p_comp)

    print("\nVariance explained: ")
    np.set_printoptions(formatter={'float': '{: 0.5f}'.format})
    print(ve)

    XX = reconstructed(X, 4, trans_x, p_comp)
    print("\nReconstructed X using all components: ")
    np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
    print(XX)

    XX = reconstructed(X, 1, trans_x, p_comp)
    print("\nReconstructed X using one component: ")
    np.set_printoptions(formatter={'float': '{: 0.2f}'.format})
    print(XX)

    re = recon_error(X, XX)
    print("\nReconstruction errors using one component: ")
    np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
    print(re)

    print("\nEnd PCA scikit ")

if __name__ == "__main__":
    main()
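For anomaly detection, the per-item reconstruction errors are the interesting output: the items with the largest errors are the ones least well described by the retained components. A minimal sketch of that last step follows. The helper name most_anomalous is hypothetical, and the error values are made-up numbers, not output from the demo program.

```python
import numpy as np

def most_anomalous(errs, n=1):
    # indices of the n items with the largest reconstruction error
    order = np.argsort(errs)[::-1]   # sort order high to low
    return order[:n]

# made-up reconstruction errors for six items (illustration only)
errs = np.array([0.012, 0.034, 0.251, 0.120, 0.047, 0.009])

idx = most_anomalous(errs, n=2)
for i in idx:
    print("item %d  error = %0.3f" % (i, errs[i]))
```

In the demo program, errs would be the array returned by recon_error(X, XX).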