.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "packages/scikit-learn/auto_examples/plot_eigenfaces.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code .. rst-class:: sphx-glr-example-title .. _sphx_glr_packages_scikit-learn_auto_examples_plot_eigenfaces.py: The eigenfaces example: chaining PCA and SVMs ============================================= The goal of this example is to show how an unsupervised method and a supervised one can be chained for better prediction. It starts with a didactic but lengthy way of doing things, and finishes with the idiomatic approach to pipelining in scikit-learn. Here we'll take a look at a simple facial recognition example. Ideally, we would use a dataset consisting of a subset of the `Labeled Faces in the Wild `__ data that is available with :func:`sklearn.datasets.fetch_lfw_people`. However, this is a relatively large download (~200MB) so we will do the tutorial on a simpler, less rich dataset. Feel free to explore the LFW dataset. .. GENERATED FROM PYTHON SOURCE LINES 17-22 .. code-block:: default from sklearn import datasets faces = datasets.fetch_olivetti_faces() faces.data.shape .. rst-class:: sphx-glr-script-out .. code-block:: none downloading Olivetti faces from https://ndownloader.figshare.com/files/5976027 to /opt/buildhome/scikit_learn_data (400, 4096) .. GENERATED FROM PYTHON SOURCE LINES 23-24 Let's visualize these faces to see what we're working with .. GENERATED FROM PYTHON SOURCE LINES 24-32 .. code-block:: default import matplotlib.pyplot as plt fig = plt.figure(figsize=(8, 6)) # plot several images for i in range(15): ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[]) ax.imshow(faces.images[i], cmap=plt.cm.bone) .. image-sg:: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_001.png :alt: plot eigenfaces :srcset: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 33-47 .. tip:: Note is that these faces have already been localized and scaled to a common size. This is an important preprocessing piece for facial recognition, and is a process that can require a large collection of training data. This can be done in scikit-learn, but the challenge is gathering a sufficient amount of training data for the algorithm to work. Fortunately, this piece is common enough that it has been done. One good resource is `OpenCV `__, the *Open Computer Vision Library*. We'll perform a Support Vector classification of the images. We'll do a typical train-test split on the images: .. GENERATED FROM PYTHON SOURCE LINES 47-54 .. code-block:: default from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=0) print(X_train.shape, X_test.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (300, 4096) (100, 4096) .. GENERATED FROM PYTHON SOURCE LINES 55-61 Preprocessing: Principal Component Analysis ------------------------------------------- 1850 dimensions is a lot for SVM. We can use PCA to reduce these 1850 features to a manageable size, while maintaining most of the information in the dataset. .. GENERATED FROM PYTHON SOURCE LINES 61-66 .. code-block:: default from sklearn import decomposition pca = decomposition.PCA(n_components=150, whiten=True) pca.fit(X_train) .. raw:: html

PCA(n_components=150, whiten=True)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

.. GENERATED FROM PYTHON SOURCE LINES 67-69 One interesting part of PCA is that it computes the "mean" face, which can be interesting to examine: .. GENERATED FROM PYTHON SOURCE LINES 69-73 .. code-block:: default plt.imshow(pca.mean_.reshape(faces.images[0].shape), cmap=plt.cm.bone) .. image-sg:: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_002.png :alt: plot eigenfaces :srcset: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_002.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 74-76 The principal components measure deviations about this mean along orthogonal axes. .. GENERATED FROM PYTHON SOURCE LINES 76-79 .. code-block:: default print(pca.components_.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (150, 4096) .. GENERATED FROM PYTHON SOURCE LINES 80-81 It is also interesting to visualize these principal components: .. GENERATED FROM PYTHON SOURCE LINES 81-88 .. code-block:: default fig = plt.figure(figsize=(16, 6)) for i in range(30): ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[]) ax.imshow(pca.components_[i].reshape(faces.images[0].shape), cmap=plt.cm.bone) .. image-sg:: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_003.png :alt: plot eigenfaces :srcset: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 89-96 The components ("eigenfaces") are ordered by their importance from top-left to bottom-right. We see that the first few components seem to primarily take care of lighting conditions; the remaining components pull out certain identifying features: the nose, eyes, eyebrows, etc. With this projection computed, we can now project our original training and test data onto the PCA basis: .. GENERATED FROM PYTHON SOURCE LINES 96-100 .. code-block:: default X_train_pca = pca.transform(X_train) X_test_pca = pca.transform(X_test) print(X_train_pca.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (300, 150) .. GENERATED FROM PYTHON SOURCE LINES 101-103 .. code-block:: default print(X_test_pca.shape) .. rst-class:: sphx-glr-script-out .. code-block:: none (100, 150) .. GENERATED FROM PYTHON SOURCE LINES 104-113 These projected components correspond to factors in a linear combination of component images such that the combination approaches the original face. Doing the Learning: Support Vector Machines ------------------------------------------- Now we'll perform support-vector-machine classification on this reduced dataset: .. GENERATED FROM PYTHON SOURCE LINES 113-118 .. code-block:: default from sklearn import svm clf = svm.SVC(C=5., gamma=0.001) clf.fit(X_train_pca, y_train) .. raw:: html

SVC(C=5.0, gamma=0.001)

.. GENERATED FROM PYTHON SOURCE LINES 119-122 Finally, we can evaluate how well this classification did. First, we might plot a few of the test-cases with the labels learned from the training set: .. GENERATED FROM PYTHON SOURCE LINES 122-133 .. code-block:: default import numpy as np fig = plt.figure(figsize=(8, 6)) for i in range(15): ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[]) ax.imshow(X_test[i].reshape(faces.images[0].shape), cmap=plt.cm.bone) y_pred = clf.predict(X_test_pca[i, np.newaxis])[0] color = ('black' if y_pred == y_test[i] else 'red') ax.set_title(y_pred, fontsize='small', color=color) .. image-sg:: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_004.png :alt: 13, 30, 34, 19, 24, 6, 15, 26, 14, 21, 3, 13, 11, 34, 1 :srcset: /packages/scikit-learn/auto_examples/images/sphx_glr_plot_eigenfaces_004.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 134-143 The classifier is correct on an impressive number of images given the simplicity of its learning model! Using a linear classifier on 150 features derived from the pixel-level data, the algorithm correctly identifies a large number of the people in the images. Again, we can quantify this effectiveness using one of several measures from :mod:`sklearn.metrics`. First we can do the classification report, which shows the precision, recall and other measures of the "goodness" of the classification: .. GENERATED FROM PYTHON SOURCE LINES 143-148 .. code-block:: default from sklearn import metrics y_pred = clf.predict(X_test_pca) print(metrics.classification_report(y_test, y_pred)) .. rst-class:: sphx-glr-script-out .. code-block:: none precision recall f1-score support 0 1.00 0.50 0.67 6 1 1.00 1.00 1.00 4 2 0.50 1.00 0.67 2 3 1.00 1.00 1.00 1 4 0.33 1.00 0.50 1 5 1.00 1.00 1.00 5 6 1.00 1.00 1.00 4 7 1.00 0.67 0.80 3 9 1.00 1.00 1.00 1 10 1.00 1.00 1.00 4 11 1.00 1.00 1.00 1 12 0.67 1.00 0.80 2 13 1.00 1.00 1.00 3 14 1.00 1.00 1.00 5 15 1.00 1.00 1.00 3 17 1.00 1.00 1.00 6 19 1.00 1.00 1.00 4 20 1.00 1.00 1.00 1 21 1.00 1.00 1.00 1 22 1.00 1.00 1.00 2 23 1.00 1.00 1.00 1 24 1.00 1.00 1.00 2 25 1.00 0.50 0.67 2 26 1.00 0.75 0.86 4 27 1.00 1.00 1.00 1 28 0.67 1.00 0.80 2 29 1.00 1.00 1.00 3 30 1.00 1.00 1.00 4 31 1.00 1.00 1.00 3 32 1.00 1.00 1.00 3 33 1.00 1.00 1.00 2 34 1.00 1.00 1.00 3 35 1.00 1.00 1.00 1 36 1.00 1.00 1.00 3 37 1.00 1.00 1.00 3 38 1.00 1.00 1.00 1 39 1.00 1.00 1.00 3 accuracy 0.94 100 macro avg 0.95 0.96 0.94 100 weighted avg 0.97 0.94 0.94 100 .. GENERATED FROM PYTHON SOURCE LINES 149-153 Another interesting metric is the *confusion matrix*, which indicates how often any two items are mixed-up. The confusion matrix of a perfect classifier would only have nonzero entries on the diagonal, with zeros on the off-diagonal: .. GENERATED FROM PYTHON SOURCE LINES 153-156 .. code-block:: default print(metrics.confusion_matrix(y_test, y_pred)) .. rst-class:: sphx-glr-script-out .. code-block:: none [[3 0 0 ... 0 0 0] [0 4 0 ... 0 0 0] [0 0 2 ... 0 0 0] ... [0 0 0 ... 3 0 0] [0 0 0 ... 0 1 0] [0 0 0 ... 0 0 3]] .. GENERATED FROM PYTHON SOURCE LINES 157-166 Pipelining ---------- Above we used PCA as a pre-processing step before applying our support vector machine classifier. Plugging the output of one estimator directly into the input of a second estimator is a commonly used pattern; for this reason scikit-learn provides a ``Pipeline`` object which automates this process. The above problem can be re-expressed as a pipeline as follows: .. GENERATED FROM PYTHON SOURCE LINES 166-177 .. code-block:: default from sklearn.pipeline import Pipeline clf = Pipeline([('pca', decomposition.PCA(n_components=150, whiten=True)), ('svm', svm.LinearSVC(C=1.0))]) clf.fit(X_train, y_train) y_pred = clf.predict(X_test) print(metrics.confusion_matrix(y_pred, y_test)) plt.show() .. rst-class:: sphx-glr-script-out .. code-block:: none /opt/buildhome/python3.8/lib/python3.8/site-packages/sklearn/svm/_base.py:1244: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. warnings.warn( [[5 0 0 ... 0 0 0] [0 4 0 ... 0 0 0] [0 0 1 ... 0 0 0] ... [0 0 0 ... 3 0 0] [0 0 0 ... 0 1 0] [0 0 0 ... 0 0 3]] .. GENERATED FROM PYTHON SOURCE LINES 178-187 A Note on Facial Recognition ---------------------------- Here we have used PCA "eigenfaces" as a pre-processing step for facial recognition. The reason we chose this is because PCA is a broadly-applicable technique, which can be useful for a wide array of data types. Research in the field of facial recognition in particular, however, has shown that other more specific feature extraction methods are can be much more effective. .. rst-class:: sphx-glr-timing **Total running time of the script:** ( 1 minutes 43.939 seconds) .. _sphx_glr_download_packages_scikit-learn_auto_examples_plot_eigenfaces.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_eigenfaces.py ` .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_eigenfaces.ipynb ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_