Cell. 2017 Jun 1;169(6):1013-1028.e14. doi: 10.1016/j.cell.2017.05.011.

The Code for Facial Identity in the Primate Brain

Le Chang et al. Cell. 2017.

Abstract

Primates recognize complex objects such as faces with remarkable speed and reliability. Here, we reveal the brain's code for facial identity. Experiments in macaques demonstrate an extraordinarily simple transformation between faces and responses of cells in face patches. By formatting faces as points in a high-dimensional linear space, we discovered that each face cell's firing rate is proportional to the projection of an incoming face stimulus onto a single axis in this space, allowing a face cell ensemble to encode the location of any face in the space. Using this code, we could precisely decode faces from neural population responses and predict neural firing rates to faces. Furthermore, this code disavows the long-standing assumption that face cells encode specific facial identities, confirmed by engineering faces with drastically different appearance that elicited identical responses in single face cells. Our work suggests that other objects could be encoded by analogous metric coordinate systems.
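
To make the coding scheme above concrete, here is a minimal numerical sketch (illustrative only; not the authors' code, and all names and parameter values are hypothetical) in which each cell's firing rate is a linear ramp of the projection of a 50-d face-feature vector onto that cell's preferred axis, so the face can be linearly decoded from the population response:

    import numpy as np

    rng = np.random.default_rng(0)
    n_cells, n_dims = 200, 50

    # One preferred axis per cell (unit vectors in the 50-d face space).
    axes = rng.standard_normal((n_cells, n_dims))
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)
    baseline, gain = 10.0, 5.0   # arbitrary firing-rate offset and slope

    def population_response(face_features):
        # Each cell's rate is proportional to the projection of the face
        # onto that cell's preferred axis (the "axis code").
        return baseline + gain * axes @ face_features

    face = rng.standard_normal(n_dims)          # a face as a point in face space
    rates = population_response(face)

    # Because the code is linear, the face can be recovered from the population
    # response by least squares: solve axes @ x = (rates - baseline) / gain.
    decoded, *_ = np.linalg.lstsq(axes, (rates - baseline) / gain, rcond=None)
    print(np.allclose(decoded, face))           # True in this noise-free toy example
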

Keywords: decoding; electrophysiology; face processing; inferior temporal cortex; primate vision.

Figures

Figure 1. Complementary representation of facial features by AM and ML/MF populations.
A-C, Generation of parameterized face stimuli. A, 58 landmark points were labeled on 200 facial images from a face database (FEI face database; example image shown on left). The positions of these landmarks carry shape information about each facial image (middle). The landmarks were smoothly morphed to match the average landmark positions across the 200 faces, generating an image carrying shape-free appearance information about each face (right). B, Principal component analysis was performed to extract the feature dimensions that account for the largest variability in the database. The first principal components for shape (B1) and appearance (B2) are shown. C, Example face stimuli generated by randomly drawing from a face space constructed from the first 25 shape PCs and the first 25 appearance PCs. D, Spike-triggered average (STA) of a face-selective neuron from anterior face patch AM. The first 25 points represent shape dimensions, and the next 25 represent appearance dimensions. The facial image corresponding to the STA is shown in the inset. E, Vector length of the STA for the 25 appearance dimensions plotted against that for the 25 shape dimensions, for all cells recorded from middle face patches ML/MF (blue) and anterior face patch AM (red). F, Distribution of shape preference indices, quantified as the contrast between the vector lengths of the shape and appearance components of the STA, for ML/MF and AM cells. Arrows indicate the average of each population (p = 10^-25, Student's t-test). G, Reliability of the estimated population average of the shape preference index for ML/MF and AM (n = 200 iterations of random sampling with replacement). Dashed lines indicate 95% confidence intervals. H, Number of significantly tuned cells (p < 0.01 by shift predictor, see STAR Methods) for each of the 50 dimensions, for ML/MF and AM cells in both monkeys. I, Response of an AM neuron plotted against the distance between the stimulus and the average face along the STA axis. Error bars represent s.e. J, Responses of ML/MF (left) and AM (right) neurons as a function of distance along the STA dimension. The abscissa is rescaled so that the range [−1, 1] covers 98% of the stimuli. K, Neuronal responses as a function of feature value for the 1st shape dimension (top) and 1st appearance dimension (bottom) for all significant cells (p < 0.01 by shift predictor).
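
As a rough illustration of the spike-triggered average and shape-preference analyses in panels D-F, the following sketch (hypothetical helper names and normalization; not the authors' code) computes a response-weighted average of the 50-d feature vectors and contrasts the vector lengths of its shape and appearance halves:

    import numpy as np

    def spike_triggered_average(features, rates):
        # features: (n_stimuli, 50) shape + appearance parameters (assumed z-scored);
        # rates: (n_stimuli,) mean firing rates. Returns the response-weighted
        # average stimulus, i.e. an STA-like vector in face space.
        weights = rates - rates.mean()
        return weights @ features / np.abs(weights).sum()

    def shape_preference_index(sta, n_shape=25):
        # Contrast between the vector length of the shape half (first 25 dims)
        # and the appearance half (last 25 dims) of the STA.
        shape_len = np.linalg.norm(sta[:n_shape])
        app_len = np.linalg.norm(sta[n_shape:])
        return (shape_len - app_len) / (shape_len + app_len)
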
Figure 2. Decoding facial features using linear regression.
A, Diagram illustrating the decoding model. To construct and test the model, we used responses of AM (n = 99) and ML/MF (n = 106) cells to 2000 faces. Population responses to 1999 faces were used to determine the transformation from responses to feature values by linear regression, and the feature values of the remaining image were then predicted. B, Model predictions using AM data plotted against actual feature values for the first appearance dimension (26th dimension). C, Percentage of explained variance for all 50 dimensions using linear regression based on responses of three different neuronal populations: 106 ML/MF cells (blue); 99 AM cells (red); 205 cells combined (black). D, Decoding accuracy as a function of the number of faces randomly drawn from the stimulus set, for three different models (see STAR Methods). For each model, different sets of features were first linearly decoded from population responses, and Euclidean distances between decoded and actual features in each feature space were then computed to determine decoding accuracy. The three sets of features are: the 50-d features of the active appearance model; the 25-d shape features; the 25-d appearance features. Shaded region indicates s.d. estimated by bootstrapping.
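
A bare-bones sketch of the leave-one-out linear decoding described in panel A (a hypothetical implementation; the paper's actual pipeline is described in its STAR Methods):

    import numpy as np

    def decode_features_loo(responses, features):
        # responses: (n_faces, n_cells) population responses;
        # features: (n_faces, 50) shape + appearance parameters.
        # For each face, fit a linear map from responses to features on the
        # remaining faces, then predict the held-out face's features.
        n_faces = responses.shape[0]
        X = np.column_stack([responses, np.ones(n_faces)])   # add intercept term
        predicted = np.zeros_like(features)
        for i in range(n_faces):
            mask = np.arange(n_faces) != i
            beta, *_ = np.linalg.lstsq(X[mask], features[mask], rcond=None)
            predicted[i] = X[i] @ beta
        return predicted
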
Figure 3. Reconstruction of facial images using linear regression.
A, Using the facial features decoded by linear regression in Figure 2, facial images could be reconstructed. Faces predicted from the three neuronal populations are shown alongside the corresponding actual stimuli presented in the experiment. B, Decoding accuracy as a function of the number of faces, using a Euclidean distance model (black solid line). Decoding accuracy based on two alternative models, nearest neighbor in the space of population responses (gray dashed line, see STAR Methods) and the average of the 50 nearest neighbors (gray solid line), was much lower. The black dashed line represents chance level. Results based on the three neuronal populations are shown separately (black solid lines for ML/MF and AM are the same as the black solid lines for the corresponding patches in Figure 2D, except that here they are shown without the variability estimated by bootstrapping). In the left panel, boxes and error bars represent mean and s.e.m. of subjective (human-based) decoding accuracy from 78 human participants (see STAR Methods: human psychophysics). C, Decoding accuracy for 40 faces plotted against the number of cells randomly drawn from the three populations (black: all; blue: ML/MF; red: AM). Error bars represent s.d.
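
The decoding-accuracy measure used in Figures 2D and 3B can be sketched as follows (an assumed reading of the Euclidean distance model; the candidate-set size and trial count are illustrative):

    import numpy as np

    def identification_accuracy(decoded, actual, n_candidates=40, n_trials=1000, seed=0):
        # decoded, actual: (n_faces, 50) decoded and true feature vectors.
        # On each trial, draw n_candidates faces at random; the target is scored
        # correct if its true features are the nearest Euclidean neighbor of its
        # decoded features among the candidates.
        rng = np.random.default_rng(seed)
        n_faces = actual.shape[0]
        correct = 0
        for _ in range(n_trials):
            idx = rng.choice(n_faces, size=n_candidates, replace=False)
            target = idx[0]
            dists = np.linalg.norm(actual[idx] - decoded[target], axis=1)
            correct += idx[np.argmin(dists)] == target
        return correct / n_trials
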
Figure 4. AM neurons display almost flat tuning along axes orthogonal to the STA in face space.
A, For each neuron in AM, the STA was first computed; 2000 random axes were then selected and orthogonalized to the STA in the 25-d space of appearance features. Tuning functions along the 300 axes accounting for the largest variability in the stimuli were averaged and fitted with a Gaussian function, a·exp(−x²/(2σ²)) + c. The center of the fit (a + c) was used to normalize the average tuning function. Red dots and error bars represent mean and s.d. of the population. B, Same as A, but for two control models. B1, Each simulated cell corresponds to one of the 200 real faces projected onto the 25-d face space of appearance features (exemplar face), and its response to an arbitrary face is a decreasing linear function of the Euclidean distance between that face and the exemplar in the 25-d feature space. B2, Each simulated cell corresponds to 81 transforms of a single identity (9 views × 9 positions). For a given image, the similarity of the image to each of the transforms (defined as a decreasing linear function of the pixel-level distance between the two images) was computed, and the maximum value across all 81 transforms was taken as the response of the cell. For fair comparison, each model cell was matched to one of the AM neurons in noise level and sparseness (for details see STAR Methods; a comparison of sparseness between neurons and models is shown in the inset). C, Responses of an AM neuron to 25 parameterized faces. Firing rate was averaged in 25 ms bins. The three stimuli evoking the strongest responses are shown on the right. D, Responses of the cell in (C) to different faces are color coded and plotted in the 2-D space spanned by the STA axis and the axis orthogonal to the STA in the appearance feature space accounting for the largest variability in the features. Arrows indicate the three faces in (C). E, Same as D, but for a non-sparse AM cell. F, For each cell in AM or the two models, tuning along orthogonal axes was first fitted with a Gaussian function, and the ratio between the fit at 0.67, i.e., a·exp(−0.67²/(2σ²)) + c, and the center, a + c, was computed and plotted against the sparseness of the cell. Cells in each population were further divided into three groups according to sparseness, defined as (Σ_i R_i/N)² / (Σ_i R_i²/N), where the sum runs over the N stimuli. Solid and open circles indicate data from two different monkeys. Boxes and error bars represent mean and s.e. of each subgroup. The difference between AM neurons and the two models was significant for all three sparseness levels (p < 0.001, Student's t-test). G, Two models were used to fit face cells' responses to the parameterized face stimuli: 1) an “axis” model, in which every face was projected onto an axis in the 50-d face space; 2) an “exemplar” model, in which the distance from each of the 2000 faces to an exemplar face was computed (the length of the exemplar face in the 50-d space was restricted to be smaller than twice the average length of real faces). The projection or distance was then passed through a nonlinearity (a third-order polynomial) to generate a predicted response. Each parameter of the model was adjusted using gradient descent to minimize the Euclidean distance between predicted and actual firing rates. To obtain high-quality responses, we repeated 100 faces more frequently than the remaining 1900 faces and used responses to the 100 faces to validate the model derived from the 1900 faces. H, Predicted versus actual responses for one example cell using the axis model. The model explained 68% of the variance in the responses. I, Comparison of fit quality for the two models across 32 cells.
The axis model provides significantly better fits to actual responses (mean = 56.9%) than the exemplar model (mean = 41.7%; p < 0.001, paired t-test). J, Trials of the responses to the 100 stimuli were randomly split into two halves, and the average response across one half of the trials was used to predict that of the other half. Percentage of variance explained, after Spearman-Brown correction (mean = 71.1%), is plotted against that of the axis model. K, A convolutional neural network was trained to perform view-invariant face identification (Figure S7). 52 units were randomly selected from the 500 units in the final layer of the CNN and used to linearly fit the responses of face cells. Mean explained variance across 100 repetitions of random sampling is plotted against that of the axis model. The fit quality of the CNN units was much lower (mean = 30.2%, p < 0.01) than that of the axis model. Using more units leads to overfitting, further reducing the cross-validated explained variance (to 26.5% for 100 units and 17.7% for 200 units). L, Neuronal responses were fitted by a different “axis” model using “Eigenface” features as the dimensions of the face space (Figure S2G, see STAR Methods). PCA was performed on the original image intensities of the 2000 faces, and the first 50 PCs were treated as the input to the axis model. The fitting procedure was the same as in (G). The fit quality of the “Eigenface” model was much lower (mean = 29.9%, p < 0.001) than that of the axis model. Using 100 PCs slightly increased the fit quality (mean = 31.1%), while using 200 PCs led to overfitting (mean = 22.8%).
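
A compact stand-in for the axis-model fit described in panel G (not the authors' code: the optimizer, initialization, and absence of regularization here are assumptions; the paper used gradient descent on a third-order polynomial of the projection):

    import numpy as np
    from scipy.optimize import minimize

    def fit_axis_model(features, rates, seed=0):
        # features: (n_faces, 50) face parameters; rates: (n_faces,) firing rates.
        # Model: r ≈ c0 + c1*p + c2*p**2 + c3*p**3, where p = features @ w.
        n_dims = features.shape[1]

        def loss(params):
            w, c = params[:n_dims], params[n_dims:]
            p = features @ w
            pred = c[0] + c[1] * p + c[2] * p ** 2 + c[3] * p ** 3
            return np.mean((pred - rates) ** 2)

        rng = np.random.default_rng(seed)
        x0 = np.concatenate([0.1 * rng.standard_normal(n_dims),
                             [rates.mean(), 1.0, 0.0, 0.0]])
        res = minimize(loss, x0, method="L-BFGS-B")
        return res.x[:n_dims], res.x[n_dims:]   # fitted axis and polynomial coefficients

Explained variance on held-out faces would then be computed as 1 minus the ratio of residual to total variance of the held-out responses.
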
Figure 5. Responses of AM cells to faces specifically engineered for each cell confirm the axis model.
A, Experimental procedure. After recording the responses of a face cell to the 2000 parameterized face stimuli, the STA axis and the principal orthogonal axis were extracted. Facial features were evenly sampled along each axis, and a facial image was synthesized for each pair of feature values (see STAR Methods). The synthesized images were presented to the monkey, and responses of the same face cell were recorded. B, Responses of an AM cell to 144 faces evenly sampled from the 2-d space spanned by the STA axis and the principal orthogonal axis (cf. Figure 4D and E), synthesized specifically for this cell, are color coded and plotted (Figure 5A shows a subset of the faces presented to this cell, spanning [−1.2, −0.6, 0, 0.6, 1.2] × [−1.2, −0.6, 0, 0.6, 1.2]). C, Responses of the cell in (B) plotted against distance along the STA axis and two orthogonal axes. D, Responses of four more example cells are color coded and plotted. Faces at (−1,−1), (−1,0), (−1,1), (0,−1), (0,1), (1,1), (1,0), and (1,−1) are shown on the periphery. The face at (0,0) is the same for all cells and is shown in Figure 5A. E1, Responses of 22 cells plotted against the corresponding STA axes (red) and principal orthogonal axes (black). For each cell, the average response to the 144 images was normalized to 1. E2, The standard deviations of the projected responses along the orthogonal axes (black in E1) compared to those along the STA axes (red in E1), with the latter normalized to 1. On average, the tuning along the orthogonal axes is 8.5% of the tuning along the STA axis and significantly smaller than 1 (one-sample t-test, p = 2 × 10^-34). Boxes and error bars represent mean and s.e.m.
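
The 2-d sampling of synthetic faces described in panel A can be sketched as follows (a hypothetical helper; the actual synthesis of images from these feature vectors is described in the STAR Methods):

    import numpy as np

    def grid_in_sta_plane(sta, orth_axis, n=12, extent=1.2):
        # Evenly sample an n x n grid of feature vectors in the plane spanned by the
        # STA axis and the principal orthogonal axis, covering [-extent, extent] on each.
        sta = sta / np.linalg.norm(sta)
        orth_axis = orth_axis - (orth_axis @ sta) * sta      # enforce orthogonality to STA
        orth_axis = orth_axis / np.linalg.norm(orth_axis)
        coords = np.linspace(-extent, extent, n)
        return np.array([a * sta + b * orth_axis for a in coords for b in coords])

With n = 12 this yields 144 feature vectors, matching the 144 synthesized faces in panel B.
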
Figure 6. The axis coding model is view tolerant.
A, To explore how our model could be extended to views other than frontal, parameterized right-profile faces were generated whose main dimensions were conjugate to those of the 2000 frontal faces (see STAR Methods). The first PCs for the shape features and appearance features of the profile face space are shown (for more PCs see Figure S2E). B, Spike-triggered average computed using the 2000 frontal stimuli (black) and the 2000 profile stimuli (red) for an example cell in AM. C, Correlation between the profile STA and the frontal STA on a single feature dimension (1st appearance dimension) across n = 46 AM neurons. Solid and open circles indicate data from two different monkeys. D, STA correlation across cells for all 50 dimensions. Shaded regions indicate 99% confidence intervals of randomly shuffled data. E, Response of the neuron in (B) plotted against distance between the stimuli and the average face along the STA axis, for frontal and profile stimuli. The distance was rescaled so that the STA corresponds to 1. Error bars represent s.e. F, Relationship between face-view invariance and feature preference for frontal images across cells. Face-view invariance was quantified as the correlation between frontal and profile STAs across the 50 dimensions for each neuron. The black line indicates a linear fit of the data. G, For the set of 4000 faces, comprising 2000 frontal and 2000 profile faces, we trained a linear model to predict the features on individual dimensions from the population responses of 46 AM cells. Explained variances for all 50 feature dimensions are plotted separately for frontal and profile faces. H, Four reconstructions of profile faces based on the predicted features are shown alongside the corresponding faces presented to the monkey. I, Decoding accuracy as a function of the number of faces (solid lines, similar to Figure 3B), using the model in (G), shown separately for frontal and profile faces (either a frontal or a profile face had to be identified from a number of faces of mixed views).
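
The per-cell view-invariance index used in panels D and F can be sketched as a correlation of the two STAs across feature dimensions (a hypothetical helper, assuming the frontal and profile STAs are already computed):

    import numpy as np

    def view_invariance_index(frontal_sta, profile_sta):
        # frontal_sta, profile_sta: (n_cells, 50) STAs from the frontal and profile
        # stimulus sets. Returns, for each cell, the Pearson correlation of its two
        # STAs across the 50 feature dimensions.
        return np.array([np.corrcoef(f, p)[0, 1]
                         for f, p in zip(frontal_sta, profile_sta)])
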
Figure 7. An axis-metric representation is more flexible, efficient, and robust for face identification.
A, An axis metric can perform as well as a distance metric on an identification task for a high-dimensional representation, but not for a low-dimensional one. A1, For an identification task, a linear classifier is usually non-optimal in a low-dimensional space (upper): e.g., it is impossible to linearly separate the red dot from the black dots in the same plane, whereas a circular decision boundary defined by the distance to the red dot performs the task easily. If the representation of these dots were high-dimensional, however, it would be much easier to separate the dots (lower). A2, We explored how axis and distance metrics (upper) defined on feature spaces of variable dimensionality perform in a face identification task. A simple network model (lower) was trained to identify one of 200 faces based on 200 units whose tuning was defined by a distance metric or an axis metric on a feature space of variable dimensionality. The exemplars/axes of these input units (red dot in the upper left, red dashed line in the upper right) were defined by exactly the same 200 faces to be identified. A3, Identification error rate plotted against dimensionality for both models. B, An axis metric is more efficient than a distance metric. B1, Assume model neurons that are tuned in a high-dimensional feature space in proportion to the distance to a fixed point. Dimensionality reduction on a population of such units using principal component analysis reveals that the main PCs are almost linearly tuned in the space (for quantification see Figure S6B). B2 and B3, Same as A2 and A3, but using 10 input units. For high dimensionality, the axis metric outperformed the distance metric. C, Axis metrics are more distributed and more robust to noise than distance metrics. C1, Weight matrices of the networks in B after training, using units defined by an axis metric or a distance metric (white indicates large weights). Weights for the distance metric lie mostly on the diagonal, while those for the axis metric are more distributed. C2, Distributed inputs can help average out independent noise across signals, resulting in a high signal-to-noise ratio (upper). The same network as in A2 was trained with noisy inputs (lower). C3, Error rates of both models plotted against dimensionality. D, The linear relationship between neuronal responses and facial features ensures that diverse tasks can be performed. The gray disks indicate face space.
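
A toy version of the comparison in panels A2-A3 (an illustrative reconstruction, not the paper's network: the least-squares readout, noise level, and face statistics are assumptions):

    import numpy as np

    def make_units(faces, metric):
        # Each of the faces defines one input unit: either an axis unit (response =
        # projection onto that face) or an exemplar unit (response = decreasing
        # function of Euclidean distance to that face).
        if metric == "axis":
            return faces @ faces.T
        d = np.linalg.norm(faces[:, None, :] - faces[None, :, :], axis=2)
        return d.max() - d

    def identification_error(n_dims, metric, n_faces=200, noise=0.5, seed=0):
        # Train a least-squares linear readout to identify each face from the unit
        # responses, then test on responses corrupted by independent noise.
        rng = np.random.default_rng(seed)
        faces = rng.standard_normal((n_faces, n_dims))
        R = make_units(faces, metric)
        W, *_ = np.linalg.lstsq(R, np.eye(n_faces), rcond=None)
        noisy = R + noise * R.std() * rng.standard_normal(R.shape)
        pred = np.argmax(noisy @ W, axis=1)
        return np.mean(pred != np.arange(n_faces))
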

Comment in

  • How Do We Recognize a Face?
    Quian Quiroga R. Cell. 2017 Jun 1;169(6):975-977. doi: 10.1016/j.cell.2017.05.012. PMID: 28575674
  • Neural coding: Face values.
    Bray N. Nat Rev Neurosci. 2017 Aug;18(8):456. doi: 10.1038/nrn.2017.81. Epub 2017 Jun 22. PMID: 28638121
  • Reading Faces: From Features to Recognition.
    Guntupalli JS, Gobbini MI. Trends Cogn Sci. 2017 Dec;21(12):915-916. doi: 10.1016/j.tics.2017.09.007. Epub 2017 Sep 19. PMID: 28939331
  • Commentary: The Code for Facial Identity in the Primate Brain.
    Rossion B, Taubert J. Front Hum Neurosci. 2017 Nov 14;11:550. doi: 10.3389/fnhum.2017.00550. eCollection 2017. PMID: 29184489
