
Have you ever wondered what happens under the hood when your phone instantly unlocks after scanning your face? Face recognition systems combine sophisticated deep learning architectures — particularly Convolutional Neural Networks (CNNs) with Residual Networks (ResNet) — to process and match your facial features against stored representations.
In this post, I’ll lay out the mathematics behind convolutional operations, explain how ResNet’s skip connections stabilize deeper networks, and share a bit of Python code. Along the way, I’ll sprinkle in insights gleaned from coding a face recognition project during the COVID-19 lockdown (GitHub repository). Feel free to reach out if you spot any mistakes or want to chat more about these ideas.
Why Face Recognition?
Face recognition plays a huge role in everything from smartphone security to biometric presence (attendance) systems. Unlike older computer-vision techniques, deep CNN-based methods learn highly discriminative features that remain robust to illumination changes, slight occlusions, and other intra-class variations. Residual Networks (ResNet) push the boundaries further by easing the training of very deep models.
Mathematical Foundations of CNNs
Convolution Operation
Let’s denote our input image as a three-dimensional tensor $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels (e.g. 3 for RGB). A convolution with a kernel $\mathbf{K} \in \mathbb{R}^{k_h \times k_w \times C}$ is defined by:
$$(\mathbf{X} * \mathbf{K})(i,j) = \sum_{p=0}^{k_h-1} \sum_{q=0}^{k_w-1} \sum_{c=0}^{C-1} \mathbf{X}(i+p,j+q,c) \cdot \mathbf{K}(p,q,c)$$
where $(i,j)$ indexes the spatial position in the output feature map. CNNs stack multiple convolutional layers, interspersed with activation functions (e.g., ReLU) and pooling layers, to learn hierarchical representations. Early filters typically learn edge and corner detectors, while deeper layers combine these into more complex structures like eyes, mouth, etc.
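To make the indexing concrete, here is a small NumPy sketch of the sum above. It assumes stride 1 and no zero-padding ("valid" convolution); the input and kernel sizes are arbitrary illustrative choices.

```python
# Minimal sketch of the convolution sum, assuming stride 1 and "valid" padding.
import numpy as np

def conv2d_single(X, K):
    """X: (H, W, C) input, K: (k_h, k_w, C) kernel -> (H-k_h+1, W-k_w+1) feature map."""
    H, W, C = X.shape
    k_h, k_w, _ = K.shape
    out = np.zeros((H - k_h + 1, W - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over the kernel window and all input channels, as in the formula.
            out[i, j] = np.sum(X[i:i + k_h, j:j + k_w, :] * K)
    return out

X = np.random.rand(5, 5, 3)   # toy 5x5 RGB input
K = np.random.rand(3, 3, 3)   # 3x3 kernel spanning all 3 channels
print(conv2d_single(X, K).shape)  # (3, 3)
```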
Non-Linear Activations
Activations such as the Rectified Linear Unit (ReLU) transform the output of each convolution:
$$\text{ReLU}(z) = \max(0,z).$$
This non-linearity allows the network to approximate complex, non-linear mappings far beyond simple matrix multiplications.
Multiple Output Channels
When a layer has $N$ output channels, it maintains $N$ such convolutional filters. Their feature maps are stacked along the channel dimension and passed through the activation:
$$\mathbf{Y} = f(\mathbf{X} * \mathbf{K}_1, \mathbf{X} * \mathbf{K}_2, \ldots, \mathbf{X} * \mathbf{K}_N),$$
where the activation function $f$ is applied element-wise. Stacking more channels and layers is what gives CNNs their representational power.
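The same idea in TensorFlow: stacking $N$ kernels into a single weight tensor yields $N$ output channels. This is a minimal sketch; the input size and the choice of $N = 16$ are arbitrary.

```python
# Sketch: N kernels stacked into one weight tensor give N output channels (ReLU as f).
import tensorflow as tf

X = tf.random.normal([1, 32, 32, 3])   # batch, height, width, channels
K = tf.random.normal([3, 3, 3, 16])    # k_h, k_w, input channels, N output channels
Y = tf.nn.relu(tf.nn.conv2d(X, K, strides=1, padding="SAME"))
print(Y.shape)  # (1, 32, 32, 16)
```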
Residual Networks (ResNet)
Proposed in 2015, Residual Networks address the vanishing gradient problem by introducing shortcut (skip) connections (also referred to as “residual connections”). A typical ResNet “building block” can be represented mathematically as:
$$\mathbf{X}_{\text{out}} = \mathbf{X} + F(\mathbf{X}),$$
where $\mathbf{X}$ is the block's input and $F$ is typically a composition of two or three convolutions with batch normalization and ReLU. The identity shortcut (the $\mathbf{X}$ term) makes it easier for gradients to flow from deeper layers to shallower ones, stabilizing training and enabling much deeper networks.
Formally, suppose a “basic block” applies transformations $W_1$ and $W_2$ (each representing weights of convolution kernels) with ReLU and batch normalization in between:
$$F(\mathbf{X}) = W_2\sigma(\text{BN}(W_1\mathbf{X})),$$
where $\sigma$ is the ReLU function and BN denotes batch normalization. The output is then:
$$\mathbf{X}_{\text{out}} = \mathbf{X} + F(\mathbf{X}).$$
In face recognition, ResNet-50-based models have demonstrated strong performance even in challenging settings such as masked faces, small training sets, and presence (attendance) systems.
Face Recognition Pipeline
Data Preprocessing
Face Detection: Use a face detector (e.g., Haar cascades, MTCNN, HOG-based, or CNN-based detectors) to localize faces in an image (see the sketch after this list).
Alignment (Optional): Align faces to a canonical orientation by detecting key facial landmarks (eyes, nose, mouth).
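As a rough sketch of the detection step, here is how it might look with OpenCV's bundled Haar cascade; the image path and detector parameters below are illustrative placeholders.

```python
# Rough sketch: detect and crop faces with OpenCV's bundled frontal-face Haar cascade.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
image = cv2.imread("face.jpg")  # illustrative path; replace with a real image
if image is not None:
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = [image[y:y + h, x:x + w] for (x, y, w, h) in faces]  # cropped face regions
```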
Feature Extraction and Embeddings
After alignment, each cropped face is passed through a ResNet or other CNN. The output of a deep layer (often the penultimate layer) acts as a compact embedding $\mathbf{z} \in \mathbb{R}^d$. Euclidean distance or cosine similarity between embeddings measures face similarity.
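A quick sketch of how two such embeddings might be compared; the 128-dimensional size is an illustrative choice (common embedding sizes range from 128 to 512).

```python
# Sketch: comparing two face embeddings assumed to come from the CNN's penultimate layer.
import numpy as np

def cosine_similarity(z1, z2):
    return float(np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2)))

def euclidean_distance(z1, z2):
    return float(np.linalg.norm(z1 - z2))

z_a, z_b = np.random.rand(128), np.random.rand(128)  # illustrative 128-d embeddings
print(cosine_similarity(z_a, z_b), euclidean_distance(z_a, z_b))
```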
Classification or Metric Learning
If your goal is to identify $K$ known individuals, you can place a dense layer on top of the CNN, producing a $K$-dimensional logit vector $\mathbf{z}$. For classification, apply the softmax function:
$$\hat{y}_i = \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \ldots, K,$$
and train using cross-entropy loss:
$$\mathcal{L} = -\sum_{i=1}^K y_i \log(\hat{y}_i),$$
where $\mathbf{y} \in \{0,1\}^K$ is the one-hot ground-truth vector.
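A tiny NumPy sketch of this softmax/cross-entropy pair, with illustrative logits for $K = 3$ identities:

```python
# Sketch: softmax over K logits and cross-entropy against a one-hot label.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_hat):
    return -float(np.sum(y_true * np.log(y_hat + 1e-12)))

z = np.array([2.0, 0.5, -1.0])  # logits for K = 3 identities (illustrative values)
y = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
print(cross_entropy(y, softmax(z)))
```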
For open-set recognition or verification tasks, you might use a metric learning approach (e.g., triplet loss, Siamese networks) that compares embeddings directly rather than performing classification.
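As a sketch of the metric-learning idea, here is a plain NumPy version of the triplet loss; the margin of 0.2 and the embedding size are illustrative choices.

```python
# Sketch: triplet loss on embeddings; margin and embedding size are illustrative.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pushing the anchor-positive distance below the anchor-negative one."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

a, p, n = (np.random.rand(128) for _ in range(3))  # anchor, positive, negative embeddings
print(triplet_loss(a, p, n))
```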
Python Code Example
Below is an illustrative snippet implementing a simple ResNet-like architecture for face recognition. Note that frameworks such as TensorFlow and PyTorch already provide ResNet50 out of the box; here, we build a smaller toy version for demonstration:
```python
import tensorflow as tf
```
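Here is a minimal sketch of how such a toy model might look in Keras; the filter counts, block depth, input size, and the random training data are illustrative choices for demonstration only.

```python
# Minimal sketch of a toy ResNet-like face classifier (all sizes are illustrative).
import numpy as np
from tensorflow.keras import layers, Model

def residual_block(x, filters):
    """Two 3x3 convolutions with batch norm, plus an identity (or projected) shortcut."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:  # match channel counts before adding
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def build_resnet(input_shape=(64, 64, 3), num_classes=10):
    """Stack residual blocks, then global average pooling and a dense softmax classifier."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    for filters in (32, 64, 128):
        x = residual_block(x, filters)
        x = layers.MaxPooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)

def train_face_recognition_model():
    """Train on random data purely to show usage; swap in real face crops and labels."""
    model = build_resnet()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    x_dummy = np.random.rand(32, 64, 64, 3).astype("float32")   # dummy "face" images
    y_dummy = np.random.randint(0, 10, size=(32,))               # dummy identity labels
    model.fit(x_dummy, y_dummy, epochs=1, batch_size=8)
    return model

if __name__ == "__main__":
    train_face_recognition_model()
```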
Explanation
residual_block defines the convolutional + skip-connection logic.
build_resnet stacks multiple residual blocks, culminating in global average pooling and a dense classification layer.
train_face_recognition_model includes random input to show usage. For real-world face recognition, you’d replace the dummy data with actual face images.
A Note on Masked Faces
The COVID-19 pandemic introduced a new twist: mask-wearing. CNNs trained solely on full-face images often struggle with such occlusions. Solutions include fine-tuning your model on masked-face data or modifying the architecture to emphasize the visible features around the eyes and forehead; simply incorporating masked faces into the training set already mitigates much of the performance loss.
Conclusion
I hope this deep, math-heavy excursion into face recognition with CNN/ResNet gave you a better grasp of how everything fits together. If you have any questions or suggestions, please reach out. Enjoy experimenting with your own projects!