
Have you ever wondered what happens under the hood when your phone instantly unlocks after scanning your face? Face recognition systems combine sophisticated deep learning architectures — particularly Convolutional Neural Networks (CNNs) with Residual Networks (ResNet) — to process and match your facial features against stored representations.
In this post, I’ll lay out the mathematics behind convolutional operations, explain how ResNet’s skip connections stabilize deeper networks, and share a bit of Python code. Along the way, I’ll sprinkle in insights gleaned from coding a face recognition project during the COVID-19 lockdown (GitHub repository). Feel free to reach out if you spot any mistakes or want to chat more about these ideas.
Why Face Recognition?
Face recognition plays a huge role in everything from smartphone security to biometric presence (attendance) systems. Unlike older computer-vision techniques, deep CNN-based methods learn highly discriminative features that remain robust to illumination changes, slight occlusions, and other intra-class variations. Residual Networks (ResNet) push the boundaries further by easing the training of very deep models.
Mathematical Foundations of CNNs
Convolution Operation
Let’s denote our input image as a three-dimensional tensor $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels (e.g. 3 for RGB). A convolution with a kernel $\mathbf{K} \in \mathbb{R}^{k_h \times k_w \times C}$ is defined by:
$$(\mathbf{X} * \mathbf{K})(i,j) = \sum_{p=0}^{k_h-1} \sum_{q=0}^{k_w-1} \sum_{c=0}^{C-1} \mathbf{X}(i+p,j+q,c) \cdot \mathbf{K}(p,q,c)$$
where $(i,j)$ indexes the spatial position in the output feature map. CNNs stack multiple convolutional layers, interspersed with activation functions (e.g., ReLU) and pooling layers, to learn hierarchical representations. Early filters typically learn edge and corner detectors, while deeper layers combine these into more complex structures like eyes, mouth, etc.
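To make the indexing concrete, here is a small NumPy sketch of the sum above. It assumes stride 1 and no zero-padding ("valid" convolution); the input and kernel sizes are arbitrary illustrative choices.

```python
# Minimal sketch of the convolution sum, assuming stride 1 and "valid" padding.
import numpy as np

def conv2d_single(X, K):
    """X: (H, W, C) input, K: (k_h, k_w, C) kernel -> (H-k_h+1, W-k_w+1) feature map."""
    H, W, C = X.shape
    k_h, k_w, _ = K.shape
    out = np.zeros((H - k_h + 1, W - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over the kernel window and all input channels, as in the formula.
            out[i, j] = np.sum(X[i:i + k_h, j:j + k_w, :] * K)
    return out

X = np.random.rand(5, 5, 3)   # toy 5x5 RGB input
K = np.random.rand(3, 3, 3)   # 3x3 kernel spanning all 3 channels
print(conv2d_single(X, K).shape)  # (3, 3)
```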
Non-Linear Activations
Activations such as the Rectified Linear Unit (ReLU) transform the output of each convolution:
$$\text{ReLU}(z) = \max(0,z).$$
This non-linearity allows the network to approximate complex, non-linear mappings far beyond simple matrix multiplications.
Multiple Output Channels
When a layer has $N$ output channels, it maintains $N$ such convolutional filters. Their feature maps are stacked along the channel dimension and passed through the activation:
$$\mathbf{Y} = f(\mathbf{X} * \mathbf{K}_1, \mathbf{X} * \mathbf{K}_2, \ldots, \mathbf{X} * \mathbf{K}_N),$$
where the activation function $f$ is applied element-wise. Stacking more channels and layers is what gives CNNs their representational power.
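The same idea in TensorFlow: stacking $N$ kernels into a single weight tensor yields $N$ output channels. This is a minimal sketch; the input size and the choice of $N = 16$ are arbitrary.

```python
# Sketch: N kernels stacked into one weight tensor give N output channels (ReLU as f).
import tensorflow as tf

X = tf.random.normal([1, 32, 32, 3])   # batch, height, width, channels
K = tf.random.normal([3, 3, 3, 16])    # k_h, k_w, input channels, N output channels
Y = tf.nn.relu(tf.nn.conv2d(X, K, strides=1, padding="SAME"))
print(Y.shape)  # (1, 32, 32, 16)
```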
Residual Networks (ResNet)
Proposed in 2015, Residual Networks address the vanishing gradient problem by introducing shortcut (skip) connections (also referred to as “residual connections”). A typical ResNet “building block” can be represented mathematically as:
$$\mathbf{X}_{\text{out}} = \mathbf{X} + F(\mathbf{X}),$$
where $\mathbf{X}$ is the block's input and $F$ is typically a composition of two or three convolutions with batch normalization and ReLU. The identity shortcut (the $\mathbf{X}$ term) makes it easier for gradients to flow from deeper layers to shallower ones, stabilizing training and enabling much deeper networks.
Formally, suppose a “basic block” applies transformations $W_1$ and $W_2$ (each representing weights of convolution kernels) with ReLU and batch normalization in between:
$$F(\mathbf{X}) = W_2\sigma(\text{BN}(W_1\mathbf{X})),$$
where $\sigma$ is the ReLU function and BN denotes batch normalization. The output is then:
$$\mathbf{X}_{\text{out}} = \mathbf{X} + F(\mathbf{X}).$$
In face recognition, ResNet-50-based models have demonstrated strong performance even in challenging settings such as masked faces, small training sets, and presence (attendance) systems.
Face Recognition Pipeline
Data Preprocessing
Face Detection: Use a face detector (e.g., Haar cascades, MTCNN, HOG-based, or CNN-based detectors) to localize faces in an image (see the sketch after this list).
Alignment (Optional): Align faces to a canonical orientation by detecting key facial landmarks (eyes, nose, mouth).
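As a rough sketch of the detection step, here is how it might look with OpenCV's bundled Haar cascade; the image path and detector parameters below are illustrative placeholders.

```python
# Rough sketch: detect and crop faces with OpenCV's bundled frontal-face Haar cascade.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
image = cv2.imread("face.jpg")  # illustrative path; replace with a real image
if image is not None:
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = [image[y:y + h, x:x + w] for (x, y, w, h) in faces]  # cropped face regions
```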
Feature Extraction and Embeddings
After alignment, each cropped face is passed through a ResNet or other CNN. The output of a deep layer (often the penultimate layer) acts as a compact embedding $\mathbf{z} \in \mathbb{R}^d$. Euclidean distance or cosine similarity between embeddings measures face similarity.
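A quick sketch of how two such embeddings might be compared; the 128-dimensional size is an illustrative choice (common embedding sizes range from 128 to 512).

```python
# Sketch: comparing two face embeddings assumed to come from the CNN's penultimate layer.
import numpy as np

def cosine_similarity(z1, z2):
    return float(np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2)))

def euclidean_distance(z1, z2):
    return float(np.linalg.norm(z1 - z2))

z_a, z_b = np.random.rand(128), np.random.rand(128)  # illustrative 128-d embeddings
print(cosine_similarity(z_a, z_b), euclidean_distance(z_a, z_b))
```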
Classification or Metric Learning
If your goal is to identify $K$ known individuals, you can place a dense layer on top of the CNN, producing a $K$-dimensional logit vector $\mathbf{z}$. For classification, apply the softmax function:
$$\hat{y}_i = \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \ldots, K,$$
and train using cross-entropy loss:
$$\mathcal{L} = -\sum_{i=1}^K y_i \log(\hat{y}_i),$$
where $\mathbf{y} \in \{0,1\}^K$ is the one-hot ground-truth vector.
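A tiny NumPy sketch of this softmax/cross-entropy pair, with illustrative logits for $K = 3$ identities:

```python
# Sketch: softmax over K logits and cross-entropy against a one-hot label.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_hat):
    return -float(np.sum(y_true * np.log(y_hat + 1e-12)))

z = np.array([2.0, 0.5, -1.0])  # logits for K = 3 identities (illustrative values)
y = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
print(cross_entropy(y, softmax(z)))
```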
For open-set recognition or verification tasks, you might use a metric learning approach (e.g., triplet loss, Siamese networks) that compares embeddings directly rather than performing classification.
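As a sketch of the metric-learning idea, here is a plain NumPy version of the triplet loss; the margin of 0.2 and the embedding size are illustrative choices.

```python
# Sketch: triplet loss on embeddings; margin and embedding size are illustrative.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pushing the anchor-positive distance below the anchor-negative one."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

a, p, n = (np.random.rand(128) for _ in range(3))  # anchor, positive, negative embeddings
print(triplet_loss(a, p, n))
```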
Python Code Example
Below is an illustrative snippet implementing a simple ResNet-like architecture for face recognition. Note that frameworks such as TensorFlow and PyTorch already provide ResNet50 out of the box; here, we build a smaller toy version for demonstration:
```python
import tensorflow as tf
```
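Here is a minimal sketch of how such a toy model might look in Keras; the filter counts, block depth, input size, and the random training data are illustrative choices for demonstration only.

```python
# Minimal sketch of a toy ResNet-like face classifier (all sizes are illustrative).
import numpy as np
from tensorflow.keras import layers, Model

def residual_block(x, filters):
    """Two 3x3 convolutions with batch norm, plus an identity (or projected) shortcut."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:  # match channel counts before adding
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def build_resnet(input_shape=(64, 64, 3), num_classes=10):
    """Stack residual blocks, then global average pooling and a dense softmax classifier."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    for filters in (32, 64, 128):
        x = residual_block(x, filters)
        x = layers.MaxPooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs)

def train_face_recognition_model():
    """Train on random data purely to show usage; swap in real face crops and labels."""
    model = build_resnet()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    x_dummy = np.random.rand(32, 64, 64, 3).astype("float32")   # dummy "face" images
    y_dummy = np.random.randint(0, 10, size=(32,))               # dummy identity labels
    model.fit(x_dummy, y_dummy, epochs=1, batch_size=8)
    return model

if __name__ == "__main__":
    train_face_recognition_model()
```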
Explanation
residual_block defines the convolutional + skip-connection logic.
build_resnet stacks multiple residual blocks, culminating in global average pooling and a dense classification layer.
train_face_recognition_model includes random input to show usage. For real-world face recognition, you’d replace the dummy data with actual face images.
A Note on Masked Faces
The COVID-19 pandemic introduced a new twist: mask-wearing. CNNs trained solely on full-face images often struggle with such occlusions. Solutions include fine-tuning your model on masked-face data or modifying the architecture to emphasize the visible features around the eyes and forehead; simply incorporating masked faces into the training set already mitigates much of the performance loss.
Conclusion
I hope this deep, math-heavy excursion into face recognition with CNN/ResNet gave you a better grasp of how everything fits together. If you have any questions or suggestions, please reach out. Enjoy experimenting with your own projects!