Face Recognition with ResNet

I built a face recognition system during COVID lockdown and learned way more about ResNet than I expected. Your phone unlocking when it sees your face seems like magic, but it’s actually just some clever math and neural networks doing their thing. Here’s what I figured out (code on GitHub).

Why Face Recognition?

Face recognition is everywhere now: your phone, security systems, even some coffee shops. Modern systems handle messy real-world stuff like uneven lighting, weird angles, and partly covered faces. ResNet makes a lot of this possible.

Convolutions Are Simpler Than They Look

Your input image is just a bunch of numbers arranged in a 3D grid: height × width × channels (like RGB). A convolution is basically sliding a small filter across this image and doing some math:

$$(\mathbf{X} * \mathbf{K})(i,j) = \sum_{p=0}^{k_h-1} \sum_{q=0}^{k_w-1} \sum_{c=0}^{C-1} \mathbf{X}(i+p,j+q,c) \cdot \mathbf{K}(p,q,c)$$

Don’t let the notation scare you. You take your filter, multiply it with a patch of the image, add everything up, and that’s your output for that spot. Early layers learn simple stuff like edges, and deeper layers combine these into more complex features like eyes and noses.
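To make the formula concrete, here is a minimal pure-Python sketch that follows it term by term. The function name conv2d_valid, the toy image, and the edge filter are made up for illustration, and it assumes no padding and a stride of 1. Real libraries do the same arithmetic far more efficiently.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1).

    image:  nested lists indexed [row][col][channel]
    kernel: nested lists indexed [row][col][channel]
    Returns a 2D list with one sum per filter position.
    """
    k_h, k_w, channels = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # The three nested loops are the three sums in the formula.
            for p in range(k_h):
                for q in range(k_w):
                    for c in range(channels):
                        total += image[i + p][j + q][c] * kernel[p][q][c]
            row.append(total)
        output.append(row)
    return output


# A 3x4 single-channel image: dark on the left, bright on the right.
image = [[[0.0], [0.0], [1.0], [1.0]] for _ in range(3)]
# A 1x2 filter that responds to a dark-to-bright step (a vertical edge).
edge_filter = [[[-1.0], [1.0]]]

response = conv2d_valid(image, edge_filter)
print(response[0])  # [0.0, 1.0, 0.0]
```

The only nonzero response sits exactly where the image goes from dark to bright, which is the "early layers learn edges" idea in miniature.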

Adding Nonlinearity with ReLU

After each convolution, we usually apply ReLU (Rectified Linear Unit):

$$\text{ReLU}(z) = \max(0,z)$$

If it’s negative, make it zero. That’s it. Without this, stacking layers would just compose linear operations, and a stack of linear operations collapses into a single linear one, which can’t learn complex patterns.
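A tiny scalar sketch shows why the nonlinearity matters. The weights here are made up, and each "layer" is just one multiplication, so treat it as an illustration rather than a real network.

```python
def relu(z):
    """Return z if it is positive, otherwise 0."""
    return max(0.0, z)


def two_linear_layers(x, w1=2.0, w2=3.0):
    # Two stacked linear layers with made-up weights ...
    return w2 * (w1 * x)


def one_linear_layer(x, w=6.0):
    # ... behave exactly like one linear layer with weight w2 * w1.
    return w * x


def two_layers_with_relu(x, w1=2.0, w2=3.0):
    # Putting ReLU in between breaks that equivalence.
    return w2 * relu(w1 * x)


inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print([relu(x) for x in inputs])                  # [0.0, 0.0, 0.0, 1.0, 2.0]
print([two_linear_layers(x) for x in inputs])     # [-12.0, -6.0, 0.0, 6.0, 12.0]
print([two_layers_with_relu(x) for x in inputs])  # [0.0, 0.0, 0.0, 6.0, 12.0]
```

The two-layer linear version is indistinguishable from a single layer, while the ReLU version bends at zero, which is something no single linear layer can do.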

Using Multiple Filters

Real networks have many filters per layer. So instead of one output, you get:

$$\mathbf{Y} = f(\mathbf{X} * \mathbf{K}_1, \mathbf{X} * \mathbf{K}_2, \ldots, \mathbf{X} * \mathbf{K}_N)$$

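Here $f$ is the activation (ReLU in our case), applied to each filter’s output, and the $N$ resulting feature maps are stacked along the channel axis. Each filter learns to detect a different feature. Stacking them is what gives the network its power.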
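Here is a small sketch of one layer with two filters, assuming a single-channel image, no padding, and ReLU as the activation. The helper names correlate and conv_layer and the hand-written filters are mine. In a real network the filter values are learned.

```python
def correlate(image, kernel):
    """Single-channel sliding-window sum of products (no padding, stride 1)."""
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + p][j + q] * kernel[p][q]
                for p in range(k_h) for q in range(k_w))
            for j in range(len(image[0]) - k_w + 1)
        ]
        for i in range(len(image) - k_h + 1)
    ]


def conv_layer(image, kernels):
    """Apply every kernel, then ReLU, and stack the results as channels."""
    maps = [correlate(image, k) for k in kernels]
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [[max(0, m[i][j]) for m in maps] for j in range(width)]
        for i in range(height)
    ]


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]
kernels = [
    [[-1, 1], [-1, 1]],   # responds to dark-to-bright steps, left to right
    [[-1, -1], [1, 1]],   # responds to dark-to-bright steps, top to bottom
]

features = conv_layer(image, kernels)
print(len(features), len(features[0]), len(features[0][0]))  # 2 3 2
print(features[0][1])  # [2, 0]
print(features[1][0])  # [0, 2]
```

The output has one channel per filter: the first channel lights up on the vertical edge, the second on the horizontal one.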

The ResNet Trick

Deep networks used to have a problem: gradients would disappear as they traveled back through all those layers during training. ResNet fixed this with skip connections — instead of just passing data through layers sequentially, it adds shortcuts:

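$$\mathbf{X}_{\text{out}} = \mathbf{X} + F(\mathbf{X})$$

$F$ does all the convolution work, but we also add the original input $\mathbf{X}$ directly to the output. Gradients get a direct path back to earlier layers, because the derivative of the output with respect to $\mathbf{X}$ is the identity plus the derivative of $F$, so it doesn’t shrink to nothing even when $F$ contributes very little.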
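A rough scalar sketch of that gradient argument: by the chain rule, the gradient through a stack of plain layers is a product of the layers' slopes, while through residual blocks it is a product of (1 + slope) terms. The depth of 50 and the slope of 0.01 are made-up numbers chosen only to make the effect visible.

```python
def plain_gradient(local_slopes):
    """Gradient through stacked layers y = F(x): product of the F'(x) terms."""
    grad = 1.0
    for slope in local_slopes:
        grad *= slope
    return grad


def residual_gradient(local_slopes):
    """Gradient through stacked blocks y = x + F(x): product of (1 + F'(x))."""
    grad = 1.0
    for slope in local_slopes:
        grad *= 1.0 + slope
    return grad


# Hypothetical setup: 50 layers whose F each has a small slope of 0.01.
slopes = [0.01] * 50

print(plain_gradient(slopes))     # vanishingly small
print(residual_gradient(slopes))  # roughly 1.64
```

Real networks deal with matrices rather than single numbers, but the same "1 plus something" structure is what keeps the signal alive.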

A typical ResNet block:

$$F(\mathbf{X}) = W_2\sigma(\text{BN}(W_1\mathbf{X}))$$
$$\mathbf{X}_{\text{out}} = \mathbf{X} + F(\mathbf{X})$$

Do some convolutions and batch normalization, then add the result to what you started with. Here $W_1$ and $W_2$ are the convolutions, BN is batch normalization, and $\sigma$ is ReLU.

Building a Face Recognition System

Finding and Preparing Faces

First, you need to find faces in images. There are lots of ways to do this, from old-school Haar cascades to modern CNN detectors. Once you find a face, crop it out and maybe align it so all the faces are oriented the same way.
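The cropping step can be sketched in a few lines. This assumes the detector hands back a box as (x, y, width, height) in pixels, which is a common convention but still an assumption here. The function name crop_with_margin and the margin value are made up, and alignment is left out.

```python
def crop_with_margin(image, box, margin=0.2):
    """Crop `box` from `image`, padded by `margin` and clipped to the image.

    image: nested list indexed [row][col]
    box:   (x, y, width, height) in pixels, an assumed detector output
    """
    x, y, w, h = box
    pad_x, pad_y = int(w * margin), int(h * margin)
    # Clamp to the image so boxes near the border do not go out of range.
    top = max(0, y - pad_y)
    left = max(0, x - pad_x)
    bottom = min(len(image), y + h + pad_y)
    right = min(len(image[0]), x + w + pad_x)
    return [row[left:right] for row in image[top:bottom]]


# A fake 10x10 "image" whose pixel values encode their own (row, col).
image = [[(r, c) for c in range(10)] for r in range(10)]
face = crop_with_margin(image, box=(6, 2, 4, 5), margin=0.25)

print(len(face), len(face[0]))  # 7 5
print(face[0][0])               # (1, 5)
```

A little margin around the box keeps the forehead and chin in the crop, and the clamping handles faces near the edge of the frame.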

From Faces to Numbers

The real magic: pass cropped faces through your ResNet. Instead of trying to classify faces directly, take the output from a deep layer as a compact representation of that face. You get a vector of numbers that captures what makes that face unique.

Recognition

For identifying specific people, add a classification layer on top:

$$\hat{y}_i = \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$

Train it with cross-entropy loss, where $y$ is the one-hot label:

$$\mathcal{L} = -\sum_{i=1}^K y_i \log(\hat{y}_i)$$
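Both formulas fit in a few lines of plain Python. The logits below are made-up scores for three people, and subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    """Loss for a one-hot label: minus the log of the true class's probability."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


logits = [2.0, 1.0, 0.1]   # made-up scores for three people
probs = softmax(logits)
label = [1, 0, 0]          # the first person is the correct answer

print([round(p, 3) for p in probs])           # [0.659, 0.242, 0.099]
print(round(cross_entropy(label, probs), 3))  # 0.417
```

The loss only looks at the probability given to the correct person, so it shrinks toward zero as the model becomes more confident in the right answer.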

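Or skip the classifier and compare face embeddings directly with a distance metric such as Euclidean or cosine distance. That is more flexible: you can enroll a new person by storing their embedding, without retraining the network.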
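Matching by embedding can be sketched like this, using cosine similarity. The names, the 4-dimensional vectors, and the 0.8 threshold are all invented for illustration. Real embeddings have far more dimensions, and the threshold would need tuning on real data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(query, gallery, threshold=0.8):
    """Return the best-matching name, or None if nothing is similar enough."""
    best_name, best_score = None, threshold
    for name, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Made-up embeddings for two enrolled people.
gallery = {
    "alice": [0.9, 0.1, 0.0, 0.4],
    "bob": [0.1, 0.8, 0.5, 0.0],
}

print(identify([0.8, 0.2, 0.1, 0.5], gallery))  # alice
print(identify([0.0, 0.0, 1.0, 0.0], gallery))  # None
```

Adding a new person is just one more entry in the gallery, and an unknown face falls below the threshold instead of being forced into the nearest class.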

Implementation

Here’s a simplified ResNet for face recognition. Keras already ships a full ResNet50 (tf.keras.applications.ResNet50), but building a smaller version helps you understand what’s happening:

import tensorflow as tf
from tensorflow.keras import layers, models, regularizers


def residual_block(x: tf.Tensor, filters: int, stride: int = 1) -> tf.Tensor:
    """
    Applies a simple residual block to the input tensor 'x' using
    the specified number of 'filters' and convolution 'stride'.

    Args:
        x: Input feature map.
        filters: Number of convolution filters.
        stride: Stride of the first convolution (and of the shortcut projection).

    Returns:
        tf.Tensor: Output of the residual block.
    """
    shortcut = x

    # First convolution
    x = layers.Conv2D(filters, kernel_size=3, strides=stride, padding='same',
                      kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    # Second convolution
    x = layers.Conv2D(filters, kernel_size=3, strides=1, padding='same',
                      kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.BatchNormalization()(x)

    # If dimensions changed, adjust the shortcut with a 1x1 convolution
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, kernel_size=1, strides=stride,
                                 padding='same')(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)

    # Add skip connection
    x = layers.Add()([x, shortcut])
    x = layers.Activation('relu')(x)
    return x


def build_resnet(input_shape: tuple = (64, 64, 3),
                 num_classes: int = 16) -> tf.keras.Model:
    """
    Builds a miniature ResNet-like model for face recognition.

    Args:
        input_shape: Shape of the input images.
        num_classes: Number of classes (faces) to predict.

    Returns:
        tf.keras.Model: A compiled Keras model representing
        a ResNet-style network.
    """
    inputs = layers.Input(shape=input_shape)

    # Initial Conv Layer
    x = layers.Conv2D(16, kernel_size=3, strides=1, padding='same')(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    # Stack of residual blocks
    x = residual_block(x, filters=16, stride=1)
    x = residual_block(x, filters=16, stride=1)
    x = residual_block(x, filters=32, stride=2)
    x = residual_block(x, filters=32, stride=1)

    # Global average pooling
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    model = models.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model


def train_face_recognition_model():
    """
    Demonstrates training a mini ResNet model on dummy data.
    Replace 'x_train, y_train' with real face data for practical usage.
    """
    # Example dataset placeholders
    x_train = tf.random.normal((256, 64, 64, 3))  # (batch, height, width, channels)
    y_train = tf.random.uniform((256,), minval=0, maxval=16, dtype=tf.int32)
    y_train = tf.one_hot(y_train, depth=16)

    # Build and train the model
    model = build_resnet()
    model.summary()
    model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=1)


if __name__ == "__main__":
    train_face_recognition_model()

The residual_block function is where the skip connection magic happens. The build_resnet function stacks these blocks together with some pooling and classification layers on top.

Masks Broke Everything

COVID threw everyone a curveball. Models trained on full faces suddenly couldn’t recognize people wearing masks. The fix? Train on masked faces too, or focus the model on parts around the eyes and forehead that stay visible.
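One crude way to get masked training examples is to blank out the lower part of each cropped face. The sketch below is a stand-in under that assumption, not what production systems do: the function name add_synthetic_mask, the 0.55 cut-off, and the fill value are all made up, and more realistic approaches overlay mask textures using facial landmarks.

```python
def add_synthetic_mask(face, start_fraction=0.55, fill=0.0):
    """Return a copy of `face` with the rows below `start_fraction` blanked out.

    face: nested list indexed [row][col][channel], already cropped and aligned
    """
    first_masked_row = int(len(face) * start_fraction)
    masked = []
    for r, row in enumerate(face):
        if r < first_masked_row:
            # Eyes and forehead: keep the original pixels.
            masked.append([list(pixel) for pixel in row])
        else:
            # Nose and mouth: replace every pixel with the fill value.
            masked.append([[fill] * len(pixel) for pixel in row])
    return masked


# An 8x4 all-white RGB "face" so the effect is easy to see.
face = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(8)]
masked = add_synthetic_mask(face)

print(masked[3][0])  # [1.0, 1.0, 1.0]
print(masked[4][0])  # [0.0, 0.0, 0.0]
```

Mixing copies like these into the training set pushes the network to rely on the region around the eyes, which is the part that stays visible.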

Wrapping Up

The math behind face recognition looks scarier than it is. ResNet’s skip connections are a simple idea that solved a big problem. Sometimes the hardest part isn’t the algorithm. It’s dealing with real-world changes like everyone suddenly wearing masks.

Convolutions detect features, skip connections help train deep networks, and the whole system learns to turn faces into unique fingerprints that can be compared mathematically. Each piece builds on the previous one.