Building Neural Networks From Scratch

This post is inspired by Michael Nielsen’s “Neural Networks and Deep Learning” and 3Blue1Brown’s excellent video series on neural networks. Their clear explanations of the mathematical foundations helped shape my understanding of these concepts.

I’ve been implementing neural networks in my free time using TensorFlow, but I wanted to understand what’s actually happening under all those abstractions. This post documents my implementation of a neural network using only NumPy, focusing on the mathematical foundations and key implementation details.

Core Architecture

A neural network consists of layers of neurons, where each layer applies a linear transformation followed by a non-linear activation. Each neuron computes a weighted sum of its inputs plus a bias:

$$z=\sum_{i=1}^{n}w_ix_i+b$$
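
In NumPy this boils down to a single dot product. Here is a minimal sketch for one neuron with three inputs (the values are arbitrary; numpy is assumed to be imported as np, as in the code below):

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.05                         # bias
z = np.dot(w, x) + b             # weighted sum plus bias
print(z)                         # roughly -0.98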

Let’s implement a three-layer network to understand these concepts:

import numpy as np

class NeuralNetwork:
    """
    Neural network implementation using only NumPy.
    Supports arbitrary layer dimensions with sigmoid activation.

    Attributes:
        layer_dims: Dimensions of each network layer
        parameters: Network weights and biases
    """
    def __init__(self, layer_dims: list[int]) -> None:
        """
        Initialize network with specified layer dimensions

        Args:
            layer_dims: List of integers specifying nodes in each layer
        """
        self.layer_dims = layer_dims
        self.parameters = self._initialize_parameters()

Parameter Initialization

Proper weight initialization turns out to be critical for training stability. Xavier initialization, which scales each weight matrix by the inverse square root of its fan-in, helps maintain reasonable activation magnitudes through the network:

    def _initialize_parameters(self) -> dict[str, np.ndarray]:
        """
        Initialize network parameters using Xavier initialization

        Returns:
            Dictionary containing weights (W) and biases (b) for each layer
        """
        parameters: dict[str, np.ndarray] = {}

        for l in range(1, len(self.layer_dims)):
            # Scale weights by 1/sqrt(fan-in) so activations keep a reasonable magnitude
            parameters[f"W{l}"] = np.random.randn(
                self.layer_dims[l],
                self.layer_dims[l-1]
            ) * np.sqrt(1. / self.layer_dims[l-1])
            parameters[f"b{l}"] = np.zeros((self.layer_dims[l], 1))

        return parameters
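
As a quick sanity check (the layer sizes here are arbitrary), the resulting parameter shapes can be inspected directly:

net = NeuralNetwork([2, 4, 1])
for name, value in net.parameters.items():
    print(name, value.shape)
# W1 (4, 2), b1 (4, 1), W2 (1, 4), b2 (1, 1)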

Forward Propagation

The forward pass computes each layer's activations by applying the sigmoid function to its weighted inputs:

$$\sigma\left(z\right)=\frac{1}{1+e^{-z}}$$
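
The implementation below inlines this expression directly. Written as a standalone helper (purely for reference; the clip is an optional guard against overflow in np.exp for very negative inputs), it might look like:

def sigmoid(z: np.ndarray) -> np.ndarray:
    # Clip to keep np.exp from overflowing for large negative z
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))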

Here’s the implementation:

    def forward_propagation(
        self,
        X: np.ndarray
    ) -> tuple[np.ndarray, dict[str, np.ndarray]]:
        """
        Compute forward pass through the network

        Args:
            X: Input data of shape (input_size, m) where m is batch size

        Returns:
            Tuple containing:
                - Output activations
                - Cache of intermediate values for backpropagation
        """
        cache: dict[str, np.ndarray] = {}
        A = X

        for l in range(1, len(self.layer_dims)):
            Z = np.dot(self.parameters[f"W{l}"], A) + self.parameters[f"b{l}"]
            A = 1 / (1 + np.exp(-Z))
            cache[f"A{l}"] = A
            cache[f"Z{l}"] = Z

        return A, cache
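
A minimal shape check, assuming the class as defined so far (the layer sizes and batch size are arbitrary):

net = NeuralNetwork([3, 5, 1])
X = np.random.randn(3, 10)             # 10 examples, 3 features each
A, cache = net.forward_propagation(X)
print(A.shape)                         # (1, 10): one sigmoid output per example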

Loss Function

For binary classification tasks, we use the binary cross-entropy loss:

$$E = -\frac{1}{m}\sum_{i=1}^m\left[y_i\log(a_i) + (1-y_i)\log(1-a_i)\right]$$

This loss function penalizes predictions that are confident but wrong more heavily than those that are uncertain. It approaches zero as the predictions get closer to the true labels.
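
A small worked example (the labels and predicted probabilities are made up) shows the computation directly:

Y = np.array([[1, 0, 1]])              # true labels
A = np.array([[0.9, 0.2, 0.6]])        # predicted probabilities
m = Y.shape[1]
loss = -1/m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
print(round(float(loss), 3))           # about 0.280
# a confidently wrong prediction (e.g. 0.01 for a label of 1) would dominate the sum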

Gradient Descent

The network learns through backpropagation by computing gradients of the loss with respect to each parameter:

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w_i}$$

These gradients drive the weight updates:

$$w_i := w_i - \alpha \frac{\partial E}{\partial w_i}$$
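
Before looking at the full training loop, here is the update rule in isolation on a toy one-parameter problem (minimizing (w - 3)^2, chosen purely for illustration):

w, alpha = 0.0, 0.1
for _ in range(50):
    grad = 2 * (w - 3)   # dE/dw for E(w) = (w - 3)**2
    w -= alpha * grad    # the update rule above
print(round(w, 4))       # close to 3, the minimum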

Here’s the training loop:

    def train(
        self,
        X: np.ndarray,
        Y: np.ndarray,
        learning_rate: float = 0.1,
        epochs: int = 1000
    ) -> list[float]:
        """
        Train the network using gradient descent

        Args:
            X: Training data of shape (input_size, m)
            Y: Target values of shape (output_size, m)
            learning_rate: Step size for gradient descent
            epochs: Number of training iterations

        Returns:
            List of training losses per epoch
        """
        losses: list[float] = []
        m = X.shape[1]

        for _ in range(epochs):
            A, cache = self.forward_propagation(X)
            loss = -1/m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
            losses.append(float(loss))

            self._backward_propagation(X, Y, cache, learning_rate)

        return losses
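
Putting it together on a tiny example, the network can be trained on XOR (the layer sizes, learning rate, and epoch count below are arbitrary choices and may need tuning):

net = NeuralNetwork([2, 4, 1])
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])     # four XOR inputs, shape (2, 4)
Y = np.array([[0, 1, 1, 0]])     # XOR targets, shape (1, 4)
losses = net.train(X, Y, learning_rate=0.5, epochs=5000)
print(losses[0], losses[-1])     # the loss should decrease over training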

Backpropagation

The backpropagation algorithm computes gradients through the network using the chain rule. For the final layer, we compute the error gradient:

$$\frac{\partial L}{\partial z^{[L]}} = A^{[L]} - Y$$

For hidden layers, the gradient calculation becomes:

$$\frac{\partial L}{\partial z^{[l]}} = W^{[l+1]T} \frac{\partial L}{\partial z^{[l+1]}} \odot \sigma'(z^{[l]})$$

Here’s the implementation:

    def _backward_propagation(
        self,
        X: np.ndarray,
        Y: np.ndarray,
        cache: dict[str, np.ndarray],
        learning_rate: float
    ) -> None:
        """
        Compute gradients and update network parameters

        Args:
            X: Input data of shape (input_size, m)
            Y: Target values of shape (output_size, m)
            cache: Dictionary containing intermediate values from forward pass
            learning_rate: Step size for gradient descent
        """
        m = X.shape[1]
        L = len(self.layer_dims) - 1

        # Derivative of the cross-entropy loss with respect to the output activations
        dA = -(np.divide(Y, cache[f"A{L}"]) -
               np.divide(1 - Y, 1 - cache[f"A{L}"]))

        for l in reversed(range(1, L + 1)):
            A_prev = cache[f"A{l-1}"] if l > 1 else X

            # Sigmoid derivative expressed through the activations: sigma'(z) = a * (1 - a)
            dZ = dA * (cache[f"A{l}"] * (1 - cache[f"A{l}"]))

            dW = 1/m * np.dot(dZ, A_prev.T)
            db = 1/m * np.sum(dZ, axis=1, keepdims=True)

            if l > 1:
                # Propagate the gradient to the previous layer before this layer's weights change
                dA = np.dot(self.parameters[f"W{l}"].T, dZ)

            self.parameters[f"W{l}"] -= learning_rate * dW
            self.parameters[f"b{l}"] -= learning_rate * db
The gradient computation follows from differentiating our loss function with respect to each parameter. For weights, we get:

$$\frac{\partial L}{\partial W^{[l]}} = \frac{1}{m}\frac{\partial L}{\partial z^{[l]}}A^{[l-1]T}$$

And for biases:

$$\frac{\partial L}{\partial b^{[l]}} = \frac{1}{m}\sum_{i=1}^m \frac{\partial L}{\partial z_i^{[l]}}$$
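
A useful way to verify these derivations is a numerical gradient check: perturb a single parameter by a small epsilon, recompute the loss, and compare the central-difference estimate with the analytic gradient. Here is a rough sketch assuming the class above (the name gradient_check, the parameter indices, and epsilon are illustrative choices; comparing against dW requires temporarily exposing the gradients from _backward_propagation):

def gradient_check(net: NeuralNetwork, X: np.ndarray, Y: np.ndarray,
                   key: str, i: int, j: int, eps: float = 1e-7) -> float:
    """Central-difference estimate of d(loss)/d(parameter[i, j])."""
    def loss() -> float:
        A, _ = net.forward_propagation(X)
        m = X.shape[1]
        return float(-1/m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

    original = net.parameters[key][i, j]
    net.parameters[key][i, j] = original + eps
    loss_plus = loss()
    net.parameters[key][i, j] = original - eps
    loss_minus = loss()
    net.parameters[key][i, j] = original       # restore the parameter
    return (loss_plus - loss_minus) / (2 * eps)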

Final Thoughts

A few implementation details worth noting:

  • The gradient calculations are vectorized across the entire batch
  • Activations are cached during forward propagation to avoid recomputing them
  • The sigmoid derivative has a convenient form: $\sigma'(z) = \sigma(z)(1-\sigma(z))$; a quick numerical check follows this list
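
That last identity can be confirmed numerically with a central difference (the test point and epsilon are arbitrary):

sigmoid = lambda z: 1 / (1 + np.exp(-z))
z, eps = 0.5, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(abs(numeric - analytic) < 1e-9)   # True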

This completes our neural network implementation. The interplay between forward propagation and backpropagation allows the network to iteratively refine its parameters, gradually improving its predictions.