Neural Network
Neural network project covering theory, structure, backpropagation, and coding in Python

Neural Network Theory
Neural Network Structure:
backward pass for w:
backward pass for b:
backward pass for w mathematically:
\[L = \underbrace{L\Big(\underbrace{\hat{y}\big(\underbrace{z(Xw + b)}_{\frac{\partial z}{\partial w}}\big)}_{\frac{\partial \hat{y}}{\partial z}}\Big)}_{\frac{\partial L}{\partial \hat{y}}}\]
\[\frac{\partial L}{\partial w} = \underbrace{\frac{\partial L}{\partial z}}_{\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}} \cdot \frac{\partial z}{\partial w}\]
As a summary:
\[L = \underbrace{L\left(\hat{y}\right)}_{\frac{\partial L}{\partial \hat{y}}} = \underbrace{L\left(\hat{y}\left(z\right)\right)}_{\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}} = \underbrace{L\left(\hat{y}\left(z(XW + b)\right)\right)}_{\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W}}\]
Or more explicitly for the chain rule:
\[\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial W}\]
\(z = XW + b\)
\(\hat{y} = f(z)\) (activation function, e.g. softmax)
\(L = \mathrm{loss}(\hat{y}, y)\)
The derivatives \(\frac{\partial L}{\partial \hat{y}}\), \(\frac{\partial \hat{y}}{\partial z}\), and \(\frac{\partial z}{\partial W}\) represent the gradient chain from output all the way down to weights.
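For the softmax activation and cross-entropy loss used in the code below, this chain collapses to a particularly simple form. This is a standard result, stated here for reference and assuming one-hot targets \(y\) (with the batch averaging used in the code, each term is additionally scaled by \(1/N\)):
\[\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y, \qquad \frac{\partial L}{\partial W} = X^{\top}(\hat{y} - y), \qquad \frac{\partial L}{\partial b} = \sum_{i}\left(\hat{y}_i - y_i\right)\]
This gives a useful sanity check for the gradients computed layer by layer in the next section.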
Coding the Neural Network
Neural Network Structure:
\[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}\]
where:
Loss derivative: \(\frac{\partial L}{\partial \hat{y}}\)
Activation derivative: \(\frac{\partial \hat{y}}{\partial z}\)
Linear model derivative: \(\frac{\partial z}{\partial w}\)
1. Loss derivative \(\left(\frac{\partial L}{\partial \hat{y}}\right)\):
This is computed in:
def backward(self) -> np.ndarray:  # CrossEntropy
    grad = -self.target / self.prediction / self.target.shape[0]
    return grad
self.prediction is \(\hat{y}\) (the output of the softmax). This gives the gradient of the loss with respect to the softmax output \(\hat{y}\).
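For context, a minimal sketch of how the full CrossEntropy loss could look, assuming one-hot targets. The class name, the __call__/forward signatures, and the eps guard are assumptions; only the backward above is taken from the project:

import numpy as np

class CrossEntropy:
    def __call__(self, prediction: np.ndarray, target: np.ndarray) -> float:
        return self.forward(prediction, target)

    def forward(self, prediction: np.ndarray, target: np.ndarray) -> float:
        # Cache the inputs so backward() can reuse them.
        self.prediction = prediction
        self.target = target
        eps = 1e-12  # guard against log(0)
        # Mean cross-entropy over the batch.
        return float(-np.sum(target * np.log(prediction + eps)) / target.shape[0])

    def backward(self) -> np.ndarray:
        # dL/d y_hat = -y / y_hat, averaged over the batch (same as the snippet above).
        return -self.target / self.prediction / self.target.shape[0]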
2. Activation derivative \(\left(\frac{\partial \hat{y}}{\partial z}\right)\):
In the Softmax layer:
def backward(self, up_grad: np.ndarray) -> np.ndarray:  # Softmax
    ...
    down_grad[i] = np.dot(jacobian, up_grad[i])
up_grad is \(\frac{\partial L}{\partial \hat{y}}\) from the loss. The softmax Jacobian gives \(\frac{\partial \hat{y}}{\partial z}\).
Output:
down_grad is \(\frac{\partial L}{\partial z}\). This matches step 2 of the chain rule.
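Filled in, the Softmax layer could look roughly like this. This is a sketch: the forward implementation, the self.out attribute, and subclassing the Layer base class shown further below are assumptions; the Jacobian product line is the one quoted above:

import numpy as np

class Softmax(Layer):
    def forward(self, inp: np.ndarray) -> np.ndarray:
        self.inp = inp
        # Subtract the row-wise max for numerical stability.
        exp = np.exp(inp - np.max(inp, axis=1, keepdims=True))
        self.out = exp / np.sum(exp, axis=1, keepdims=True)
        return self.out

    def backward(self, up_grad: np.ndarray) -> np.ndarray:
        down_grad = np.zeros_like(up_grad)
        for i in range(self.out.shape[0]):               # one sample at a time
            s = self.out[i].reshape(-1, 1)
            jacobian = np.diagflat(s) - np.dot(s, s.T)   # ∂ŷ/∂z for this sample
            down_grad[i] = np.dot(jacobian, up_grad[i])  # chain with ∂L/∂ŷ
        return down_grad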
3. Linear model derivative \(\left(\frac{\partial z}{\partial w}\right)\):
In the Linear layer:
def backward(self, up_grad: np.ndarray) -> np.ndarray:  # Linear
    self.dw = np.dot(self.inp.T, up_grad)             # ∂L/∂w
    self.db = np.sum(up_grad, axis=0, keepdims=True)  # ∂L/∂b
    down_grad = np.dot(up_grad, self.w.T)             # ∂L/∂input
    return down_grad
up_grad is \(\frac{\partial L}{\partial z}\) from the activation. Multiplying with inp.T applies \(\frac{\partial z}{\partial w}\) to get \(\frac{\partial L}{\partial w}\). This is step 3 of the chain rule.
Full Chain in Code:
Loss backward \(\rightarrow\) CrossEntropy.backward(): \(\frac{\partial L}{\partial \hat{y}}\)
Activation backward \(\rightarrow\) Softmax.backward(): \(\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}\)
Linear backward \(\rightarrow\) Linear.backward(): \(\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w}\)
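Putting the three steps together, one forward/backward sweep could be wired as follows. This is a usage sketch: it assumes the CrossEntropy and Softmax classes sketched above, a Linear layer with an (in_features, out_features) constructor like the one sketched at the end of this section, and arbitrary example data.

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])   # two samples, two features
y = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot targets, two classes

linear = Linear(2, 2)
softmax = Softmax()
loss = CrossEntropy()

# Forward pass
z = linear(X)                  # z = XW + b
y_hat = softmax(z)             # ŷ = softmax(z)
loss_value = loss(y_hat, y)    # L = CrossEntropy(ŷ, y)

# Backward pass: the chain rule, layer by layer in reverse
grad = loss.backward()         # ∂L/∂ŷ
grad = softmax.backward(grad)  # ∂L/∂z
grad = linear.backward(grad)   # fills linear.dw and linear.db, returns ∂L/∂X

# Gradient-descent update of the Linear parameters
linear.step(lr=0.1)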
Linear:
Abstract Python Code for Layer:
import numpy as np

class Layer:
    """Base interface that every layer (e.g. Linear, Softmax) implements."""

    def __init__(self):
        self.inp = None   # cached input from the forward pass
        self.out = None   # cached output from the forward pass

    def __call__(self, inp: np.ndarray) -> np.ndarray:
        return self.forward(inp)

    def forward(self, inp: np.ndarray) -> np.ndarray:
        raise NotImplementedError

    def backward(self, up_grad: np.ndarray) -> np.ndarray:
        raise NotImplementedError

    def step(self, lr: float) -> None:
        # Layers without parameters (e.g. Softmax) leave this as a no-op.
        pass
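For completeness, the Linear layer referenced above could subclass Layer like this. The backward body is the one quoted earlier; the constructor signature, weight initialisation, and step() update are assumptions:

class Linear(Layer):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Small random weights and zero bias (an assumed initialisation).
        self.w = np.random.randn(in_features, out_features) * 0.01
        self.b = np.zeros((1, out_features))
        self.dw = None
        self.db = None

    def forward(self, inp: np.ndarray) -> np.ndarray:
        self.inp = inp
        self.out = np.dot(inp, self.w) + self.b   # z = XW + b
        return self.out

    def backward(self, up_grad: np.ndarray) -> np.ndarray:
        self.dw = np.dot(self.inp.T, up_grad)              # ∂L/∂w
        self.db = np.sum(up_grad, axis=0, keepdims=True)   # ∂L/∂b
        return np.dot(up_grad, self.w.T)                   # ∂L/∂input

    def step(self, lr: float) -> None:
        # Plain gradient-descent update.
        self.w -= lr * self.dw
        self.b -= lr * self.db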
Input features: \(X\)
Feature weights: \(W\)
Bias term: \(b\)
Activation function: \(f\)
Output of the neuron: \(\hat{y} = f(XW + b)\)
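As a tiny concrete illustration of these pieces (the numbers and the sigmoid activation are arbitrary choices for this example):

import numpy as np

X = np.array([[1.0, 2.0]])        # input features
W = np.array([[0.5], [-0.25]])    # feature weights
b = np.array([[0.1]])             # bias term

z = np.dot(X, W) + b              # z = XW + b  ->  [[0.1]]
y_hat = 1.0 / (1.0 + np.exp(-z))  # sigmoid activation  ->  approx. [[0.525]]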