If you've ever wondered how computers learn to recognize handwritten numbers, you're about to find out. Today, we're going to understand the most important steps when building a neural network from scratch.

Don't worry if "neural network" sounds intimidating. By the end of this article, you'll understand not just what these systems are, but how they actually work. We'll be exploring a specific type called a Multi-Layer Perceptron, or MLP for short, and we'll see how it learns to read handwritten digits.

What Exactly Is a Neural Network?

Think of your brain for a moment. It contains billions of neurons that connect to each other, passing electrical signals back and forth. When you learn something new, these connections strengthen or weaken. Over time, patterns emerge that help you recognize faces, understand language, or identify a handwritten "7" even when it's written sloppily.

Artificial neural networks work on a similar principle. Instead of biological neurons, we have mathematical units that receive inputs, process them, and pass results forward. These artificial neurons are organized in layers, with each layer transforming the information it receives before passing it to the next layer.

A Feedforward Neural Network is the simplest form of this architecture. Information flows in one direction only: from input to output, like water flowing downstream. There are no loops or backwards connections during the forward pass. You feed data into one end, it travels through the network, and predictions come out the other side.

Hello, Multi-Layer Perceptron

The Multi-Layer Perceptron is a specific type of feedforward network that has earned its place as a fundamental building block of modern AI. The name might sound strange or complex, but it's actually quite descriptive.

"Multi-layer" simply means the network has more than one layer of processing units. "Perceptron" is a fancy term for a basic computational unit that makes simple decisions based on its inputs. String many of these together in layers, and you get an MLP.

Here's what makes an MLP special: it has at least three types of layers. First, there's an input layer that receives your data. Then come one or more hidden layers where the real magic happens. These hidden layers learn to detect patterns and features in your data. Finally, there's an output layer that produces the network's prediction.

The "hidden" layers earned their name because you can't directly observe what they're learning. They're sandwiched between input and output, working behind the scenes to transform raw data into meaningful predictions.

The MNIST Dataset: A Perfect Training Ground

When you're learning to build neural networks, you need a dataset that's challenging enough to be interesting but simple enough to understand. For this, we have MNIST, which stands for the Modified National Institute of Standards and Technology database.

MNIST contains 70,000 images of handwritten digits from 0 to 9. Each image is tiny by modern standards, just 28 pixels by 28 pixels, rendered in grayscale. The dataset includes 60,000 images for training your network and 10,000 for testing how well it learned.

These images came from real people with real handwriting variations. Some wrote their sevens with a horizontal line through the middle. Others made their nines look confusingly similar to fours. This variety makes MNIST an excellent benchmark for testing whether your network can truly learn to generalize.

For decades, MNIST has served as the "hello world" of deep learning. It's complex enough to require a proper neural network but simple enough that you can train one in minutes on a regular computer.

The Building Blocks of an MLP

Building a neural network from scratch means understanding its fundamental components. Let's break them down one by one.

At the heart of any neural network are weights and biases. Weights are numbers that determine how much influence one neuron has on another. If you imagine neurons as people in a social network, weights represent how much each person listens to each other person. Some connections are strong, others are weak, and some might even be negative.

Biases are like each neuron's personal threshold for activation. They help the network learn patterns that don't pass through the origin. Without biases, your network would be severely limited in what it could learn.

When a neuron receives inputs from the previous layer, it multiplies each input by its corresponding weight, adds all these weighted inputs together, then adds the bias. This gives you what's called the "net input" or "pre-activation."
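To make that concrete, here is a minimal NumPy sketch of a single neuron's net input; the values and variable names are purely illustrative, not taken from any particular implementation.

```python
import numpy as np

# One neuron with three inputs (illustrative values).
weights = np.array([0.2, -0.5, 0.1])  # how much this neuron "listens" to each input
bias = 0.4                            # the neuron's personal offset
inputs = np.array([1.0, 0.0, 2.0])

# Net input (pre-activation): weighted sum of the inputs, plus the bias.
net_input = np.dot(inputs, weights) + bias
print(net_input)  # 0.2*1.0 + (-0.5)*0.0 + 0.1*2.0 + 0.4 = 0.8
```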

But here's the thing: if you just used this net input directly, your entire neural network would be no more powerful than a simple linear equation. No matter how many layers you stacked, the whole thing would collapse into a single linear transformation.

This is where activation functions come in. An activation function takes that net input and transforms it in a non-linear way. The most common activation function in traditional MLPs is called the sigmoid function, which squashes any input into a value between 0 and 1. It creates an S-shaped curve that's smooth and differentiable, properties that become important during training.

Think of the sigmoid function as a soft decision maker. Instead of hard yes-or-no choices, it provides gradations of activation. An input of 0 gives you 0.5, large positive inputs approach 1, and large negative inputs approach 0.
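As a quick sketch, the sigmoid function is a single line of NumPy; the three calls below just confirm the behavior described above.

```python
import numpy as np

def sigmoid(z):
    # Squash any real-valued net input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~0.99995, approaching 1
print(sigmoid(-10.0))  # ~0.00005, approaching 0
```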

How an MLP Makes Predictions

Now that we understand the components, let's see how they work together. This process is called forward propagation, and it's quite straightforward.

Imagine you're showing the network a handwritten digit. That image has 28×28 pixels, which equals 784 pixel values. Each pixel becomes one input to your network. So your input layer has 784 neurons, one for each pixel value.

These 784 values then get multiplied by weights connecting the input layer to the first hidden layer. Let's say you decided your hidden layer should have 50 neurons. That means you need 784×50 weights, plus 50 biases. Each of the 50 hidden neurons receives a weighted combination of all 784 input pixels.

Each hidden neuron applies its activation function to its net input, producing an activation value between 0 and 1. Now you have 50 activation values representing patterns the hidden layer detected in the input image.

These 50 hidden activations become inputs to the output layer. Since we're classifying digits 0 through 9, we need 10 output neurons, one for each possible digit. Each output neuron receives weighted inputs from all 50 hidden neurons, applies its activation function, and produces a final value between 0 and 1.

The output neuron with the highest activation value represents the network's prediction. If the seventh output neuron has the highest value, the network is predicting the digit is a 6 (counting from 0).

This entire process happens in milliseconds. Data flows through the network in one direction, from input to hidden to output, with each layer transforming the information it receives.
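Here's a minimal sketch of that forward pass in NumPy, assuming the 784-50-10 architecture described above. The weights are random placeholders just to show the shapes, so the "prediction" is meaningless until the network is trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    # x: 784 pixel values; w_hidden: (784, 50); w_out: (50, 10).
    hidden = sigmoid(x @ w_hidden + b_hidden)  # 50 hidden activations
    output = sigmoid(hidden @ w_out + b_out)   # 10 output activations
    return hidden, output

rng = np.random.default_rng(0)
w_h, b_h = rng.normal(0, 0.1, (784, 50)), np.zeros(50)
w_o, b_o = rng.normal(0, 0.1, (50, 10)), np.zeros(10)

_, outputs = forward(rng.random(784), w_h, b_h, w_o, b_o)
print(outputs.argmax())  # index of the most active output neuron = predicted digit
```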

Teaching the Network to Learn

Making predictions is one thing. Making good predictions is another. Initially, your network's weights are set to small random values. Its predictions are essentially random guesses. The real challenge is teaching it to improve.

This is where the learning process begins. First, you need to measure how wrong the network's predictions are. This measurement is called the loss or cost function. For our digit classification task, we might use mean squared error, which simply measures the average squared difference between the network's predictions and the correct answers.

If the network predicts [0.1, 0.2, 0.7] for the three output neurons when the correct answer is the second digit (represented as [0, 1, 0]), the loss quantifies this discrepancy. The larger the loss, the worse the predictions.
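In code, mean squared error is a one-liner; this example reuses the numbers from the paragraph above.

```python
import numpy as np

def mse_loss(prediction, target):
    # Average of the squared differences between prediction and correct answer.
    return np.mean((prediction - target) ** 2)

prediction = np.array([0.1, 0.2, 0.7])
target = np.array([0.0, 1.0, 0.0])
print(mse_loss(prediction, target))  # (0.01 + 0.64 + 0.49) / 3 = 0.38
```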

Now comes the brilliant part: backpropagation. This algorithm calculates how each weight and bias in the network contributed to the overall loss. It literally propagates the error backward through the network, from output to input.

Think of it like tracing responsibility. If the output was wrong, which hidden neurons fed it bad information? And what weights caused those hidden neurons to activate incorrectly? Backpropagation uses calculus to precisely calculate each parameter's responsibility for the error.

The mathematical technique underlying backpropagation is called the chain rule. Remember from calculus how you can break down the derivative of nested functions? If you have f(g(x)), you can find its derivative by multiplying the derivative of f with respect to g by the derivative of g with respect to x.

Neural networks are just very long chains of nested functions. The input goes through a linear combination, then an activation function, then another linear combination, then another activation function, and so on. Backpropagation applies the chain rule systematically to compute how changing any weight affects the final loss.
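Here's a sketch of backpropagation for the single-hidden-layer network described in this article, assuming sigmoid activations and mean squared error; the function name and intermediate variables are mine, chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(x, y, w1, b1, w2, b2):
    # Forward pass, keeping the activations we need for the backward pass.
    a1 = sigmoid(x @ w1 + b1)   # hidden activations
    a2 = sigmoid(a1 @ w2 + b2)  # output activations

    # Chain rule, output layer first: the derivative of MSE, times the sigmoid
    # derivative a2 * (1 - a2), gives each output neuron's "responsibility".
    delta2 = (2.0 / y.size) * (a2 - y) * a2 * (1 - a2)
    grad_w2 = np.outer(a1, delta2)
    grad_b2 = delta2

    # Push the error back through w2, then through the hidden layer's sigmoid.
    delta1 = (delta2 @ w2.T) * a1 * (1 - a1)
    grad_w1 = np.outer(x, delta1)
    grad_b1 = delta1

    return grad_w1, grad_b1, grad_w2, grad_b2
```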

Once you know each parameter's contribution to the error, you can update the parameters to reduce that error. This is done through gradient descent. You adjust each weight slightly in the direction that reduces the loss. The size of these adjustments is controlled by something called the learning rate.

The learning rate is a critical hyperparameter. Set it too high, and your network will overshoot good solutions, bouncing around wildly. Set it too low, and training will take forever, possibly getting stuck in local minima where the loss is low but not optimal.
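Given those gradients, the gradient descent update itself is only a few lines. This continues the sketch above (it assumes x, y, and the parameters w1, b1, w2, b2 are already defined), with 0.1 as an illustrative learning rate.

```python
learning_rate = 0.1  # step size: too large overshoots, too small crawls

# Move each parameter a small step against its gradient to reduce the loss.
grad_w1, grad_b1, grad_w2, grad_b2 = backprop_single(x, y, w1, b1, w2, b2)
w1 -= learning_rate * grad_w1
b1 -= learning_rate * grad_b1
w2 -= learning_rate * grad_w2
b2 -= learning_rate * grad_b2
```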

Training in Mini-Batches

In practice, we don't update weights after every single training example. Computing gradients for one example at a time is inefficient. Computing gradients for all 60,000 training examples at once requires too much memory and computation.

The solution is mini-batch training, a middle ground that offers the best of both worlds. You divide your training data into small batches, perhaps 100 examples each. For each batch, you compute the average gradient across all examples in the batch, then update the weights based on that average.

This approach is more computationally efficient than processing one example at a time because you can use vectorized operations to process entire batches at once. It's also more stable than single-example updates because the gradient is averaged across multiple examples, smoothing out individual quirks.

When you've processed all the mini-batches once, you've completed one epoch of training. Typically, you'll train for many epochs, cycling through the entire dataset repeatedly. With each epoch, the network's predictions improve as the loss decreases.
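A small generator is a convenient way to express this; here's a sketch (the names are mine) that a training loop could reuse.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size=100, rng=None):
    # Shuffle once per epoch, then yield consecutive slices of the shuffled data.
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[idx]

# One pass over all mini-batches = one epoch:
# for X_batch, Y_batch in iterate_minibatches(X_train, Y_train, batch_size=100):
#     ... average the gradients over the batch, then update the weights once ...
```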

Preparing the Data

Before you can train your network, you need to prepare the MNIST data properly. The raw images have pixel values ranging from 0 to 255, representing brightness from black to white.

Neural networks train more effectively when input features are normalized to a consistent scale. For MNIST, a common approach is to rescale pixel values to the range of -1 to 1. This centers the data around zero and prevents any single feature from dominating due to its scale.
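The rescaling itself is simple arithmetic; a minimal sketch:

```python
import numpy as np

def normalize(pixels):
    # Map raw pixel values from [0, 255] to [-1, 1].
    return (pixels.astype(np.float64) / 255.0) * 2.0 - 1.0

print(normalize(np.array([0, 128, 255])))  # [-1.0, ~0.004, 1.0]
```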

You also need to split your data carefully. The 60,000 training images should be further divided into a training set and a validation set. The validation set lets you monitor how well your network generalizes to data it hasn't seen during training. This helps you detect overfitting, where the network memorizes training examples instead of learning general patterns.

The separate 10,000 test images remain completely untouched during training. You only use them at the very end to get an unbiased estimate of your network's performance on truly new data.

One more preparation step involves encoding the target labels. If you're trying to predict digit 7, you don't just feed the network the number 7. Instead, you use one-hot encoding: a vector with 10 elements, all zeros except for a 1 in the position corresponding to the correct digit. For 7, that would be [0, 0, 0, 0, 0, 0, 0, 1, 0, 0].

This encoding makes it easier to train the network and interpret its outputs. Each output neuron learns to predict the probability that the input image represents its corresponding digit.
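One-hot encoding takes only a couple of lines; the example reproduces the vector for digit 7 shown above.

```python
import numpy as np

def one_hot(label, num_classes=10):
    # All zeros except a 1 at the position of the correct digit.
    encoded = np.zeros(num_classes)
    encoded[label] = 1.0
    return encoded

print(one_hot(7))  # [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
```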

Implementing the Network Architecture

When building an MLP from scratch, you need to decide on the network architecture. How many hidden layers should you use? How many neurons in each layer?

For MNIST, a single hidden layer with 50 neurons works reasonably well. Your network architecture looks like this: 784 input neurons (one per pixel), 50 hidden neurons, and 10 output neurons (one per digit class).

This means you need two weight matrices. The first connects input to hidden layer: 784 inputs times 50 hidden neurons gives you 39,200 weights. The second connects hidden to output: 50 hidden times 10 output neurons gives you 500 weights. You also need bias vectors: 50 biases for the hidden layer and 10 for the output layer.

These weights start with small random values drawn from a normal distribution. Starting with random weights ensures that neurons learn different patterns. If all weights started at the same value, all neurons would learn identically, and your network would fail to capture complex patterns.

The bias values typically start at zero. Unlike weights, having identical initial biases doesn't prevent neurons from learning different patterns.
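Here's what that initialization might look like in NumPy; the standard deviation of 0.1 is an illustrative choice, not a prescribed value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hidden-layer weights: 784 inputs x 50 neurons = 39,200 small random values,
# so that different neurons start out different and learn different patterns.
w1 = rng.normal(loc=0.0, scale=0.1, size=(784, 50))
b1 = np.zeros(50)   # 50 hidden biases, starting at zero

# Output-layer weights: 50 hidden neurons x 10 outputs = 500 values.
w2 = rng.normal(loc=0.0, scale=0.1, size=(50, 10))
b2 = np.zeros(10)   # 10 output biases
```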

The Training Loop

Training your MLP involves iterating through the following steps repeatedly. First, you shuffle your training data. Shuffling ensures that mini-batches contain a random mix of examples, preventing the network from learning spurious patterns based on data order.

Next, you divide the shuffled data into mini-batches. For each mini-batch, you perform forward propagation to generate predictions, compute the loss, and then run backpropagation to calculate gradients.

With gradients in hand, you update all weights and biases. Each parameter moves slightly in the direction that reduces the loss. The learning rate determines the step size.

After processing all mini-batches (completing one epoch), you evaluate the network's performance on the validation set. This gives you two key metrics: loss and accuracy. The validation loss tells you how far off the predictions are. The validation accuracy tells you what percentage of validation images the network classified correctly.

You monitor these metrics across epochs. Ideally, both training and validation performance improve together. If training performance improves but validation performance plateaus or worsens, your network is overfitting—memorizing training data instead of learning generalizable patterns.

For MNIST with our simple MLP, training for 50 epochs with a learning rate of 0.1 and mini-batches of 100 examples works well. Training takes just a few minutes on a regular computer, and the network achieves around 95% accuracy on the test set.
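Putting the earlier sketches together, the whole training loop fits in a handful of lines. This skeleton assumes the forward, backprop_single, and iterate_minibatches functions and the w1, b1, w2, b2 parameters from the previous sections, plus placeholder arrays X_train, Y_train, X_val, and y_val_labels holding the prepared MNIST data.

```python
import numpy as np

n_epochs, batch_size, learning_rate = 50, 100, 0.1

for epoch in range(n_epochs):
    # One epoch: shuffle, then step through the data one mini-batch at a time.
    for X_batch, Y_batch in iterate_minibatches(X_train, Y_train, batch_size):
        grads = [backprop_single(x, y, w1, b1, w2, b2)
                 for x, y in zip(X_batch, Y_batch)]
        # Average each gradient over the batch, then take one descent step.
        gw1, gb1, gw2, gb2 = (np.mean(g, axis=0) for g in zip(*grads))
        w1 -= learning_rate * gw1
        b1 -= learning_rate * gb1
        w2 -= learning_rate * gw2
        b2 -= learning_rate * gb2

    # Validation accuracy: fraction of held-out images whose argmax matches the label.
    _, val_outputs = forward(X_val, w1, b1, w2, b2)
    accuracy = np.mean(val_outputs.argmax(axis=1) == y_val_labels)
    print(f"epoch {epoch + 1}: validation accuracy {accuracy:.3f}")
```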

That means the network correctly identifies 95 out of every 100 handwritten digits it has never seen before. Not bad for a relatively simple architecture!

Understanding the Learned Patterns

After training, it's insightful to examine which digits the network struggles with. Often, you'll find that sevens with horizontal lines get confused with ones. Nines that look like fours cause problems. These are the same digits that confuse humans.

The network hasn't been explicitly programmed with rules about digit shapes. Instead, through repeated exposure to examples and continuous adjustment of its weights, it has learned statistical patterns that distinguish digits. The hidden layer neurons have learned to detect features like curves, lines, and corners that are diagnostic of particular digits.

This is the power of neural networks. Rather than hand-crafting rules (if there's a vertical line, it might be a one; if there's a loop on top, it might be a nine), you provide examples and let the network discover the relevant patterns through training.

The Challenge of Optimization

Training neural networks is harder than training simpler models like logistic regression. The loss function for a neural network is not convex. This means it has many local minima—points where the loss is lower than nearby points but not the absolute lowest possible.

Imagine a mountainous landscape where you're trying to find the lowest valley. In a convex optimization problem, there's only one valley, and any downhill direction eventually leads you there. In neural network optimization, there are many valleys, some deeper than others.

Starting from random initial weights, gradient descent might lead you to a shallow valley rather than the deepest one. This is why techniques like mini-batch training help. The randomness in mini-batch gradients can help the optimization process escape shallow local minima and find better solutions.

The learning rate plays a crucial role here too. A higher learning rate means larger steps, which can help jump out of local minima but might also overshoot good solutions. A lower learning rate takes smaller, more careful steps but might get stuck in the first local minimum it encounters.

Finding the right learning rate often requires experimentation. Too high and training becomes unstable, with loss bouncing around erratically. Too low and training becomes painfully slow, potentially stalling before finding a good solution.

Why Build from Scratch?

You might wonder why anyone would build a neural network from scratch when excellent libraries like PyTorch and TensorFlow exist. These libraries handle all the complex details automatically and run much faster by optimizing computations and leveraging GPUs.

The answer is understanding. Building an MLP from scratch, which means implementing forward propagation by hand and working through the backpropagation equations yourself, demystifies neural networks. You move from thinking of them as magical black boxes to seeing them as understandable systems built from simple components.

When you later use PyTorch or TensorFlow, you'll understand what's happening under the hood. You'll make better architectural choices, diagnose training problems more effectively, and appreciate the elegance of these tools.

It's similar to why computer science students learn to implement sorting algorithms even though they'll use built-in sorting functions in practice. The exercise builds fundamental understanding that pays dividends throughout your career.

Beyond This Simple Network

The MLP we've discussed is just the beginning. Modern neural networks build on these same principles but add sophisticated refinements.

Convolutional neural networks add specialized layers that detect visual patterns while respecting the spatial structure of images. Recurrent neural networks add loops that allow them to process sequences of varying length, like sentences or time series. Transformers use attention mechanisms to process information more flexibly.

Yet all of these architectures rely on the same fundamental concepts: layers of neurons, weighted connections, non-linear activations, loss functions, and backpropagation. Master the basics with an MLP, and you have the foundation to understand virtually any neural network architecture.

Takeaways

Let's recap the essential concepts. A Multi-Layer Perceptron is a type of feedforward neural network with at least one hidden layer between input and output. Information flows forward through weighted connections, with each neuron applying a non-linear activation function to produce its output.

Training uses backpropagation to calculate how each weight contributes to the prediction error, then gradient descent to adjust weights to reduce that error. Mini-batch training processes multiple examples at once, balancing efficiency with stability.

The MNIST dataset provides an ideal playground for learning these concepts. It's challenging enough to require a proper neural network, yet simple enough to train quickly and understand deeply.

Building an MLP from scratch reveals that neural networks, despite their complexity and power, are built from understandable components following logical principles. The "magic" is really just mathematics, implemented carefully and trained with patience.

Moving Forward

If you've followed along this far, you now understand how neural networks learn from data. The next step in your journey might be implementing this yourself. Try coding a simple MLP from scratch, train it on MNIST, and watch its accuracy improve over epochs. You'll find a guided MNIST implementation on my GitHub page.

Neural networks have transformed artificial intelligence over the past decade. They power everything from image recognition to language translation to game-playing systems that beat human champions. All of these applications build on the same fundamental principles you've learned today.

The field continues evolving rapidly, with new architectures and training techniques emerging constantly. But the foundation remains unchanged: layers of neurons, forward propagation, loss calculation, and backpropagation. Master these concepts, and you're prepared to understand and apply whatever comes next in this exciting field. Be curious.
