twitter: https://x.com/kabir_j25
The main idea behind Dropout is to bring regularization into the training process.
We know that model averaging (bagging an ensemble of models) is a solid strategy to tackle overfitting. But when it comes to training a bunch of big, heavy neural networks to form an ensemble, it's a whole other story: it gets insanely expensive!
Here are the usual options: train several networks with different architectures, or train the same architecture on different subsets of the data. Either way, every member of the ensemble needs its own expensive training run and hyperparameter tuning.
Even if you somehow pull this off, combining the predictions of all these models at test time is just not practical for real-time use.
Dropout is a technique that tackles both of these issues. It can be thought of as a way of making bagging practical for ensembles of very many large neural networks. It lets you train a huge number of networks without crushing your computational resources, and it offers an efficient way to approximate combining all of them at test time.
Dropout refers to dropping out units during training.
This means temporarily removing a node along with all of its incoming and outgoing connections. What's left is a thinned network, which includes only the nodes that survive the dropout process. Each node is retained with a fixed probability $p$, independently of the other nodes.
Importantly, this sampling is done only for the input and hidden layers. Dropping output nodes isn't an option, since it would make predictions and loss calculations impossible.
How does it work? Simple: when a node is dropped, its output value is multiplied by 0, effectively removing it from the network.
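To make that concrete, here is a minimal NumPy sketch of training-time dropout applied to one layer's activations. The function name `dropout_forward` and the keep probability of 0.5 are illustrative choices, not something prescribed by the post.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, keep_prob=0.5):
    """Training-time dropout: zero each unit independently with probability 1 - keep_prob."""
    # Sample a Bernoulli(keep_prob) mask with one entry per unit; 1 = keep, 0 = drop.
    mask = rng.binomial(n=1, p=keep_prob, size=activations.shape)
    # Dropped units are multiplied by 0, so they contribute nothing downstream.
    return activations * mask

# Toy activations for a batch of 2 examples and 4 hidden units
# (dropout is applied to input/hidden units only, never to the output layer).
h = np.array([[0.7, -1.2, 0.3, 2.0],
              [1.5,  0.4, -0.8, 0.1]])
print(dropout_forward(h, keep_prob=0.5))
```

Each forward pass samples a fresh mask, so every mini-batch effectively trains a different thinned subnetwork.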

Figure 1: An ensemble of thinned networks. Since $n = 4$ here, there can be at most $2^4 = 16$ thinned networks (subnetworks). Ref: https://www.deeplearningbook.org/
<aside> 💡
Given a total of $n$ nodes, what is the total number of **thinned networks** (sampled networks) that can be formed? $2^n$
</aside>
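A quick way to see the $2^n$ count: every thinned network corresponds to a binary keep/drop mask over the $n$ nodes, so enumerating the masks enumerates the subnetworks. A tiny illustrative check with $n = 4$, matching Figure 1:

```python
from itertools import product

n = 4  # number of droppable nodes, as in Figure 1
# Every node is either kept (1) or dropped (0), so each length-n binary mask
# picks out exactly one thinned subnetwork.
masks = list(product([0, 1], repeat=n))
print(len(masks))  # 2**4 = 16 possible thinned networks
```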